fb55 / readabilitySAX

a fast and platform independent readability port (JS)
BSD 2-Clause "Simplified" License

Article images need better detection #29

Open mrjjwright opened 12 years ago

mrjjwright commented 12 years ago

Safari Reader sometimes does a better job of keeping article images that readabilitySAX filters out. Here is an example: http://hommemaker.com/2012/08/20/why-the-gays-hate-their-bodies/. Compare the Safari Reader rendering with readabilitySAX. In this case readabilitySAX should preserve images that are wrapped in an `a` parent and a `p` grandparent tag. A general rule might be that a single image of sufficient size is a candidate, no matter how many tags wrap it. There is probably a better general rule; that is just my take on it.

fb55 commented 12 years ago

The problem is that banner ads are often big images inside an `a` tag. I was really annoyed by the number of banner ads I got, so I added this rule. In retrospect, it looks a bit harsh.

In terms of results, the ideal solution would be to keep a list of ad rules and check every image against it (similar to Adblock Plus). But this would probably hurt performance badly, and the list would need to be updated quite often.
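
A minimal sketch of what such a list-based check could look like. The `AD_PATTERNS` list and the `isLikelyAd` name are hypothetical and not part of readabilitySAX; a real Adblock Plus-style list would be far larger, which is where the performance and maintenance concerns come from.

```javascript
// Hypothetical blocklist: a few illustrative patterns only.
// A real filter list would contain thousands of rules and need regular updates.
const AD_PATTERNS = [/\/ads?\//i, /doubleclick\.net/i, /banner/i];

// Returns true if the image URL matches any pattern in the list.
function isLikelyAd(src) {
  return AD_PATTERNS.some((pattern) => pattern.test(src));
}
```

Every image URL is tested against every pattern, so the cost grows with both the number of images and the size of the list.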

Another option would be to filter images based on their aspect ratio. But not all images have their width & height specified, which complicates this.
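
A rough sketch of the aspect-ratio idea, assuming the `img` attributes arrive as a plain object the way a SAX parser reports them. The `classifyImage` name and the `maxRatio` threshold of 3 are assumptions for illustration, not values taken from readabilitySAX.

```javascript
// Hypothetical helper, not part of readabilitySAX.
// attribs: plain object holding an <img> tag's attributes.
// maxRatio: assumed cutoff; anything more elongated is treated as a banner.
function classifyImage(attribs, maxRatio = 3) {
  const width = parseInt(attribs.width, 10);
  const height = parseInt(attribs.height, 10);

  // Dimensions missing or unusable: the ratio cannot be computed.
  if (!(width > 0) || !(height > 0)) return "unknown";

  // Use the larger of the two ratios so tall skyscraper ads are caught too.
  const ratio = Math.max(width / height, height / width);
  return ratio > maxRatio ? "banner" : "content";
}
```

The `"unknown"` result is the complication mentioned above: when `width` or `height` is not specified, the caller still has to decide whether to keep or drop the image.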

I guess we'll have to live with either banner ads or missing images. Missing images might hamper the understanding of an article, while banner ads can be ignored. The choice seems to be pretty obvious, so I'll change the behavior of readabilitySAX soon.

kof commented 10 years ago

The easiest way is to make it optional and to provide optional middleware for image size detection. I can share my private code, which does this, as an example. Reading just a few bytes from each image makes it relatively fast.
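
To illustrate why a few bytes are enough, here is a sketch for the PNG case only. It is not the private code mentioned above; the `pngSize` name is hypothetical, and fetching the leading bytes (for example with a ranged HTTP request) is left out. In a PNG file the 8-byte signature is followed by the `IHDR` chunk, so the width and height sit at byte offsets 16 and 20.

```javascript
// Hypothetical helper: reads PNG dimensions from the first 24 bytes of a file.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// head: Buffer holding at least the first 24 bytes of the image.
// Returns { width, height }, or null if the data is not a recognizable PNG header.
function pngSize(head) {
  if (head.length < 24) return null;
  if (!head.subarray(0, 8).equals(PNG_SIGNATURE)) return null;
  // Bytes 8-11 are the chunk length, bytes 12-15 the chunk type.
  if (head.toString("ascii", 12, 16) !== "IHDR") return null;
  return {
    width: head.readUInt32BE(16),
    height: head.readUInt32BE(20),
  };
}
```

Other formats need their own parsing: GIF also keeps its dimensions near the start of the file, while JPEG may require scanning further for a frame marker.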