fb55 / readabilitySAX

a fast and platform independent readability port (JS)
BSD 2-Clause "Simplified" License
245 stars 36 forks source link

Handling double images #30

Open mrjjwright opened 12 years ago

mrjjwright commented 12 years ago

Many sites have an enlarge javascript function to see a larger image. E.g. NPR often does this: http://www.npr.org/blogs/therecord/2012/08/22/159534467/my-american-dream-sounds-like-black-star?sc=fb&cc=fmp.

Safari Reader and Readability manage to filter out the double images somehow but readabilitySAX is showing the same image twice at the top of the article. I am trying to figure out how to fix this but thought I would post an issue until I figure it out.

fb55 commented 12 years ago

That's the nice thing when you've got a rendered page: You may simply remove everything that isn't visible (for jQuery users: $(":not(:visible)").remove()).

I can't tell by looking at the markup which parts will be hidden, so I'm closing this as I don't see a way to fix it.

mrjjwright commented 12 years ago

Well what about this in onclosetag? The idea here would be to only allow one image per level in the readable article. I tested this and it works well on NPR articles except it doesn't remove duplicate captions.

if (tagName === 'img') {
  // don't double up on showing  images
  if (this._imgElem) {
    elem.parent.isCandidate = false;
    // skip the img
    return;
}
  else this._imgElem = elem.parent;   
}

if (this._imgElem === elem) {
  this._imgElem = null;
}
fb55 commented 12 years ago

This would solve this particular problem, but would raise others: Just think about image galeries. And I've seen websites that wrapped their content in <pre> tags, with image tags at random locations. I don't want to break these, even though their markup is terrible. But you may argue the same for npr.org.

Besides, isCandidate only signals that an element is a candidate for being the container of the article ;)

fb55 commented 12 years ago

Okay, first, you should use a label for this code. The boolean isn't needed and complicates everything. Besides, you remove all nodes that aren't <img>s, which isn't such a good idea.

I guess the best variant would be to filter images based on their URLs, in a similar way as links are already handled. When you find some usable rules for comparing URLs, that would be great.

mrjjwright commented 12 years ago

I got rid of that code.

I instead keep track of the largest image within each div and if the div is too low on contentLength simply use that image as the div's children. This is taking care of enlarge divs and the like. I filter out anything with slideshow explicitly. It's working well.

On Aug 24, 2012, at 5:32 AM, Felix Böhm notifications@github.com wrote:

Okay, first, you should use a label for this code. The boolean isn't needed and complicates everything. Besides, you remove all nodes that aren't s, which isn't such a good idea.

I guess the best variant would be to filter images based on their URLs, in a similar way as links are already handled. When you find some usable rules for comparing URLs, that would be great.

— Reply to this email directly or view it on GitHub.