cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0
925 stars 170 forks source link

noscript tags which include images are ignored? #53

Closed pagojo closed 10 years ago

pagojo commented 10 years ago

The JS Readability seems to work fine with this kind of code, whereas the Ruby version doesn't.

The following snippet was lifted from AppleInsider

<div class="article-img">
<img src="http://photos.appleinsider.com/v9/images/1x1-white.jpg" width="660" height="317" alt="iPhone Plus" class="lazy" data-original="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"><noscript><img src="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"></noscript>
</div>

To me it seems that Ruby Readability should detect the noscript images and use them in its heuristics (modulo issue #51). Currently it doesn't

pagojo commented 10 years ago

Test case above actually may be due to #51. However the noscript argument remains.

<div align="center">
<div class="article-img">
<img src="http://photos.appleinsider.com/v9/images/1x1-white.jpg" width="660" height="317" alt="iPhone Plus" class="lazy" data-original="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"><noscript><img src="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"></noscript>
</div>
<br><span class="minor2 small gray">Mockup of iPhone with 4.94-inch screen, <a href="http://www.marco.org/2013/01/31/iphone-plus-speculation">created by</a> Marco Arment.</span>
</div>
pagojo commented 10 years ago

worse, there is double counting going on in clean_conditionally(node, candidates, selector) as img is counted both in the script/noscript cases inside a div. this skews the numbers.

pagojo commented 10 years ago

closed in https://github.com/cantino/ruby-readability/commit/a59ebd76d3e3166059b86c34e71aba199ec92925