cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

TheGuardian.com image galleries are not picked up #78

Open thom4parisot opened 9 years ago

thom4parisot commented 9 years ago

Hello,

I tried running Readability on a specific layout of The Guardian, which relies heavily on JavaScript but still has most of the text available in the HTML source code:

http://www.theguardian.com/football/gallery/2014/sep/10/memory-lane-1980s-footballers-at-home-in-pictures

Readability returned this chunk of HTML:

<div><div> comments <p>Sign in or create your Guardian account to join the discussion. </p> <p>This discussion is closed for comments.</p> <p> We’re doing some maintenance right now. You can still read comments, but please come back later to add your own. </p> <p> Commenting has been disabled for this account (why?) </p> </div></div>

Do you know why the main content is not extracted properly, and whether it is fixable?

thom4parisot commented 9 years ago

I tried again on the browser-rendered HTML content, and I got this instead:

<div><div><p>The Guardian’s picture editors bring you a selection of the best photographs from around the world, including commemorations in Paris and Jerusalem, a bus strike in London, and the Makar Sankranti festival in India  </p> </div></div>

cantino commented 9 years ago

Hey @oncletom. Readability is heuristic-based, so while it works on many (most?) sites, it doesn't work in every single case. You can try to tweak the algorithm's parameters and see if you find a configuration that works better for you. There is also https://readability.com/developers/api
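
One way to explore the parameter space is to run the same page through several option sets and compare the results. The following is only a rough sketch: `best_extraction`, `visible_text_length` and `OPTION_SETS` are hypothetical names (not part of ruby-readability), the option keys are the ones I recall from the gem's README and should be checked against your installed version, and "most visible text wins" is an assumed scoring rule that may well pick the wrong candidate on some pages.

```ruby
# Approximate the amount of visible text in an HTML fragment by
# stripping tags and collapsing whitespace (a crude proxy, not a parser).
def visible_text_length(html_fragment)
  html_fragment.to_s.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip.length
end

# Hypothetical helper: run `extractor` once per option set and return the
# { options:, content: } pair whose content has the most visible text.
# `extractor` is any callable taking (html, options) and returning HTML.
def best_extraction(html, option_sets, extractor)
  option_sets
    .map { |options| { options: options, content: extractor.call(html, options) } }
    .max_by { |result| visible_text_length(result[:content]) }
end

# Candidate configurations. Key names are assumed from the gem's README;
# the values are arbitrary starting points, not recommendations.
OPTION_SETS = [
  {},
  { remove_unlikely_candidates: false },
  { min_text_length: 10, retry_length: 100 },
  { tags: %w[div p img a figure figcaption],
    attributes: %w[src href],
    remove_empty_nodes: false }
].freeze

# With the gem installed, the extractor could be wired up like this:
#   require "readability"
#   extractor = ->(html, options) { Readability::Document.new(html, options).content }
#   best = best_extraction(File.read("gallery.html"), OPTION_SETS, extractor)
#   puts best[:options].inspect
#   puts best[:content]
```

Passing the extractor in as a callable keeps the comparison logic independent of the gem, so the scoring rule can be swapped (for example, to count `img` tags on gallery pages) without touching the rest.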