Open thom4parisot opened 9 years ago
I also tried it on the browser-rendered HTML content, and got this instead:
<div><div><p>The Guardian’s picture editors bring you a selection of the best photographs from around the world, including commemorations in Paris and Jerusalem, a bus strike in London, and the Makar Sankranti festival in India </p> </div></div>
Hey @oncletom. Readability is heuristic-based, so while it works on many (most?) sites, it doesn't work in every single case. You can try to tweak the algorithm's parameters and see if you find a configuration that works better for you. There is also https://readability.com/developers/api
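To illustrate what "heuristic-based" means in practice, here is a toy sketch in Python. It is not Readability's actual algorithm: the scoring rule (one point per comma, one per 100 characters) and the `min_score` parameter are made up for this example. It only shows how a tunable threshold decides which paragraphs survive extraction, and why the same settings can work on one page and fail on another.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element (toy version, ignores nesting)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Only text found inside a <p> is kept.
        if self._in_p:
            self.paragraphs[-1] += data


def score(text):
    # Hypothetical heuristic: one point per comma, one per 100 characters.
    return text.count(",") + len(text) // 100


def extract(html, min_score=1):
    """Return paragraphs whose toy score reaches min_score (a made-up parameter)."""
    collector = ParagraphCollector()
    collector.feed(html)
    return [p.strip() for p in collector.paragraphs if score(p) >= min_score]


sample = (
    "<div><p>Short caption</p>"
    "<p>A longer sentence, with commas, that scores higher.</p></div>"
)
kept_default = extract(sample)               # threshold drops the short caption
kept_relaxed = extract(sample, min_score=0)  # relaxed threshold keeps both
```

With the default threshold only the longer paragraph is kept; lowering `min_score` to 0 keeps the caption too. Tweaking the real library's parameters has a similar kind of effect, but the parameter names and scoring rules there are different from this sketch.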
Hello,
I tried to apply Readability to a specific layout of The Guardian, which relies heavily on JavaScript but still has most of the text available in the HTML source code:
Readability returned this chunk of HTML:
Do you know why the main content is not properly extracted, and whether it is fixable?