Open louismullie opened 12 years ago
Similar behaviour occurs with the following HTML (http://www.economist.com/node/21548244):
<h2 class="fly-title">Campaign finance</h2>
<h3 class="headline">The hands that prod, the wallets that feed</h3>
<h1 class="rubric">Super PACs are changing the face of American politics. </h1>
None of the H1, H2, or H3 tags get retrieved, even when I specify them in the :tags option.
What does the JS version of Readability do on those pages?
If that's what you mean, the Readability API correctly parses the pages: http://www.readability.com/articles/urlh3i3g, http://www.readability.com/articles/l2exnq9u.
If you can point me toward the right direction in the code, I can make a patch and I'll send you a pull request.
They must have revised the Readability code since I last ported it. You'll need to walk through the JavaScript and compare it to what the Ruby is doing. I'm not actively using ruby-readability in any current projects, so I haven't had time to do this myself. It'd be excellent if you want to give it a shot.
Alrighty, I'll see what I can do when I have some time.
This seems to have more to do with where ruby-readability decides the content of the page lies than with which tags it is accepting.
Try
require 'open-uri'
require 'readability'
source = URI.open('http://en.wikipedia.org/wiki/Frimley_Green_Windmill').read
puts Readability::Document.new(source, tags: ['h1', 'h2', 'p', 'div']).content # added 'h2'
and you will see the h2s from that page.
I haven't dug into the source enough to see why, but it doesn't seem to be a headline issue at least. The markup on the economist page doesn't seem super helpful to a generic library like this. I wonder how they do it now (where they are catching these)...
The problem appears when h1 elements sit outside the best candidate. This is an example:
<div id="container">
  <div id="article">
    <h1>Main title</h1>
    <div id="content">
      <h2>Section title</h2>
      <p>content</p>
      <p>content</p>
      <h2>Section title</h2>
      <p>content</p>
    </div>
  </div>
</div>
The #content element will always have a better score than #article because it always has a higher link density (same number of links, less content). The h1 in #article will thus never be included in the result. This confirms the idea of @pferdefleisch.
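To make the "h1 outside the best candidate" situation concrete, here is a toy model of candidate selection for the markup above. It is only a sketch, not ruby-readability's actual code: `Node`, `node`, `score_candidates` and `tag_names` are hypothetical names, link density is left out entirely, and the scoring rule (each paragraph credits its parent, and its grandparent with half as much) is an assumption loosely based on the classic Readability heuristic.

```ruby
# Toy model only: Node, node, score_candidates and tag_names are
# hypothetical names, not part of ruby-readability.
Node = Struct.new(:name, :id, :text, :children, :parent) do
  # Attach a child and return self so calls can be chained.
  def add(child)
    child.parent = self
    children << child
    self
  end
end

def node(name, id: nil, text: '')
  Node.new(name, id, text, [], nil)
end

# Assumed heuristic: every <p> credits its parent with some points
# and its grandparent with half as many.
def score_candidates(root, scores = Hash.new(0.0))
  if root.name == 'p' && root.parent
    points = 1.0 + root.text.length / 100.0
    scores[root.parent.id] += points
    grandparent = root.parent.parent
    scores[grandparent.id] += points / 2.0 if grandparent
  end
  root.children.each { |child| score_candidates(child, scores) }
  scores
end

# All tag names in a subtree, including the root's own.
def tag_names(root)
  [root.name] + root.children.flat_map { |child| tag_names(child) }
end

CONTENT_NODE = node('div', id: 'content')
               .add(node('h2', text: 'Section title'))
               .add(node('p', text: 'content'))
               .add(node('p', text: 'content'))
               .add(node('h2', text: 'Section title'))
               .add(node('p', text: 'content'))
ARTICLE_NODE = node('div', id: 'article')
               .add(node('h1', text: 'Main title'))
               .add(CONTENT_NODE)
PAGE = node('div', id: 'container').add(ARTICLE_NODE)

SCORES = score_candidates(PAGE)
BEST_ID = SCORES.max_by { |_id, score| score }.first

puts BEST_ID                                 # prints "content"
puts tag_names(CONTENT_NODE).include?('h1')  # prints "false"
```

Under these assumptions #content wins, and since the h1 is a sibling of #content rather than a descendant, it cannot appear in the extracted result no matter which tags are whitelisted.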
A possible solution may be to increase the score of an element if it contains many non-excluded elements. This will increase the score of the #article element because it will include strictly more accepted tags than #content.
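A rough sketch of that idea, under stated assumptions: the tree is a hand-built stand-in for the example markup, the base scores are made-up placeholder values, and `accepted_count`, `boosted_score`, `winner` and the `weight` parameter are hypothetical names that do not exist in ruby-readability.

```ruby
# Hypothetical sketch; none of these names exist in ruby-readability.
ACCEPTED_TAGS = %w[h1 h2 p].freeze

# Each node is [tag, id, children], mirroring the example markup.
CONTENT_DIV = ['div', 'content', [
  ['h2', nil, []], ['p', nil, []], ['p', nil, []],
  ['h2', nil, []], ['p', nil, []]
]].freeze
ARTICLE_DIV = ['div', 'article', [['h1', nil, []], CONTENT_DIV]].freeze

# Placeholder base scores (assumed values, not measured output).
BASE_SCORES = { 'content' => 3.0, 'article' => 1.5 }.freeze

# Count descendants whose tag is in the accepted list.
def accepted_count(node, tags)
  _tag, _id, children = node
  children.sum do |child|
    (tags.include?(child[0]) ? 1 : 0) + accepted_count(child, tags)
  end
end

# Base score plus a bonus proportional to the accepted descendants.
def boosted_score(base_score, node, tags, weight: 1.0)
  base_score + weight * accepted_count(node, tags)
end

# Id of the candidate with the highest boosted score.
def winner(weight)
  { 'content' => CONTENT_DIV, 'article' => ARTICLE_DIV }.max_by do |id, node|
    boosted_score(BASE_SCORES[id], node, ACCEPTED_TAGS, weight: weight)
  end.first
end

puts winner(1.0) # prints "content"
puts winner(2.0) # prints "article"
```

In this toy setup #article only gains one more accepted tag than #content (the h1), so the bonus weight has to be large enough to cover the gap between the base scores. That is probably the part that would need tuning against the existing specs.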
That's interesting. If you want to propose a pull request, that seems like a reasonable solution unless it breaks a lot of specs/behaviors.
I have been experimenting with the gem to retrieve content from Wikipedia pages, but it seems that the H1 tags get lost during the process of text extraction:
Output:
This is missing the only h1 tag on the page.
I have experienced the same quirk with all Wikipedia pages. Any idea what could be causing this?