Problem parsing content from medium.com

boncey commented 10 years ago

I've found that ruby-readability seems to have problems with a couple of blog posts on Medium (it may be more, I've only tested two).

https://medium.com/our-addictions/ae81e19b0289 & https://medium.com/on-product-management/926ab5c39156.

ruby-readability seems to see the various paragraphs (divided by <hr class="section-divider">) as separate sections and then picks the one with the highest score.

readability --debug https://medium.com/our-addictions/ae81e19b0289
Top 5 candidates:
 Candidate div#.section-inner layout-single-column with score 45.0
 Candidate div#.section-inner layout-single-column with score 42.0
 Candidate div#.section-inner layout-single-column with score 42.0
 Candidate div#.section-inner layout-single-column with score 40.0
 Candidate div#.section-inner layout-single-column with score 37.0
 Best candidate div#.section-inner layout-single-column with score 45.0

It then just shows the text from what looks like the longest paragraph.

Is there likely to be an easy fix for things like this - or some way of working around it for specific sites? Instapaper used to have a custom parser with user-contributed rules (although it seems to have gone away since it was sold). Have there been any thoughts about doing that sort of thing for ruby-readability at all?

Thanks, Darren.

papriwalprateek commented 10 years ago

This is a very important issue in ruby-readability actually.

I think there should be a score threshold that decides which candidates are selected. For example, with a best candidate score of 45, every candidate within the threshold (say 10) of that score should be included. Thus all candidates with a score greater than 35 (45 - 10) would be selected.
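
A minimal sketch of that threshold idea, in plain Ruby. The `Candidate` struct and `candidates_within_margin` helper are hypothetical names made up for illustration, not part of ruby-readability; the scores are the ones from the --debug output above.

```ruby
# Hypothetical candidate record; not a ruby-readability class.
Candidate = Struct.new(:name, :score)

# Keep every candidate whose score is strictly greater than (best - margin).
def candidates_within_margin(candidates, margin)
  return [] if candidates.empty?

  best = candidates.map(&:score).max
  candidates.select { |c| c.score > best - margin }
end

# Scores taken from the --debug output quoted earlier in this thread.
candidates = [45.0, 42.0, 42.0, 40.0, 37.0].each_with_index.map do |score, i|
  Candidate.new("section-#{i}", score)
end

selected = candidates_within_margin(candidates, 10)
```

With a margin of 10 all five Medium sections would be kept, since every score is above 35. The selected nodes would still need to be joined in document order, which this sketch does not attempt.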

cantino commented 10 years ago

Hey @boncey, I'd love to see custom parsers, but I don't have the time to maintain a repository of them. If anyone is aware of other libraries that support this, I'll gladly link to them. Also happy to merge contributions and code improvements!

cantino commented 10 years ago

I would definitely accept a pull request (with specs) that adds this feature.

On Tuesday, February 11, 2014, naotohc notifications@github.com wrote:

Oh, I didn't understand that the ref from private repo is linked =< .


ghost commented 10 years ago

Hello, yeah, maybe I will, if I decide it's really an issue for the service I'm developing.

glaszig commented 10 years ago

i don't think dedicated parsers are the way to go. rather the scoring system needs a little tuning.

the issue with medium.com (the first example from above) is that the main post is nested inside multiple structural containers, which makes the parent/grandparent scoring useless: a great-grandparent or great-great-grandparent would be needed. but this'll get ridiculous.

i'm thinking about scoring nodes based on inner_text length and id/class attributes (status quo), then looking at which parent node -- up until "a root node" -- contains the most highly scored nodes, and returning this container node (cleaned up) as the content.

makes sense?
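
a rough sketch of one way to read that idea, in plain ruby. everything here is an assumption for illustration: `Node` is a stand-in for a real Nokogiri node, `own_score` only looks at text length (no id/class weighting), and "contains the most highly scored nodes" is interpreted as "the deepest node whose subtree still holds at least 75% of the total score". none of these names exist in ruby-readability.

```ruby
# Hypothetical minimal DOM stand-in; a real version would walk Nokogiri nodes.
class Node
  attr_reader :name, :text, :children

  def initialize(name, text: "", children: [])
    @name = name
    @text = text
    @children = children
  end
end

# Assumed, simplified heuristic: one point per 100 characters of direct text.
def own_score(node)
  node.text.length / 100.0
end

# Score of a node plus the scores of all of its descendants.
def subtree_score(node)
  own_score(node) + node.children.sum { |child| subtree_score(child) }
end

# Descend from the root while a single child still holds at least `share`
# of the whole document's score; return the last node for which that held.
def best_container(root, share: 0.75)
  total = subtree_score(root)
  return root if total.zero?

  node = root
  loop do
    dominant = node.children.find { |child| subtree_score(child) >= share * total }
    return node unless dominant

    node = dominant
  end
end

# A Medium-like layout: the article is split into sibling sections
# buried two containers deep.
sections = [400, 300, 350].each_with_index.map do |len, i|
  Node.new("div.section-inner-#{i}", text: "x" * len)
end
wrapper = Node.new("div.wrapper", children: sections)
post = Node.new("div.post", children: [wrapper])
page = Node.new("body", children: [
  Node.new("nav", text: "x" * 50),
  post,
  Node.new("footer", text: "x" * 80)
])

container = best_container(page)
```

here no single section reaches the 75% share, so the walk stops at the wrapper that holds all three sections instead of picking the longest one. the 0.75 cutoff is arbitrary and would need tuning against real pages.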

cantino commented 10 years ago

We can play with it. Maybe pluggable scoring systems? This library's scoring is based on the original Readability library, so that should likely stay the default scoring system?
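
A minimal sketch of what a pluggable scorer could look like, assuming a scorer is any object that responds to `call(text)` and returns a number. `Extractor`, `LENGTH_SCORER`, and `COMMA_SCORER` are hypothetical names for illustration only; this is not the library's current API, and neither scorer is claimed to match the original Readability scoring.

```ruby
# Hypothetical scorers: anything responding to #call(text) and returning a Numeric.
LENGTH_SCORER = ->(text) { text.length / 100.0 }
COMMA_SCORER  = ->(text) { 1 + text.count(",") }

# Hypothetical extractor that delegates scoring to whatever scorer it is given.
class Extractor
  def initialize(scorer: LENGTH_SCORER)
    @scorer = scorer
  end

  # Return the text block with the highest score under the configured scorer.
  def best(blocks)
    blocks.max_by { |block| @scorer.call(block) }
  end
end

blocks = ["a, b, c, d", "x" * 500]

by_length = Extractor.new.best(blocks)
by_commas = Extractor.new(scorer: COMMA_SCORER).best(blocks)
```

Keeping the default as a keyword argument would let the existing scoring stay the default while callers opt in to alternatives.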

pagojo commented 10 years ago

On 26 Apr 2014, at 00:34, Andrew Cantino wrote:

We can play with it. Maybe pluggable scoring systems? This library's scoring is based on the original Readability library, so that should likely stay the default scoring system?

Yes, however we somehow need to clearly define what this 'original readability scoring system' is.

Another such scoring system would be one which could extract just the comments of a page.

/pagojo


kovson commented 8 years ago

Problem still exists, are there any possible solutions?

cantino commented 8 years ago

I'm not actively using this library anymore, so I'm not working on further development. I'm very open to pull requests and improvements, however, and would happily add another maintainer if someone wants to get involved and push it forward.

cantino / ruby-readability

Problem parsing content from medium.com #62