boncey closed this issue 5 years ago
This is a very important issue in ruby-readability actually.
I think there should be a threshold variance within which candidates are selected. For example, with a best candidate score of 45, the candidates that fall within the threshold variance (say 10) should be included. Thus all candidates with a score greater than 35 (45 - 10) would be selected.
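A rough sketch of that threshold idea, in case it helps the discussion. The `Candidate` struct and the `candidates_within_threshold` method are hypothetical names made up for illustration; they are not part of ruby-readability, whose internal candidates are shaped differently.

```ruby
# Hypothetical candidate record (assumption: just a label and a score).
Candidate = Struct.new(:name, :score)

# Return every candidate whose score is strictly greater than
# (best score - variance), highest score first.
def candidates_within_threshold(candidates, variance: 10)
  return [] if candidates.empty?

  best = candidates.map(&:score).max
  candidates
    .select { |c| c.score > best - variance }
    .sort_by { |c| -c.score }
end

candidates = [
  Candidate.new("div#intro", 45),
  Candidate.new("div#section-2", 38),
  Candidate.new("div#sidebar", 12)
]

selected = candidates_within_threshold(candidates, variance: 10)
puts selected.map(&:name).inspect
# => ["div#intro", "div#section-2"]
```

With a best score of 45 and a variance of 10, anything scoring above 35 is kept, so the second section survives while the sidebar is dropped. How the selected candidates would then be merged back into one piece of content is left open here.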
Hey @boncey, I'd love to see custom parsers, but I don't have the time to maintain a repository of them. If anyone is aware of other libraries that support this, I'll gladly link to them. Also happy to merge contributions and code improvements!
I would definitely accept a pull request (with specs) that adds this feature.
On Tuesday, February 11, 2014, naotohc notifications@github.com wrote:
Oh, I didn't understand that the ref from private repo is linked =< .
Hello, yeah, maybe I will, if I decide it's really an issue for the service I'm developing.
i don't think dedicated parsers are the way to go. rather the scoring system needs a little tuning.
the issue with medium.com (the first example from above) is that the main post is nested inside multiple structural containers, which makes the parent/grand-parent scoring useless because a great-grandparent or great-great-grandparent would be needed. but this'll get ridiculous.
i'm thinking about scoring nodes based on inner_text length and id, class attributes (status quo) and then look which parent node -- up until "a root node" -- contains the most highly scored nodes and return this container node (cleaned up) as the content.
makes sense?
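One possible reading of that proposal, as a minimal sketch. Everything in it is an assumption made for illustration: the `Node` class is a stand-in for a real DOM node (a real version would walk Nokogiri nodes), the `POSITIVE`/`NEGATIVE` patterns and the weights are invented, and "contains the most highly scored nodes" is interpreted as "the deepest ancestor that holds the largest number of high-scoring leaf nodes".

```ruby
# Hypothetical stand-in for a DOM node; not a Nokogiri or ruby-readability class.
class Node
  attr_reader :name, :id, :klass, :text, :children
  attr_accessor :parent

  def initialize(name, id: nil, klass: nil, text: "", children: [])
    @name, @id, @klass, @text, @children = name, id, klass, text, children
    @parent = nil
    children.each { |c| c.parent = self }
  end

  # Own text plus the text of all descendants.
  def inner_text
    text + children.map(&:inner_text).join
  end

  def descendants
    children.flat_map { |c| [c] + c.descendants }
  end

  # Nearest ancestor first, root last.
  def ancestors
    parent ? [parent] + parent.ancestors : []
  end
end

POSITIVE = /article|body|content|entry|post|text/i # assumed patterns
NEGATIVE = /comment|footer|nav|sidebar|share/i     # assumed patterns

# Score a node from its text length plus id/class hints (assumed weights).
def node_score(node)
  score = node.inner_text.length / 100.0
  hint = "#{node.id} #{node.klass}"
  score += 25 if hint.match?(POSITIVE)
  score -= 25 if hint.match?(NEGATIVE)
  score
end

# Count, for every ancestor, how many high-scoring leaves it contains,
# then return the deepest ancestor that holds the maximum count.
def best_container(root, min_score: 1.0)
  high = root.descendants.select { |n| n.children.empty? && node_score(n) >= min_score }
  return root if high.empty?

  counts = Hash.new(0)
  high.each { |n| n.ancestors.each { |a| counts[a] += 1 } }
  max = counts.values.max
  counts.select { |_, c| c == max }.keys.max_by { |a| a.ancestors.length }
end

# A Medium-like page: paragraphs buried under several wrapper divs.
def build_sample_page
  para = ->(len) { Node.new("p", text: "x" * len) }
  section1 = Node.new("div", klass: "section", children: [
    Node.new("div", klass: "layout", children: [para.call(300), para.call(200)])
  ])
  section2 = Node.new("div", klass: "section", children: [
    Node.new("div", klass: "layout", children: [para.call(400)])
  ])
  article = Node.new("div", id: "post-content", children: [section1, section2])
  sidebar = Node.new("div", klass: "sidebar", children: [para.call(40)])
  Node.new("body", children: [article, sidebar])
end

puts best_container(build_sample_page).id
# => post-content
```

In this toy page the three long paragraphs sit three levels below `div#post-content`, yet that container is still the one returned, without hard-coding how many generations to climb. A stray long paragraph elsewhere on the page would pull the result up to a shared ancestor, so a real version would probably need weighting rather than plain counting.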
We can play with it. Maybe pluggable scoring systems? This library's scoring is based on the original Readability library, so that should likely stay the default scoring system?
On 26 Apr 2014, at 00:34, Andrew Cantino wrote:
We can play with it. Maybe pluggable scoring systems? This library's scoring is based on the original Readability library, so that should likely stay the default scoring system?
Yes, however we need somehow to clearly define what this 'original readability scoring system' is.
Another such scoring system would be one which could extract just the comments of a page..
/pagojo
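To make the "pluggable scoring systems" idea a bit more concrete, here is a small sketch. The `Extractor` class, its `scorer:` option, and the block hashes are all hypothetical and are not ruby-readability's API; the default scorer is only a placeholder for whatever the existing Readability-derived scoring would be.

```ruby
# Hypothetical extractor with a swappable scorer (not ruby-readability's API).
class Extractor
  # Placeholder default (assumption): longer text scores higher.
  DEFAULT_SCORER = ->(block) { block[:text].length }

  def initialize(scorer: DEFAULT_SCORER)
    @scorer = scorer
  end

  # blocks: array of hashes like { css_class: "post", text: "..." }
  # Returns the block with the highest score, or nil for an empty list.
  def best(blocks)
    blocks.max_by { |b| @scorer.call(b) }
  end
end

# An alternative scorer that only rewards blocks that look like comments.
COMMENT_SCORER = lambda do |block|
  block[:css_class].to_s.include?("comment") ? block[:text].length : 0
end

blocks = [
  { css_class: "post-body", text: "a" * 500 },
  { css_class: "comment-list", text: "b" * 120 }
]

puts Extractor.new.best(blocks)[:css_class]
# => post-body
puts Extractor.new(scorer: COMMENT_SCORER).best(blocks)[:css_class]
# => comment-list
```

Any object that responds to `call` can act as a scorer here, so the default behaviour stays untouched while a comments-only extractor becomes a one-line swap.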
Problem still exists, are there any possible solutions?
I'm not actively using this library anymore, so I'm not working on further development. I'm very open to pull requests and improvements, however, and would happily add another maintainer if someone wants to get involved and push it forward.
I've found that ruby-readability seems to have problems with a couple of blog posts on Medium (it may be more, I've only tested two).
https://medium.com/our-addictions/ae81e19b0289 & https://medium.com/on-product-management/926ab5c39156.
ruby-readability seems to see the various paragraphs (divided by <hr class="section-divider">) as separate sections and then it picks the one with the highest score. It then just shows the text from what looks like the longest paragraph.
Is there likely to be an easy fix for things like this - or some way of working around it for specific sites? Instapaper used to have a custom parser with user-contributed rules (although it seems to have gone away since it was sold). Have there been any thoughts about doing that sort of thing for ruby-readability at all?
Thanks, Darren.