Open tybenz opened 10 years ago
Hey @tybenz! Interesting idea. Do you want to work on a pull request for that?
Yeah. I'd love to. I don't know enough about the scoring algorithm though. Wondering if you had any ideas on what a good start might be.
No problem. I'd try to write a failing spec, then I'd take a look at score_node, class_weight, and REGEXES and see if something similar could be written to estimate which node is the title.
Also, I want to get something straight. Is it true that you only ever score p tags, td tags, and their parents and grandparents?
https://github.com/cantino/ruby-readability/blob/master/lib/readability.rb#L270-L271
Am I missing something?
Sorry to necro this issue. Yes, that's right @tybenz, it only scores <p>, <td> and their parents and grandparents.
Today I opened a pull request to allow you to specify other nodes to score, such as <h1> elements that might be nested inside a <header> element, which would not be included in the list of candidates. See https://github.com/cantino/ruby-readability/pull/93
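To make the candidate selection described above concrete, here is a toy sketch in plain Ruby. It does not reproduce the library's code: `Node`, `candidate_names`, and the `scored` parameter are hypothetical names made up for illustration, and the real library works on Nokogiri nodes rather than a hand-built tree. It only shows why an <h1> nested inside a <header> never reaches the candidate list unless the set of scored elements is widened.

```ruby
# Toy stand-in for a parsed DOM node (assumption: the real library uses Nokogiri).
Node = Struct.new(:name, :children, :parent) do
  def initialize(name, children = [])
    super(name, children, nil)
    children.each { |child| child.parent = self }
  end

  # Yields this node and every descendant, depth first.
  def each_node(&block)
    yield self
    children.each { |child| child.each_node(&block) }
  end
end

# Hypothetical helper: collects each scored element together with its parent
# and grandparent, following the description in this thread.
def candidate_names(root, scored = %w[p td])
  candidates = []
  root.each_node do |node|
    next unless scored.include?(node.name)

    [node, node.parent, node.parent&.parent].compact.each do |candidate|
      # Compare by identity so two distinct <p> nodes are both kept.
      candidates << candidate unless candidates.any? { |seen| seen.equal?(candidate) }
    end
  end
  candidates.map(&:name)
end

# <article><header><h1/></header><div><p/><p/></div></article>
root = Node.new("article", [
  Node.new("header", [Node.new("h1")]),
  Node.new("div", [Node.new("p"), Node.new("p")])
])

default_names = candidate_names(root)
widened_names = candidate_names(root, %w[p td h1])
```

With the default list, neither the <h1> nor the <header> shows up among the candidates; adding "h1" to the scored elements brings both in.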
Readability pulls its article title from the title tag, right? Well, more often than not, the title tag has a whole lot of other information besides just the title of the article. It usually includes the title of the site itself and sometimes a category. I know the original readability script just grabbed the title, but I'm wondering if this version of the script can be modified to grab the actual title of the article from the markup. It seems as though the scoring system is set up to exclude the header tag that contains the article title.
Example:
In the above example, readability will always grab the content from .article-content and not the <article> tag itself. What can I do to modify the script to grab the whole article, title and all?
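One possible starting point for the title question, sketched under assumptions: since the title tag usually contains the article title plus the site name and a category, a heading whose text also appears inside the title tag text is a plausible article title. `guess_article_title` is a hypothetical helper, not part of ruby-readability, and it takes plain strings so that how the headings are extracted from the markup is left open.

```ruby
# Hypothetical helper: picks the heading whose text also appears in the <title> text.
# title_text    - the full text of the <title> tag
# heading_texts - texts of candidate headings (for example every <h1> and <h2>)
def guess_article_title(title_text, heading_texts)
  normalized_title = title_text.downcase.squeeze(" ").strip

  matches = heading_texts.map(&:strip).reject(&:empty?).select do |heading|
    normalized_title.include?(heading.downcase.squeeze(" "))
  end

  # Prefer the longest matching heading; fall back to the full <title> text.
  matches.max_by(&:length) || title_text.strip
end

guessed = guess_article_title(
  "How to Bake Bread | Food | Example Site",
  ["Example Site", "How to Bake Bread"]
)
fallback = guess_article_title("  Plain title ", [])
```

Preferring the longest match is a guess: it assumes the article headline is longer than the site name, which will not hold on every site.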