cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0
919 stars 170 forks source link

Allow passing in an array of elements_to_score and add 'pre' as a default #94

Closed tuzz closed 8 months ago

tuzz commented 8 months ago

Allow passing in an array of elements_to_score and add 'pre' as a default

We were experiencing a problem where the h1 text was not being included in the Readability#content. Here is an example that demonstrates the problem:

<article>
  <header>
    <h1>Title</h1>
  </header>
  <section>
    <p>Paragraph</p>
  </section>
</article>

Previously, the code would add the <p>, <section> and <article> elements as @candidates because it adds the parent and grand parent of every <p>. It would not add the <header> element as a candidate.

Then, the best_candidate with the highest score is the <section> element. The code then tries to add related siblings in #get_article but it wasn't adding the <header> element because it wasn't in the list of candidates.

We can solve this problem by adding <h1> to the list of elements to score which will then ensure that <header> parent is included in the candidates and can be added as a related sibling.

This commit also adds <pre> to the list of default nodes to score because it is included in arc90's original code here:

https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749

I'm not sure why this was omitted. Perhaps remove remove code blocks? Furthermore, this pull request adds a second commit that attempts to solve a follow-on problem:

The code had two strategies for determining whether to include siblings in the output after determining the best candidate based on score:

1) It checked if the sibling is a candidate that scored above a threshold which is the maximum of 10 and 0.2 of the best_candidate’s score.

2) It checked if the sibling was a paragraph that was longer than 80 characters with a penalty given for each link within the paragraph.

Neither of these strategies worked well for extracting <h1> titles:

1) Failed because titles score poorly due to not containing many commas 2) Failed because titles are within <h1> or <header> elements

However, titles are usually longer than 80 characters and don’t contain links so it seems reasonable to modify strategy 2) to allow for other elements, such as <h1> and <header> to be included as related siblings.

Therefore, this commit introduces a :likely_silings option that defaults to the same <p> elements as before but can now be set by the developer to include other elements such as <h1> and <header>. These are not added by default to remain in sync with Arc90’s original implementation.

cantino commented 8 months ago

Thanks @tuzz!

nattsw commented 1 month ago

@cantino will you be releasing a new version of this on https://rubygems.org/gems/ruby-readability? 🤗

cantino commented 4 weeks ago

Done, 0.7.1.