cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

Allow passing in an array of elements_to_score and add 'pre' as a default #93

Closed: tuzz closed this 1 year ago

tuzz commented 1 year ago

We ran into a problem where the <h1> text was not included in the output of Readability#content. The following example demonstrates it:

<article>
  <header>
    <h1>Title</h1>
  </header>
  <section>
    <p>Paragraph</p>
  </section>
</article>

Previously, the code would add the <p>, <section> and <article> elements as @candidates because it adds the parent and grandparent of every <p>. It would not add the <header> element as a candidate.

The best_candidate (the candidate with the highest score) is then the <section> element. Next, #get_article tries to add related siblings, but it did not add the <header> element because <header> was not in the list of candidates.

We can solve this problem by adding <h1> to the list of elements to score. This ensures that the <header> parent is included in the candidates and can be added as a related sibling.
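To make the candidate logic concrete, here is a minimal sketch in plain Ruby. It is not the gem's code: the Node class and the candidate_names helper are hypothetical, scoring is left out entirely, and the sketch only models the "parent and grandparent of every scored element" rule described above.

```ruby
# Hypothetical, simplified tree node; the gem itself works on Nokogiri nodes.
class Node
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
  end

  # Append a child element and return it so calls can be chained.
  def add(name)
    child = Node.new(name, self)
    @children << child
    child
  end

  # This node followed by all of its descendants, depth first.
  def descendants
    [self] + children.flat_map(&:descendants)
  end
end

# Assumed model of candidate collection: for every element whose tag name is
# in elements_to_score, record its parent and grandparent (no scoring here).
def candidate_names(root, elements_to_score)
  root.descendants
      .select { |node| elements_to_score.include?(node.name) }
      .flat_map { |node| [node.parent, node.parent&.parent] }
      .compact
      .uniq
      .map(&:name)
end

# The example document from this description.
article = Node.new("article")
article.add("header").add("h1")
article.add("section").add("p")

puts candidate_names(article, %w[p]).include?("header")    # header is never a candidate
puts candidate_names(article, %w[p h1]).include?("header") # header becomes a candidate
```

In this model, scoring only <p> never reaches <header>, so it cannot be picked up as a related sibling later. Once <h1> is in the list, <header> enters the candidates as the parent of <h1>.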

This commit also adds <pre> to the list of default nodes to score because it is included in arc90's original code here:

https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749

I'm not sure why this was omitted.

tuzz commented 1 year ago

Superseded by https://github.com/cantino/ruby-readability/pull/94 which sets a different source branch.