Allow passing in an array of elements_to_score and add 'pre' as a default
We were experiencing a problem where the <h1> text was not being included in the output of Readability#content. Here is an example that demonstrates the problem:
Previously, the code would add the <p>, <section> and <article> elements as @candidates because it adds the parent and grandparent of every <p>. It would not add the <header> element as a candidate.
Then, the best_candidate with the highest score is the <section> element. The code then tries to add related siblings in #get_article but it wasn't adding the <header> element because it wasn't in the list of candidates.
We can solve this problem by adding <h1> to the list of elements to score which will then ensure that the <header> parent is included in the candidates and can be added as a related sibling.
This commit also adds <pre> to the list of default nodes to score because it is included in arc90's original code here: https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749
I'm not sure why this was omitted.
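For readers who want to see the candidate rule in isolation, here is a minimal sketch of the "parent and grandparent of every scored element" behavior described above. It is not the gem's code: the `Node` class and `candidate_names` helper are hypothetical stand-ins for the real DOM handling, the document shape (article > header > h1, plus article > section > p) is assumed for illustration, and scoring itself is left out entirely.

```ruby
# Hypothetical stand-in for a DOM node; the real code works on parsed HTML.
class Node
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @children = []
  end

  # Append a child element and return it so trees can be built inline.
  def add(child_name)
    child = Node.new(child_name, self)
    @children << child
    child
  end

  # This node followed by all of its descendants, depth first.
  def descendants
    [self] + children.flat_map(&:descendants)
  end
end

# Names of the parent and grandparent of every element whose tag is scored.
# This models only the candidate-collection rule, not the scoring.
def candidate_names(root, elements_to_score)
  root.descendants
      .select { |node| elements_to_score.include?(node.name) }
      .flat_map { |node| [node.parent, node.parent&.parent] }
      .compact
      .uniq
      .map(&:name)
end

# Assumed document shape: <article><header><h1/></header><section><p/><p/></section></article>
article = Node.new("article")
article.add("header").add("h1")
section = article.add("section")
2.times { section.add("p") }

p candidate_names(article, %w[p td])        # => ["section", "article"]
p candidate_names(article, %w[p td pre h1]) # => ["header", "article", "section"]
```

In the first call <header> never becomes a candidate, so it cannot be picked up as a related sibling. Once <h1> is scored, <header> enters the list. With the gem itself, the equivalent would presumably be to pass the option named in the title, something like `Readability::Document.new(html, :elements_to_score => %w[p td pre h1]).content`, though the exact call shape is an assumption here.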