Previously, the code would add the <p>, <section> and <article> elements
as @candidates because it adds the parent and grand parent of every <p>. It
would not add the <header> element as a candidate.
Then, the best_candidate with the highest score is the <section> element.
The code then tries to add related siblings in #get_article but it wasn't
adding the <header> element because it wasn't in the list of candidates.
We can solve this problem by adding <h1> to the list of elements to score
which will then ensure that <header> parent is included in the
candidates and can be added as a related sibling.
This commit also adds <pre> to the list of default nodes to score
because it is included in arc90's original code here:
I'm not sure why this was omitted. Perhaps remove remove code blocks?
Furthermore, this pull request adds a second commit that attempts to solve a follow-on problem:
The code had two strategies for determining whether to include siblings
in the output after determining the best candidate based on score:
1) It checked if the sibling is a candidate that scored above a threshold
which is the maximum of 10 and 0.2 of the best_candidate’s score.
2) It checked if the sibling was a paragraph that was longer than 80
characters with a penalty given for each link within the paragraph.
Neither of these strategies worked well for extracting <h1> titles:
1) Failed because titles score poorly due to not containing many commas
2) Failed because titles are within <h1> or <header> elements
However, titles are usually longer than 80 characters and don’t contain
links so it seems reasonable to modify strategy 2) to allow for other
elements, such as <h1> and <header> to be included as related siblings.
Therefore, this commit introduces a :likely_silings option that defaults
to the same <p> elements as before but can now be set by the developer
to include other elements such as <h1> and <header>. These are not
added by default to remain in sync with Arc90’s original implementation.
Allow passing in an array of elements_to_score and add 'pre' as a default
We were experiencing a problem where the h1 text was not being included in the Readability#content. Here is an example that demonstrates the problem:
Previously, the code would add the
<p>
,<section>
and<article>
elements as@candidates
because it adds the parent and grand parent of every<p>
. It would not add the<header>
element as a candidate.Then, the best_candidate with the highest score is the
<section>
element. The code then tries to add related siblings in #get_article but it wasn't adding the<header>
element because it wasn't in the list of candidates.We can solve this problem by adding
<h1>
to the list of elements to score which will then ensure that<header>
parent is included in the candidates and can be added as a related sibling.This commit also adds
<pre>
to the list of default nodes to score because it is included in arc90's original code here:https://github.com/masukomi/arc90-readability/blob/master/js/readability.js#L749
I'm not sure why this was omitted. Perhaps remove remove code blocks? Furthermore, this pull request adds a second commit that attempts to solve a follow-on problem:
The code had two strategies for determining whether to include siblings in the output after determining the best candidate based on score:
1) It checked if the sibling is a candidate that scored above a threshold which is the maximum of 10 and 0.2 of the best_candidate’s score.
2) It checked if the sibling was a paragraph that was longer than 80 characters with a penalty given for each link within the paragraph.
Neither of these strategies worked well for extracting
<h1>
titles:1) Failed because titles score poorly due to not containing many commas 2) Failed because titles are within
<h1>
or<header>
elementsHowever, titles are usually longer than 80 characters and don’t contain links so it seems reasonable to modify strategy 2) to allow for other elements, such as
<h1>
and<header>
to be included as related siblings.Therefore, this commit introduces a
:likely_silings
option that defaults to the same<p>
elements as before but can now be set by the developer to include other elements such as<h1>
and<header>
. These are not added by default to remain in sync with Arc90’s original implementation.