edgi-govdata-archiving / web-monitoring-task-sheets

Experimental new tool for generating weekly analyst task sheets for web monitoring
GNU General Public License v3.0

Rethink use of Readability #9

Open Mr0grog opened 3 years ago

Mr0grog commented 3 years ago

On some pages, Readability succeeds at extracting a textual body, but does such a poor job that it would have been better if it had failed. For example, consider this change, which scores a priority of 1.0:

[Screenshot: Screen Shot 2020-12-01 at 12.25.26 PM]

View in Scanner: https://monitoring.envirodatagov.org/page/080165d7-873d-4319-9a2f-8e5388a1933b/74127563-9ead-42b2-bf50-b354fbf4d3c5..29cd854c-b629-44ee-bd26-a5deed5b0620
Page in API: https://api.monitoring.envirodatagov.org/api/v0/pages/080165d7-873d-4319-9a2f-8e5388a1933b
Left version in API: https://api.monitoring.envirodatagov.org/api/v0/versions/74127563-9ead-42b2-bf50-b354fbf4d3c5
Right version in API: https://api.monitoring.envirodatagov.org/api/v0/versions/29cd854c-b629-44ee-bd26-a5deed5b0620

There are no textual changes in the page body! What happened? Well, the markup changed very slightly, causing Readability to ignore 80% of the page’s content in the newer version (technical details at bottom). So for text comparison purposes, most of the page was removed. That would definitely be significant, if only it had actually happened.

This case is probably extreme, since Readability parsed differently in the two versions. But I also know I’ve seen examples of pages that just have a lot of their main content excluded by Readability in both versions. Unfortunately, I don’t know how widespread or serious the issue is. I think it’s probably not a majority of cases, but I don’t know if it’s 1% or 45%.
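One way to get a rough handle on how widespread this is might be to compare how much text Readability keeps against the page's full text, for both versions of a change. The following is only a sketch: `coverage_ratio`, `extraction_is_suspect`, and the `min_ratio` / `max_drift` thresholds are hypothetical names and values, not anything that exists in this repo, and it assumes the caller already has both the extracted text and the full page text as strings.

```python
def coverage_ratio(extracted_text: str, full_text: str) -> float:
    """Share of the page's words that survived extraction (0.0 to 1.0)."""
    full_words = full_text.split()
    if not full_words:
        return 1.0
    return min(1.0, len(extracted_text.split()) / len(full_words))


def extraction_is_suspect(old_pair, new_pair, min_ratio=0.5, max_drift=0.3):
    """Flag a change whose extraction looks unreliable.

    old_pair and new_pair are (extracted_text, full_text) tuples.
    The thresholds are placeholder guesses, not tuned values.
    """
    old_ratio = coverage_ratio(*old_pair)
    new_ratio = coverage_ratio(*new_pair)
    # Either version lost most of its text during extraction...
    too_sparse = min(old_ratio, new_ratio) < min_ratio
    # ...or the two versions were extracted very differently.
    drifted = abs(old_ratio - new_ratio) > max_drift
    return too_sparse or drifted
```

Counting how often this flags a change across a week of data would at least say whether the problem is closer to 1% or 45%.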

We might be better off finding a more conservative method of separating main content from headers/footers/nav/etc.
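As a rough illustration of what "more conservative" could mean, here is a sketch that keeps all text except what sits inside a short list of obviously non-content elements. It uses only Python's standard `html.parser`. The tag list and the names `ConservativeTextExtractor` and `conservative_text` are assumptions made for this example, and real pages would likely need more than tag names (ARIA roles, known class names, and so on).

```python
from html.parser import HTMLParser

# Assumption: only these elements are treated as boilerplate; all else is kept.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ConservativeTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any boilerplate element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def conservative_text(html: str) -> str:
    parser = ConservativeTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because this never scores or discards containers in the body, wrapping sections in extra `<div>`s does not change its output. The tradeoff is that it keeps any boilerplate that is not marked up with one of the listed tags.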


Specific Explanation of This Readability Failure

The page is mostly made of paragraph-sized bullet points. They used to be all inline in a big container:

<p>An introductory paragraph.</p>
<h2>Section Header</h2>
<ul>
  <li>A Bullet</li>
  <li>Point</li>
</ul>
<h2>Another Section Header</h2>
<ul>
  ...etc...

But now each section is wrapped in its own <div> (which has no visual or textual impact at all):

<div>
  <p>An introductory paragraph.</p>
</div>
<div>
  <h2>Section Header</h2>
  <ul>
    <li>A Bullet</li>
    <li>Point</li>
  </ul>
</div>
<div>
  <h2>Another Section Header</h2>
  <ul>
    ...etc...

Readability is biased against lists (because they are often used for navigation, news feeds, etc.): if an element is composed primarily of lists, Readability tends to throw that element out. Because the lists were previously included alongside the introductory text that occurred in normal paragraphs, they were considered part of the content. Once they were isolated in their own containers, however, Readability saw them as non-content elements.
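To make the effect concrete, here is a toy measurement of how "list-heavy" each container looks before and after the wrapping. This is not Readability's actual scoring, which I have not reproduced here. It just computes, per `<div>` and for the document as a whole, the fraction of text characters that sit inside `<li>` elements. `ListShareParser` and `list_shares` are made-up names for this sketch.

```python
from html.parser import HTMLParser


class ListShareParser(HTMLParser):
    """Tracks the share of text inside <li> for each <div> and for the root."""

    def __init__(self):
        super().__init__()
        # Each stack entry is [total_chars, list_chars]; index 0 is the root.
        self._stack = [[0, 0]]
        self._li_depth = 0
        self.shares = []  # one entry per closed <div>, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append([0, 0])
        elif tag == "li":
            self._li_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and len(self._stack) > 1:
            total, in_list = self._stack.pop()
            self.shares.append(in_list / total if total else 0.0)
            # Text inside a child container also counts toward its parent.
            self._stack[-1][0] += total
            self._stack[-1][1] += in_list
        elif tag == "li" and self._li_depth > 0:
            self._li_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self._stack[-1][0] += n
        if self._li_depth:
            self._stack[-1][1] += n

    def root_share(self):
        total, in_list = self._stack[0]
        return in_list / total if total else 0.0


def list_shares(html):
    """Return (share for the whole document, [share for each <div>])."""
    parser = ListShareParser()
    parser.feed(html)
    parser.close()
    return parser.root_share(), parser.shares
```

Run on the two simplified snippets above, the document-wide share is identical in both versions, but in the wrapped version the `<div>` holding a list has a higher list share than the document as a whole, while the intro `<div>` has none. A heuristic that judges containers one at a time sees very different inputs even though the text is the same.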

This is especially rough because many might consider the new version to be better markup (especially if the page's authors had used <section> instead of <div>, but 🤷), even though Readability handles it poorly.