edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Something funny with the priority calculation? #160

Closed gretchengehrke closed 3 years ago

gretchengehrke commented 3 years ago

Hi Rob,

This is not high-priority, but I'm wondering if there could be something funny going on with the priority calculation, and specifically the key terms counting. The Scanner output file indicated climate -12 and emissions -2 for both of these pages, but I've scrolled through all of the dates between Jan 20 and Feb 6, and can't find any changes to either of those terms.

https://monitoring.envirodatagov.org/page/cf2a6d66-f088-4342-8b90-fc8c0001e210/3fa0b5f9-5001-41d3-bbad-9350f218dfbd..40f3743a-5836-4090-951c-cad4f7144431
https://monitoring.envirodatagov.org/page/5f01c656-4bb3-4d84-9bbe-4e51ef34a54e/0633d575-4ca2-4644-8eda-51d59d491327..2d22b45e-5fb3-4bac-b49e-fa1bf4c2b793

The Scanner output for this page said 0.814 for text changed (I think meaning 81.4%), but I can only see a very small amount of text that changed.

https://monitoring.envirodatagov.org/page/5dbeec25-9ad6-4e96-a267-343166d5d323/614a6454-6ec3-4e8f-8e4a-437fc54a219f..6e1e12c7-6ba3-4b71-b161-bd563b39e0ad

I just wanted to give you a heads up that there seem to be some oddities. It very well might be user error, but I've checked the usual culprits (intervening dates and collapsible sections) and haven't found anything yet.

Mr0grog commented 3 years ago

So: what happened here is that the EPA made a very minor tweak to the markup that resulted in Readability no longer seeing the “References” section at the bottom of the page. That section lists lots of documents with “climate” in their names, so losing it drops the term counts even though the visible page text didn't change. 😞

The BLM page is similar: a markup change caused lots of other parts of the page to be included by Readability that weren’t before.

This has popped up from time to time before and I think I mentioned it as one of the issues with Readability — it works great a lot of the time, but because it’s a fuzzy algorithm for determining what the main body of the page is, it can always go kind of haywire on small markup changes.
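To illustrate the effect described above, here is a minimal sketch of how a key-term delta can go negative when only the extracted region changes. It is not the Scanner's actual code: `KEY_TERMS`, `count_terms`, `term_deltas`, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical term list, for illustration only.
KEY_TERMS = ("climate", "emissions")


def count_terms(text, terms=KEY_TERMS):
    """Count whole-word, case-insensitive occurrences of each term."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: words[term] for term in terms}


def term_deltas(old_text, new_text, terms=KEY_TERMS):
    """Return new-minus-old counts for each term."""
    old = count_terms(old_text, terms)
    new = count_terms(new_text, terms)
    return {term: new[term] - old[term] for term in terms}


# The same page content, extracted before and after a markup tweak.
# In the second extraction the (unchanged) references section is no
# longer recognized as part of the main content.
body = "The agency regulates emissions from power plants."
references = "References: Climate Indicators Report. Climate and Emissions Data."

extracted_before = body + "\n" + references
extracted_after = body

print(term_deltas(extracted_before, extracted_after))
# {'climate': -2, 'emissions': -1}
```

Nothing a reader would see on the page changed here; only the extractor's idea of the "main body" did, which is why scrolling through the diffs turns up no edits to those terms.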

For some URLs we use a fallback that sometimes works better and sometimes worse; we can add more URLs to that list if you think we should: https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/5c111a496b2a40e84740081278cebdd020356f3e/analyst_sheets/analyze.py#L32-L34
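A per-URL fallback of this kind could look roughly like the sketch below. It does not reproduce the linked lines of `analyze.py`: the pattern list, the function names, and the choice of plain whole-page text as the fallback are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical list of URL patterns where main-content extraction is
# known to be unreliable. The real list lives in analyze.py.
FALLBACK_URL_PATTERNS = (
    re.compile(r"^https?://www\.example\.gov/library/"),
)


class _TextCollector(HTMLParser):
    """Collect all text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def whole_page_text(html):
    """Assumed fallback: all visible text, with no main-body guessing."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def uses_fallback(url, patterns=FALLBACK_URL_PATTERNS):
    return any(pattern.search(url) for pattern in patterns)


def extract_text(url, html, main_content_extractor):
    """Use the fallback for listed URLs, the fuzzy extractor otherwise."""
    if uses_fallback(url):
        return whole_page_text(html)
    return main_content_extractor(html)
```

The trade-off matches the comment above: whole-page text is stable across small markup tweaks, but it also pulls in navigation and footer text that a main-content extractor would normally drop.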

Mr0grog commented 3 years ago

Doesn't seem like there's anything especially actionable here, so I’m closing this issue.