edgi-govdata-archiving / web-monitoring-task-sheets

Experimental new tool for generating weekly analyst task sheets for web monitoring
GNU General Public License v3.0

Reduce priority of changes in “news boxes” #7

Closed Mr0grog closed 3 years ago

Mr0grog commented 3 years ago

A lot of pages have boxes/panels/asides with recent news items, and we should probably try not to include them when calculating priority.

In pages that can be parsed with Readability, this probably isn’t much of an issue (the news boxes should be removed from the main page content), but it is a problem in pages that aren’t readable. For non-readable pages, we should:

  1. Develop some heuristics to try and pull out the main content
  2. Develop some heuristics to remove erroneous stuff like these news boxes from the above (since they will often be in the main content area)

Here’s an example of a really tough page (identifying the news box is hard here):

[Screenshot: Screen Shot 2020-11-11 at 3 13 43 PM]

(From https://monitoring.envirodatagov.org/page/a4789792-e6d3-4b6e-948f-b8589b9c1ae3/cc83203e-0b34-4e44-8982-9c78ae6df41e..ea744de5-12be-4f15-8e2d-7e04a575bd76)

Mr0grog commented 3 years ago

So I think we can do a few things, and should do them each piecemeal, as separate work:

  1. If a page is not readable, have some fallback heuristics to identify the main content area and only diff that. This can start simple. For example, in the case above, looking for items matching this selector and picking the deepest one would have done a reasonably decent job:

    main, [role="main"], #main-content, .main-content, .main-column, #main, .main

    (Maybe pick the deepest one with > X descendant nodes, just to make sure we didn’t pick something too narrow?)

  2. Get more fancy by removing obviously erroneous nodes. (Some selectors on this page that jump out and seem reasonably general: .region-preface, [id*="social"], [id*="share"], [id*="sharing"], [class*="social"], [class*="share"], [class*="sharing"].)

    Another good heuristic here: elements that are mostly links with target="_blank" attributes. (Although that might be a tough one to determine in a sane and expedient way.)

  3. Try and remove this news block. Since it’s so generic, we probably need a way to figure out whether this is a news-focused page and only remove it if not. (Ideas: “news,” “press,” or “blog” in the URL or title). Then identify this block. Maybe a selector like: .box.news, .panel.news, .pane.news, .panel-pane.news, [class*="news-release"] (worried that last one might be too generic).
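For illustration, here is a rough sketch of step 1 (pick the deepest main-content candidate with more than X descendants) plus the "is this a news-focused page?" check from step 3. It is not the code that ended up in this repo. It uses only Python's standard-library html.parser, and every name in it (is_news_page, MainContentFinder, pick_main_content, min_descendants) is hypothetical. It also only approximates the selector list above by checking tag name, role, id, and class directly, and it counts element nodes only.

```python
from html.parser import HTMLParser

# Keywords from step 3. Substring matching is an assumption and can
# over-match (e.g. "express" contains "press").
NEWS_KEYWORDS = ("news", "press", "blog")

# Approximation of: main, [role="main"], #main-content, .main-content,
# .main-column, #main, .main
MAIN_IDS = {"main-content", "main"}
MAIN_CLASSES = {"main-content", "main-column", "main"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def is_news_page(url, title):
    """Guess whether a page is news-focused from its URL and title."""
    text = f"{url} {title}".lower()
    return any(word in text for word in NEWS_KEYWORDS)


class MainContentFinder(HTMLParser):
    """Records every element that matches the candidate list above."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements
        self.candidates = []  # elements that matched a main-content rule

    def handle_starttag(self, tag, attrs):
        # Every new element is a descendant of everything still open.
        for node in self.stack:
            node["descendants"] += 1
        if tag in VOID_TAGS:
            return  # void elements never have children, so never push them
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        node = {"tag": tag, "id": attrs.get("id"),
                "depth": len(self.stack), "descendants": 0}
        if (tag == "main" or attrs.get("role") == "main"
                or attrs.get("id") in MAIN_IDS or classes & MAIN_CLASSES):
            self.candidates.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name. Unclosed
        # tags (e.g. <p> without </p>) can inflate depth; fine for a sketch.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break


def pick_main_content(html, min_descendants=3):
    """Return the deepest candidate with > min_descendants elements, or None."""
    finder = MainContentFinder()
    finder.feed(html)
    finder.close()
    big_enough = [c for c in finder.candidates
                  if c["descendants"] > min_descendants]
    return max(big_enough, key=lambda c: c["depth"], default=None)


page = """
<html><body>
  <div id="main">
    <aside class="news"><a href="/a">Item</a></aside>
    <div class="main-content">
      <h1>Title</h1><p>One</p><p>Two</p><p>Three</p>
    </div>
  </div>
</body></html>
"""
chosen = pick_main_content(page)
print(chosen["tag"], chosen["depth"], chosen["descendants"])
```

In this made-up page, the inner .main-content div is chosen over #main because it is deeper and still has more than min_descendants elements, which leaves the news aside out of the diffed region. Returning None when nothing qualifies matches the preference for being too specific rather than too broad.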

Mr0grog commented 3 years ago

Goes without saying this needs a wider survey across some random pages from other domains in order to make sure some of the suggested selectors above are reasonable. It would be better to be too specific (and fail to detect a main content area, or news block or whatever), than too broad (detecting something that’s not main content).

Mr0grog commented 3 years ago

Made some quick attempts at addressing this for today’s run. Not a remotely complete or perfect solution, but it solves the specific case at the start of this issue. ¯\_(ツ)_/¯

Mr0grog commented 3 years ago

This has since been done.