edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0
11 stars 4 forks source link

`html_token` diff should highlight links when only `href` attr changed #4

Open Mr0grog opened 6 years ago

Mr0grog commented 6 years ago

The html_token diff (based on lxml.html.diff) used to pick up changes to a link’s href attribute by virtue of sticking the text “Link: [url]” after the link’s main text. This got really rough and noisy in practice, though, because many links are images (so the text broke the layout) or because URLs are extremely long and unbreakable (making layout nuts, especially when the link is a heading, which is very common), so we removed it by request of the analyst team. That left us with a single blank space getting highlighted in its stead, which was confusing as all get out (it also happened in other situations besides links), so we removed that, too, and now you can’t tell visually when a link’s URL has changed at all.

The change_href test case demonstrates this really well. Output looks like:

<html>
  <!-- clipped to brevity -->
  <body>
    <p>
      Here is a pragraph with
      <a href="/elsewhere">
        A link
      </a>
      to another site.
    </p>
    <script id="wm-diff-script"><!-- clipped for brevity --></script>
  </body>
</html>

Not sure what we want the ideal markup to look like in this case, but we should be signaling the change in the markup somehow. Once we’ve solved that, we should also add an actual pass/fail “validity” test.

See further notes in the code about this here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/f899fa2ef088ef83398a65a05f04321ab659eda2/web_monitoring/html_diff_render.py#L444-L460

Mr0grog commented 5 years ago

Side note: we should also make sure to cover cases where text becomes a link if it wasn’t before. For example, “energy efficiency” in “Improving energy efficiency is” in this change: https://monitoring.envirodatagov.org/page/dabe8be9-ff11-415e-b632-03a846e79999/ccfd24af-0522-4a60-b59a-f865d67d3bfc..c6c0bd08-06c6-4e56-a33e-c1271efbcfb9

I think most solutions for this would do that, but I’m adding this as a note to make sure it doesn’t slip through the cracks here.