Open Mr0grog opened 6 years ago
Side note: we should also make sure to cover cases where text becomes a link if it wasn’t before. For example, “energy efficiency” in “Improving energy efficiency is” in this change: https://monitoring.envirodatagov.org/page/dabe8be9-ff11-415e-b632-03a846e79999/ccfd24af-0522-4a60-b59a-f865d67d3bfc..c6c0bd08-06c6-4e56-a33e-c1271efbcfb9
I think most solutions for this would do that, but I’m adding this as a note to make sure it doesn’t slip through the cracks here.
The
html_token
diff (based on lxml.html.diff) used to pick up changes to a link’shref
attribute by virtue of sticking the text “Link: [url]” after the link’s main text. This got really rough and noisy in practice, though, because many links are images (so the text broke the layout) or because URLs are extremely long and unbreakable (making layout nuts, especially when the link is a heading, which is very common), so we removed it by request of the analyst team. That left us with a single blank space getting highlighted in its stead, which was confusing as all get out (it also happened in other situations besides links), so we removed that, too, and now you can’t tell visually when a link’s URL has changed at all.The
change_href
test case demonstrates this really well. Output looks like:Not sure what we want the ideal markup to look like in this case, but we should be signaling the change in the markup somehow. Once we’ve solved that, we should also add an actual pass/fail “validity” test.
See further notes in the code about this here: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/f899fa2ef088ef83398a65a05f04321ab659eda2/web_monitoring/html_diff_render.py#L444-L460