edgi-govdata-archiving / web-monitoring-diff

Tools for diffing and comparing web content. Also includes a web server that makes diffs available as an HTTP service.
https://web-monitoring-diff.readthedocs.io/
GNU General Public License v3.0
11 stars 4 forks source link

Try to identify situations where parts of a page have *moved* #16

Open Mr0grog opened 6 years ago

Mr0grog commented 6 years ago

One situation we see that can be confusing for analysts is when a portion of the page has moved by swapping locations with another (maybe two paragraphs get reversed, maybe the navigation moves from the bottom to the top of the page, etc). Are there any non-expensive heuristics we can use to identify this? I think it matters most in the html_token and text diffs.

My simpleton idea here is to do the diff, then:

  1. Look for insertions or deletions that have a lot of tokens (so only look at reasonably large changes) — maybe 50+ tokens (I pulled that number out of thin air, to be clear)

  2. Check whether there is a corresponding deletion or insertion with the same set of tokens

    Note: in the HTML diff, we should do this before calling assemble_diff so that the tokens we are comparing are the words, not words + tags. I think we’ll be significantly more likely to fail matching once tags are added back into the token stream.

  3. Classify those matching pairs as moved (instead of inserted or deleted) (not sure how best to show this visually).

  4. (bonus!) based on whether the insertion half of the pair is earlier or later in the token stream, classify the direction of the move.

  5. (bonus!) assign each pair an ID so in the unified diff view, we can actually draw a line connecting them (but we’ll leave actually drawing that line for later — and it probably belongs in the UI project). Not sure how feasible it will be to connect them in the side-by-side views.

Some examples:

Mr0grog commented 6 years ago

First: we need to gather up some concrete examples of this situation.

Mr0grog commented 6 years ago

Alternative thought for identifying moves with changes inside them: do a simhash on each large token-chunk and consider them equal if their hashes are close enough (how close? idunno! ¯\_(ツ)_/¯). Then re-diff those token streams to identify the insertions and deletions inside them (kind of like the two-level diffing we now do with the links diff).

Mr0grog commented 6 years ago

Example of a move with a change inside it, as noted above: https://monitoring.envirodatagov.org/page/8db5264c-715b-4a0e-8c5a-7036f91e8b15/bc56f6ae-fa86-4e7a-baa1-5142a703ed1d..7ed11cad-1553-47c0-9f07-b07b2d1fe03f

Mr0grog commented 6 years ago

Another with a change inside: https://monitoring.envirodatagov.org/page/5d79428e-6abe-4090-b81c-ed503510b1b1/3ba94e1f-4585-4739-b53b-4053a767af0b..8ee14e56-3dce-4729-a5ac-bdb853dd0623

Mr0grog commented 6 years ago

See also edgi-govdata-archiving/web-monitoring#146 about zhang-shasha edit distance. Would be good to look at the performance of a number of edit distance/similarity algorithms (I suggested simhash above, but no idea how expensive that may be) and pick (for this use) whatever is fastest.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

Mr0grog commented 5 years ago

Some good examples of chunking and moving on this page: https://monitoring.envirodatagov.org/page/1f0bd347-60c1-47cf-9dac-cc4f86345e43/4728ed96-cfdd-471b-a380-139beeccbd63..99c12d5e-d535-40e3-a46b-a25ab588923c