edgi-govdata-archiving / web-monitoring-processing

Tools for access, "diff"-ing, and analyzing archived web pages
https://edgi-govdata-archiving.github.io/web-monitoring-processing
GNU General Public License v3.0

Create an analyzer that checks for simple, ignorable non-text changes #175

Open Mr0grog opened 6 years ago

Mr0grog commented 6 years ago

As a first test of all the things needed to automatically rate a change’s significance and priority, let’s start with something simple that looks for changes that we can pretty confidently say aren’t meaningful:

Example: https://monitoring.envirodatagov.org/page/b2b0b8cb-5e9b-4178-91c0-b8cb4466d2bd/b76dd1ab-a7aa-41d6-89f3-c45117a80dc5..2b55beed-db97-4249-b30a-600f61d94eb5

This is an easy analysis to do (and covers a lot of the kinds of changes I think we see), so it’s a good way to make sure we’ve built out:

Mr0grog commented 6 years ago

At this week’s analyst meeting, CAPTCHAs came up as another constantly changing thing that is hopefully easy to identify.

Also:

More far out:

We should probably turn this issue into an umbrella/epic issue for all these different ideas and pieces of work.

Mr0grog commented 6 years ago

From some BLM examples @jschell42 sent me:

There’s definitely an interesting thing here I wasn’t thinking about before… we could make a big split in prioritization based simply on textual (+ images and such) content changes. I can see some super-useful annotation data we could display for analysts (especially in their sheets) like:


Some diffs for examples:

Mr0grog commented 5 years ago

Another example of something that should really be totally ignored: https://monitoring.envirodatagov.org/page/c4328d30-cada-452f-8642-4bff721f5fc2/9a448c37-9285-4107-9ffd-ea72214561a4..a8fab661-07bb-4409-92f7-f73deadf4e29 (change to class attribute)
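
For illustration, here is a minimal sketch of how a check for this kind of change (markup differs, visible text does not) might look. It is not code from this repo: `VisibleTextExtractor`, `visible_text`, and `is_non_text_change` are hypothetical names, and it only uses Python's standard-library `html.parser`. It assumes that "ignorable" means the text outside `<script>` and `<style>` is identical after whitespace normalization, which is almost certainly too simple for real pages.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes (class, id, style, ...) are deliberately ignored.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the page's text with runs of whitespace collapsed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes directly so that adding or removing a wrapper tag
    # inside a word does not register as a text change.
    return " ".join("".join(parser.chunks).split())


def is_non_text_change(old_html, new_html):
    """True when the markup differs but the extracted text is the same."""
    return old_html != new_html and visible_text(old_html) == visible_text(new_html)
```

With this sketch, two versions that differ only in a `class` attribute would be flagged as a non-text change, while any edit to the wording would not. It says nothing about images, links, or other attributes that might matter to analysts; those would need their own handling.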