edgi-govdata-archiving / web-monitoring-task-sheets

Experimental new tool for generating weekly analyst task sheets for web monitoring
GNU General Public License v3.0

Consider an absolute # characters changed threshold for textual changes #3

Open Mr0grog opened 4 years ago

Mr0grog commented 4 years ago

We should consider factoring the absolute number of changed characters or words into how textual changes contribute to priority. On extremely large pages, even a large change (which is worth looking at) can seem small percentage-wise. For example, only 1.1% of the text here changed, but that’s still 1,785 characters!

https://monitoring.envirodatagov.org/page/6767f063-29f7-4c50-93d0-b851d0292c98/4da08f36-ab67-463d-8517-cf191857dc02..0eae6081-9fac-4f00-b914-f19c0218e7fe

Currently, we only look at the percentage changed: https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/54a6759da80127305d891250a31fa0d2531cc203/analyst_sheets/analyze.py#L324-L325
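As a rough illustration of the idea (not the code in `analyze.py`), here is a minimal sketch of one way an absolute count could be blended with the percentage. The function name `text_change_score`, the `absolute_threshold` parameter, and its default of 1,000 characters are all hypothetical, and the page length of about 162,000 characters is only inferred from 1,785 / 1.1%.

```python
def text_change_score(changed_chars: int, total_chars: int,
                      absolute_threshold: int = 1000) -> float:
    """Score a textual change from 0.0 to 1.0.

    Hypothetical sketch: takes whichever is larger, the fraction of the
    page that changed or the changed character count relative to an
    assumed absolute threshold.
    """
    if total_chars <= 0 or changed_chars <= 0:
        return 0.0
    # Fraction of the page that changed, capped at 1.0.
    relative = min(1.0, changed_chars / total_chars)
    # Changed characters measured against the absolute threshold, capped at 1.0.
    absolute = min(1.0, changed_chars / absolute_threshold)
    return max(relative, absolute)


# The example above: 1,785 changed characters on a very large page
# (page length is an assumption inferred from the 1.1% figure).
big_page = text_change_score(1785, 162_000)
# A small edit on a small page is still driven by the percentage.
small_page = text_change_score(50, 1000)
```

With these assumed numbers, the large-page change is no longer drowned out by the page length, while small pages behave as they do with a pure percentage.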

Mr0grog commented 4 years ago

Maybe the easiest way to do this is to put a ceiling on how many characters of a page we’ll consider, e.g. pretend a page can never be longer than 5,000 (?) characters. That way, the example change above would have counted as 35.7% changed (1,785 / 5,000) rather than 1.1% changed.
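A minimal sketch of what the ceiling might look like, assuming we already have the changed and total character counts. The name `capped_change_ratio` and the `max_length` parameter are hypothetical, 5,000 is just the tentative value floated above, and the 162,000-character page length is inferred from the 1.1% figure rather than measured.

```python
def capped_change_ratio(changed_chars: int, total_chars: int,
                        max_length: int = 5000) -> float:
    """Fraction of text changed, pretending the page is never longer
    than `max_length` characters (hypothetical helper, not existing code).
    """
    if total_chars <= 0:
        return 0.0
    # Treat very long pages as if they were only `max_length` characters.
    effective_length = min(total_chars, max_length)
    # A change bigger than the ceiling still counts as at most 100%.
    return min(1.0, changed_chars / effective_length)


# The example change: 1,785 characters on a page assumed to be ~162,000 long.
ratio = capped_change_ratio(1785, 162_000)
print(f"{ratio:.1%}")
```

Pages shorter than the ceiling are unaffected, so this only changes the result for the very large pages that motivated the issue.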