Open Mr0grog opened 4 years ago
We should consider factoring the absolute number of changed characters or words into how textual changes contribute to priority. On extremely large pages, even a large change (which is worth looking at) can seem small percentage-wise. For example, only 1.1% of the text here changed, but that’s still 1,785 characters!
https://monitoring.envirodatagov.org/page/6767f063-29f7-4c50-93d0-b851d0292c98/4da08f36-ab67-463d-8517-cf191857dc02..0eae6081-9fac-4f00-b914-f19c0218e7fe
Currently, we only look at the percentage changed: https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/54a6759da80127305d891250a31fa0d2531cc203/analyst_sheets/analyze.py#L324-L325
Maybe the easiest way to do this is to put a ceiling on how many characters of a page we’ll consider, e.g. pretend a page can never be longer than 5,000 (?) characters. That way, the example change above would have equated to 35.7% changed rather than 1.1% changed.
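
A rough sketch of the ceiling idea from this issue, to make the arithmetic concrete. This is not the code in `analyze.py`: the function name `capped_change_ratio`, its parameters, and the page length of 162,000 characters (back-calculated from 1,785 characters being about 1.1%) are all assumptions for illustration.

```python
def capped_change_ratio(changed_chars, total_chars, ceiling=5000):
    """Fraction of text changed, treating the page as at most `ceiling` chars.

    Hypothetical helper: `ceiling=5000` is the tentative value floated in
    this issue, not a tuned constant.
    """
    if total_chars <= 0:
        return 0.0
    # Pretend the page is never longer than the ceiling.
    effective_length = min(total_chars, ceiling)
    # Clamp so that changes larger than the ceiling count as 100% changed.
    return min(changed_chars / effective_length, 1.0)


# Assumed page length, approximated from the 1.1% figure quoted in the issue.
page_length = 162_000
changed = 1_785

raw_ratio = changed / page_length
capped_ratio = capped_change_ratio(changed, page_length)
print(f"raw: {raw_ratio:.1%}, capped: {capped_ratio:.1%}")
```

One side effect worth noting: with a ceiling, any change of 5,000 characters or more saturates at 100%, so very large changes on very large pages would no longer be distinguishable from each other by this number alone.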