edgi-govdata-archiving / web-monitoring-task-sheets

Experimental new tool for generating weekly analyst task sheets for web monitoring
GNU General Public License v3.0

Don’t consider N changes any more important than 1 #2

Closed: Mr0grog closed this issue 4 years ago

Mr0grog commented 4 years ago

We currently calculate a change_count_factor that we apply to the priority. It tries to bump up the priority when there were multiple changes during the week and reduce it when there were a huge number of changes.

Having dealt with actual data for a little while now, it seems clear this doesn’t really work that well. It’s too influenced by the idiosyncrasies of how many snapshots Wayback may have made of a page, and by whether the page is constructed in a way that makes it different on every request (e.g. having some unique-to-each-visitor data on it). We should either:

  1. Stop distinguishing between 1 change and the current “sweet spot” rate that we prioritize most highly (0.8 changes/day), or
  2. Drop this factor entirely.
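For context, the factor described above might look something like the sketch below. The 0.8 changes/day sweet spot is from the discussion; the function name, linear shape, and damping curve are assumptions for illustration, not the project’s actual implementation.

```python
def change_count_factor(changes: int, days: float = 7.0,
                        sweet_spot: float = 0.8) -> float:
    """Hypothetical sketch: boost priority toward a sweet-spot change
    rate and damp it for extreme churn.

    Only the 0.8 changes/day sweet spot comes from the issue text;
    the curve shape here is an assumption.
    """
    rate = changes / days
    if rate <= sweet_spot:
        # Below the sweet spot, scale up linearly toward 1.0.
        return rate / sweet_spot
    # Past the sweet spot, damp the factor so very churn-y pages
    # (e.g. per-visitor content) don't dominate the task sheet.
    return sweet_spot / rate

# Example: multiply a base priority by the factor.
priority = 0.5 * change_count_factor(6)
```

The problem the issue describes falls out of any curve like this: the input (`changes`) is really a count of Wayback snapshots, not of meaningful changes, so the factor ends up measuring crawl frequency and page volatility rather than importance.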
Mr0grog commented 4 years ago

Thinking about this more, I think we should just get rid of the factor for now. What we really need is more nuanced diffing that counts how many versions had meaningfully different content (ignoring things like postback data fields, hashes on the end of JS source URLs, etc.) and uses that to determine whether the changes were extra churn-y. Without that, it’s better not to try to read anything into the number of versions.
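The “meaningfully different content” idea could be sketched roughly like this. The normalization rules shown (cache-busting hashes on JS URLs, ASP.NET-style postback fields) are hypothetical examples of the churn sources mentioned above, not the project’s real diffing logic.

```python
import re

def normalize(html: str) -> str:
    """Hypothetical sketch: strip known sources of meaningless churn
    before comparing versions. The specific patterns are assumptions."""
    # Cache-busting hashes appended to JS source URLs, e.g. app.js?v=0abc12
    html = re.sub(r'(\.js)\?v=[0-9a-f]+', r'\1', html)
    # ASP.NET-style postback state fields, which change on every request
    html = re.sub(r'name="__VIEWSTATE" value="[^"]*"',
                  'name="__VIEWSTATE"', html)
    return html

def meaningful_version_count(snapshots: list[str]) -> int:
    """Count versions that differ after normalization."""
    return len({normalize(s) for s in snapshots})
```

With something like this, a page that Wayback snapshotted 40 times but that only changed its JS cache-buster would count as one version, which is the kind of signal the raw snapshot count can’t provide.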