Closed: Mr0grog closed this 4 years ago
Thinking about this more, I think we should just get rid of the factor for now. What we really need is more nuanced diffing that counts how many versions had meaningfully different content (ignoring things like postback data fields, hashes on the end of JS source URLs, etc.) and uses that count to determine whether changes were extra churn-y. Without that, it’s better not to try to look at the number of versions at all.
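As a rough illustration of what "counting meaningfully different versions" could look like, here is a minimal sketch. Everything in it is an assumption for discussion, not existing code: the function names `normalize` and `count_meaningful_versions` are hypothetical, the postback fields are assumed to be ASP.NET-style hidden inputs, and the "hashes" on JS URLs are assumed to live in the query string. Regexes are used only to keep the example short; a real version would probably want an HTML parser.

```python
import hashlib
import re

# Hypothetical pattern: blank out the value of ASP.NET-style postback fields.
# Assumes the name attribute comes before the value attribute.
POSTBACK_FIELD = re.compile(
    r'(<input\b[^>]*\bname="__(?:VIEWSTATE|EVENTVALIDATION)"[^>]*\bvalue=")[^"]*(")',
    re.IGNORECASE,
)

# Hypothetical pattern: drop the query string (e.g. a cache-busting hash)
# from script src URLs.
SCRIPT_QUERY = re.compile(
    r'(<script\b[^>]*\bsrc="[^"?]*)\?[^"]*(")',
    re.IGNORECASE,
)


def normalize(html):
    """Strip content that changes between requests without being meaningful."""
    html = POSTBACK_FIELD.sub(r'\1\2', html)
    html = SCRIPT_QUERY.sub(r'\1\2', html)
    return html


def count_meaningful_versions(bodies):
    """Count versions whose normalized content is distinct."""
    hashes = {
        hashlib.sha256(normalize(body).encode('utf-8')).hexdigest()
        for body in bodies
    }
    return len(hashes)
```

With something like this, two snapshots that differ only in their `__VIEWSTATE` value or in a script's `?v=...` suffix would count as one version, so the number of Wayback snapshots alone would no longer drive the result.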
We currently calculate a `change_count_factor` that we apply to the priority. It tries to bump up the priority when there were multiple changes during the week and reduce it when there were a huge number of changes.

Having dealt with actual data for a little while now, it seems clear this doesn’t really work that well. It’s too influenced by the idiosyncrasies of how many snapshots Wayback may have made of a page and whether the page is constructed in a way that makes it different on every request (e.g. having some unique-to-each-visitor data on it). We should either: