Thinking through the savings available from each approach...
These will reduce the load by a constant factor: compressing API responses and loading fewer fields for each version.
Loading fewer versions has more potential because you could reasonably log-scale it such that it works out to be "max one per day" for the last couple months and then more like "max one per week" and then "max one per month" as the past recedes. This reminds me of a minor point in a Paul Ganssle talk. (He looks after Python's datetime module and timezone handling.) The point is: the further away something is in time, the less its exact timing matters for any practical purpose. How concerned are we that sub-sampling would cause us to hide patterns where something is changed and then changed back? If that's not a major concern, then I really like this option.
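To make the log-scale idea concrete, here's a rough sketch (assuming each version record carries a capture_time timestamp; the thresholds are placeholders, not a proposal for the actual cut-offs):

```typescript
// Rough sketch of the log-scale idea: the older a version is, the coarser the
// sampling bucket it falls into. The thresholds here are arbitrary placeholders.
const DAY_MS = 24 * 60 * 60 * 1000;

function bucketSizeFor(captureTime: Date, now: Date = new Date()): number {
  const ageInDays = (now.getTime() - captureTime.getTime()) / DAY_MS;
  if (ageInDays <= 60) return DAY_MS;       // last couple of months: max one per day
  if (ageInDays <= 365) return 7 * DAY_MS;  // up to a year back: max one per week
  return 30 * DAY_MS;                       // older: max one per month
}

// Keep only the first version seen in each bucket (the newest, if the input is
// sorted newest-first).
function subsample<T extends { capture_time: string }>(versions: T[]): T[] {
  const kept = new Map<string, T>();
  for (const version of versions) {
    const time = new Date(version.capture_time);
    const size = bucketSizeFor(time);
    const key = `${size}:${Math.floor(time.getTime() / size)}`;
    if (!kept.has(key)) kept.set(key, version);
  }
  return [...kept.values()];
}
```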
"Don’t worry about loading all versions in the UI before rendering" would of course scale to arbitrarily large version histories. I don't know the details of the UI code well enough to gauge how big an effort this would be.
Finally, I agree with your comment that "Maybe don't load a page’s full history in the UI" has a downside, so maybe that's one to do last, if at all.
👍 Any thoughts on the specific formatting of that by-date response? Or on the route where it should live? For example:
/versions/byTime
/versions/summary
/versions/summarize
/versions/digest
Maybe with a querystring specifying the time period we group by: ?period=day|week|month|scaled
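Whatever name wins, the client side could stay a single query. A sketch, assuming the route and period values above and a pages/:page_id/versions-style prefix (none of this is an implemented API):

```typescript
// Sketch only: the route name and `period` values come from the options above
// and are not an implemented API; the path prefix is an assumption.
type SummaryPeriod = 'day' | 'week' | 'month' | 'scaled';

async function fetchVersionSummary(pageId: string, period: SummaryPeriod) {
  const url = `/api/v0/pages/${pageId}/versions/summary?period=${period}`;
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Summary request failed with HTTP ${response.status}`);
  }
  return response.json();
}
```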
you could reasonably log-scale it such that it works out to be "max one per day" for the last couple months and then more like "max one per week" and then "max one per month" as the past recedes.
Oh, that is a neat idea! (Also: gonna have to find and watch that talk if there’s a recording.)
How concerned are we that sub-sampling would cause us to hide patterns where something is changed and then changed back?
I don’t think that’s a huge concern. Most of the time, analysts try not to report on those kinds of changes, and avoiding surfacing them in analysts’ weekly sheets is one of the big reasons the task-sheets job does a complete analysis for each week rather than just summing up the version-by-version changes like we used to.
I also lightly asked the analyst team Monday whether they’d be concerned about only seeing one version per day in those dropdowns, and they thought it would be fine.
All that said, we should probably be careful to still make it possible to see other versions from a day if someone puts the right version ID in the UI’s URL, for example. We should just be making it harder to get to, not hiding it altogether. (That’s also why I had the version_count field in my strawman response example; we can at least tell someone that there are more versions that are hidden.)
"Don’t worry about loading all versions in the UI before rendering" would of course scale to arbitrarily large version histories. I don't know the details of the UI code well enough to gauge how big an effort this would be.
I haven’t really thought hard about how complex this might be. IIRC, the UI just loads all the data it needs and goes about its business; it doesn’t have any provision for loading only part of it and fetching more in the background, so this is probably medium-to-large-sized. OTOH, now that I’m thinking about it, "load fewer fields for each version" is actually probably the hardest. 😛 We’d have to have a way for the UI to know that it only has a less-detailed record for a given version, and to go off and request the full details when the version is selected (right now, the display of full details and the list of known versions for the dropdown come directly from the same place; it’s assumed that if we know about a version ID, we know all the info for that version ID).
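For what it’s worth, a very rough sketch of what that would mean for the UI; all the names here are hypothetical, not the actual UI code:

```typescript
// Hypothetical sketch: the dropdown keeps lightweight summaries and the full
// record is only fetched when a version is actually selected.
interface VersionSummary {
  uuid: string;
  capture_time: string;
  status: number | null;
}

// Cache full records so re-selecting a version doesn't refetch it.
const fullVersionCache = new Map<string, unknown>();

async function getFullVersion(summary: VersionSummary): Promise<unknown> {
  const cached = fullVersionCache.get(summary.uuid);
  if (cached) return cached;

  // This route is an assumption for illustration, not the UI's real data layer.
  const response = await fetch(`/api/v0/versions/${summary.uuid}`);
  const body = await response.json();
  const record = body.data ?? body;  // tolerate either a wrapped or bare record
  fullVersionCache.set(summary.uuid, record);
  return record;
}
```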
Notes from call: I suggested /versions/sampled, /versions/subsampled, or /versions/decimated, with time-based pagination and some hard-coded (log?) scaling. A tunable parameter to change the scaling could always be added later.
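To illustrate the time-based pagination part (the route, chunk size, and capture_time range filter here are illustrative assumptions, not a settled design):

```typescript
// Illustrative only: walk a page's history in fixed time chunks rather than
// offset-based pages, so each request covers a predictable slice of time.
const DAY_MS = 24 * 60 * 60 * 1000;

async function* sampledVersions(pageId: string, from: Date, to: Date, chunkDays = 90) {
  let start = from.getTime();
  while (start < to.getTime()) {
    const end = Math.min(start + chunkDays * DAY_MS, to.getTime());
    // `sampled` and the capture_time range filter are assumptions taken from
    // the naming discussion above, not a confirmed API.
    const url = `/api/v0/pages/${pageId}/versions/sampled` +
      `?capture_time=${new Date(start).toISOString()}..${new Date(end).toISOString()}`;
    const response = await fetch(url);
    const { data } = await response.json();
    yield* data;
    start = end;
  }
}
```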
Only 3/4 of a year later, I started on this: https://github.com/edgi-govdata-archiving/web-monitoring-db/pull/946
And now I finally got around to implementing the front-end part of it in edgi-govdata-archiving/web-monitoring-ui#956. The backend wound up needing a second pass, too. See edgi-govdata-archiving/web-monitoring-db#992.
Marking this as effectively done for now.
Between the fact that we’ve now been running for a few years and the fact that we pull in a lot from archive.org, loading some pages in the UI is incredibly expensive and slow. For example, https://monitoring.envirodatagov.org/page/178eea2a-9bce-4c41-83b8-a874adf3f09e/cce724f8-7d46-425f-a071-832edbfcc7ec..5fd3727f-b379-4392-92d3-f3497a2d1464 loads more than 36 MB of data from the API before it can render! This is an extreme example, but not the only one that’s big. This mostly applies to pages that get saved really frequently with Wayback’s Save Page Now feature.
This is clearly a big problem and needs coordination among multiple pieces to solve well. Some ideas:
Compress API responses (should be a no-brainer: edgi-govdata-archiving/web-monitoring-db#857).
Don’t worry about loading all versions in the UI before rendering; just grab the requested ones, or the latest two. Get the rest in the background.
Maybe don't load a page’s full history in the UI (although that means a user can’t select all the known versions for comparison, which is not good).
Load fewer fields for each version (e.g. just dates and status codes).
Load fewer versions, e.g. max 1 version per day, so if a page had 8 versions on a given day, we only show one of them in the response. For example, the versions list response might not be a list of versions, but a list of dates like so:
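(Rough strawman; aside from version_count, the field names and values here are purely illustrative.)

```typescript
// Rough strawman of a by-date response. `version_count` lets the UI say "there
// are more versions on this day than we're showing." All values are made up.
interface DailyVersionSummary {
  date: string;            // the day being summarized
  version_count: number;   // how many versions were actually captured that day
  version: {               // the single version surfaced for that day
    uuid: string;
    capture_time: string;
    status: number | null;
  };
}

const exampleResponse: { data: DailyVersionSummary[] } = {
  data: [
    {
      date: '2021-06-02',
      version_count: 8,
      version: {
        uuid: '00000000-0000-0000-0000-000000000000',
        capture_time: '2021-06-02T14:05:00Z',
        status: 200
      }
    }
  ]
};
```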