edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")

Reduce how much data UI needs to load on pages with many versions #161

Closed · Mr0grog closed this issue 2 years ago

Mr0grog commented 3 years ago

Between the fact that we’ve now been running for a few years and that we pull in a lot from archive.org, loading some pages in the UI is incredibly expensive and slow. For example, https://monitoring.envirodatagov.org/page/178eea2a-9bce-4c41-83b8-a874adf3f09e/cce724f8-7d46-425f-a071-832edbfcc7ec..5fd3727f-b379-4392-92d3-f3497a2d1464 loads more than 36 MB of data from the API before it can render! This is an extreme example, but not the only one that’s big. This mostly applies to pages that get saved really frequently with Wayback’s Save Page Now feature.

This is clearly a big problem and needs coordination among multiple pieces to solve well. Some ideas:

- Load fewer fields for each version.
- Load fewer versions.
- Don’t worry about loading all versions in the UI before rendering.
- Maybe don’t load a page’s full history in the UI.

danielballan commented 3 years ago

Thinking through the savings available from each approach...

Some of these, like loading fewer fields for each version, will only reduce the load by a constant factor.

Loading fewer versions has more potential because you could reasonably log-scale it such that it works out to be "max one per day" for the last couple of months, then more like "max one per week", and then "max one per month" as the past recedes. This reminds me of a minor point in a Paul Ganssle talk. (He looks after Python's datetime module and timezone handling.) The point is: the further away in time something is, the less it matters exactly when it happened, for any practical purpose. How concerned are we that sub-sampling would cause us to hide patterns where something is changed and then changed back? If that's not a major concern, then I really like this option.
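To make that concrete, here's a minimal sketch of that kind of age-based sampling schedule. The cutoffs and bucket sizes below are arbitrary assumptions for illustration, not values either of us has proposed:

```typescript
type Bucket = "day" | "week" | "month";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Pick a sampling granularity from a version's age. The cutoffs here
// (60 days, 365 days) are assumptions purely for illustration.
function bucketFor(captureTime: Date, now: Date = new Date()): Bucket {
  const ageDays = (now.getTime() - captureTime.getTime()) / MS_PER_DAY;
  if (ageDays <= 60) return "day";   // recent past: max one per day
  if (ageDays <= 365) return "week"; // older: max one per week
  return "month";                    // distant past: max one per month
}

// Collapse a newest-first list of versions to at most one per period.
function subsample<T extends { captureTime: Date }>(versions: T[]): T[] {
  const kept = new Map<string, T>();
  for (const v of versions) {
    const bucket = bucketFor(v.captureTime);
    const key = `${bucket}:${periodKey(v.captureTime, bucket)}`;
    // Because the input is newest-first, the first hit per period is the newest.
    if (!kept.has(key)) kept.set(key, v);
  }
  return [...kept.values()];
}

function periodKey(t: Date, bucket: Bucket): string {
  if (bucket === "day") return t.toISOString().slice(0, 10);  // YYYY-MM-DD
  if (bucket === "month") return t.toISOString().slice(0, 7); // YYYY-MM
  return String(Math.floor(t.getTime() / (7 * MS_PER_DAY)));  // week index
}
```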

"Don’t worry about loading all versions in the UI before rendering" would, of course, scale arbitrarily well. I don't know the details of the UI code well enough to sense how big an effort this would be.

Finally, I agree with your comment that "Maybe don’t load a page’s full history in the UI" has a downside, so maybe that's one to do last, if at all.

Mr0grog commented 3 years ago

👍 Any thoughts on the specific formatting of that by-date response? Or on the route where it should live? For example:

Maybe with a querystring specifying the time period we group by: `?period=day|week|month|scaled`
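And here’s a strawman of what the grouped response itself might look like. All of the field names here, including `version_count`, are just sketches for discussion, not a settled API contract:

```typescript
// Strawman shape for a sampled/grouped versions response. Field names are
// assumptions for discussion, not the actual web-monitoring-db API.
interface SampledVersion {
  uuid: string;
  capture_time: string;   // ISO 8601 timestamp of the sampled version
  version_count: number;  // how many real versions this sample stands in for
}

interface SampledVersionsResponse {
  data: SampledVersion[];
}

const example: SampledVersionsResponse = {
  data: [
    {
      uuid: "5fd3727f-b379-4392-92d3-f3497a2d1464",
      capture_time: "2020-09-01T12:00:00Z",
      version_count: 14, // 13 other versions in this period are hidden
    },
  ],
};
```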

> you could reasonably log-scale it such that it works out to be "max one per day" for the last couple of months, then more like "max one per week", and then "max one per month" as the past recedes.

Oh, that is a neat idea! (Also: gonna have to find and watch that talk if there’s a recording.)

> How concerned are we that sub-sampling would cause us to hide patterns where something is changed and then changed back?

I don’t think that’s a huge concern. Most of the time, analysts try not to report on those kinds of changes, and avoiding surfacing them in analysts’ weekly sheets is one of the big reasons the task-sheets job does a complete analysis for each week rather than just summing up the version-by-version changes like we used to.

I also informally asked the analyst team on Monday whether they’d be concerned about only seeing one version per day in those dropdowns, and they thought it would be fine.

All that said, we should probably be careful to still make it possible to see other versions from a day if someone puts the right version ID in the UI’s URL, for example. We should just be making them harder to get to, not hiding them altogether. (This is also why I had the `version_count` field in my strawman response example above; we can at least tell someone that there are more versions that are hidden.)

> "Don’t worry about loading all versions in the UI before rendering" would, of course, scale arbitrarily well. I don't know the details of the UI code well enough to sense how big an effort this would be.

I haven’t really thought hard about how complex this might be. IIRC, the UI just loads all the data it needs and goes about its business; it doesn’t have any provision for loading just part of it and doing more in the background, so this is probably a medium-to-large-sized effort. OTOH, now that I’m thinking about it, "load fewer fields for each version" is actually probably the hardest. 😛 We’d need a way for the UI to know that it only has a less-detailed record for a given version, and to go off and request the full details when that version is selected (right now, the display of full details and the list of known versions for the dropdown come directly from the same place; it’s assumed that if we know about a version ID, we know all the info for that version ID).
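For illustration, that lazy-loading might look something like the sketch below. The route and field names are hypothetical, not the actual UI or API code:

```typescript
// Hypothetical: the dropdown holds lightweight summaries, and the UI only
// fetches full details when a version is actually selected.
interface VersionSummary {
  uuid: string;
  capture_time: string;
}

interface VersionDetails extends VersionSummary {
  body_url: string;                         // assumed detail-only fields
  source_metadata: Record<string, unknown>;
}

const detailsCache = new Map<string, Promise<VersionDetails>>();

// Fetch full details on first selection and cache the promise after that.
function getVersionDetails(summary: VersionSummary): Promise<VersionDetails> {
  let details = detailsCache.get(summary.uuid);
  if (!details) {
    details = fetch(`/api/v0/versions/${summary.uuid}`) // route is an assumption
      .then((response) => response.json())
      .then((body) => body.data as VersionDetails);
    detailsCache.set(summary.uuid, details);
  }
  return details;
}
```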

danielballan commented 3 years ago

Notes from call: I suggested `/versions/sampled` or `/versions/subsampled` or `/versions/decimated`, with time-based pagination and some hard-coded (log?) scaling. A tunable parameter to change the scaling could always be added later.
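Roughly, the time-based pagination part could work like this; the route, parameter, and link format are illustrative assumptions, not the implemented API:

```typescript
// Illustrative time-based pagination for a sampled-versions route.
interface PaginatedResponse<T> {
  data: T[];
  links: {
    // Next page continues just before the oldest item returned here, e.g.
    // "/versions/sampled?capture_time=..2020-06-01T00:00:00Z" (assumed format).
    next: string | null;
  };
}

// Walk every page by following `links.next` until the range is exhausted.
async function* allSampledVersions<T>(firstUrl: string): AsyncGenerator<T> {
  let url: string | null = firstUrl;
  while (url) {
    const response = await fetch(url);
    const page: PaginatedResponse<T> = await response.json();
    yield* page.data;
    url = page.links.next;
  }
}
```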

Mr0grog commented 2 years ago

Only 3/4 of a year later, I started on this: https://github.com/edgi-govdata-archiving/web-monitoring-db/pull/946

Mr0grog commented 2 years ago

And now I finally got around to implementing the front-end part of it in edgi-govdata-archiving/web-monitoring-ui#956. The backend wound up needing a second pass, too. See edgi-govdata-archiving/web-monitoring-db#992.

Mr0grog commented 2 years ago

Marking this as effectively done for now.