SuffolkLITLab / docassemble-InterviewStats

A docassemble extension.
MIT License

Make sure we can handle generating statistics with millions of rows #9

Closed: nonprofittechy closed this issue 2 years ago

BryceStevenWilley commented 2 years ago

Is this just general scaling issues, or any specific pain points that stand out?

nonprofittechy commented 2 years ago

General scaling questions. I don't think we'll ever exceed millions of rows, and I want to see whether we're likely to ever need background processing.
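For reference, if we ever do need it, docassemble's background task pattern looks roughly like the sketch below. The event names and `generate_stats()` are hypothetical placeholders, not anything in this repo:

```python
# Rough sketch of docassemble's background_action() pattern, in case stats
# generation ever needs to move off the request thread. Each commented
# section would live in its own `code` block in an interview; the event
# names and generate_stats() are hypothetical placeholders.
from docassemble.base.util import (background_action,
                                   background_response_action,
                                   action_argument)

# code block that kicks off the task:
#     the_task = background_action('compute_stats')

# code block for `event: compute_stats` (runs in a background worker):
#     background_response_action('save_stats', stats=generate_stats())

# code block for `event: save_stats` (runs against the interview answers):
#     computed_stats = action_argument('stats')
```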

BryceStevenWilley commented 2 years ago

Took a quick look at this while I was thinking about it, and started making some quick changes in https://github.com/SuffolkLITLab/docassemble-InterviewStats/commit/3dd43202fe16517a1ef4ac545613eb4384438559. However, making significant progress will take more than these quick fixes.

nonprofittechy commented 2 years ago

Maybe we can make a mirror of prod? I can get you the Postgres dump if you want. Right now prod struggles to display the stats for the CDC moratorium; I was trying to pull that one up as an example [for our technical slides] but it didn't work.

BryceStevenWilley commented 2 years ago

Haha, that's why I was taking another look at this; I remember running into trouble with it last month. A prod copy would be helpful, but seems really risky privacy-wise. I was thinking more of an interview that calls store_variables_snapshot a million times with fake/random data, left to run locally for however long that takes (it should only take an hour at most?). The real effort would be making the data similar to what's stored in the CDC moratorium.
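Something along these lines is what I have in mind. The field names are just guesses at the shape of the data, not the real schema, and this is meant to run from a `code` block inside a live interview (store_variables_snapshot needs an interview context):

```python
# Rough sketch of a load generator: call store_variables_snapshot()
# repeatedly with randomized fake data shaped roughly like the CDC
# moratorium rows. Field names are illustrative guesses, not the real
# schema. Call make_fake_snapshots() from a `code` block in an interview.
import random

from docassemble.base.util import store_variables_snapshot

def make_fake_snapshots(n=1_000_000):
    states = ['MA', 'NY', 'CA', 'TX']
    for _ in range(n):
        store_variables_snapshot(data={
            'zip': f'{random.randint(1, 99999):05d}',
            'state': random.choice(states),
            'reviewed': random.random() < 0.5,
        })
```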

nonprofittechy commented 2 years ago

That makes sense, and would have additional uses for future stress tests.

BryceStevenWilley commented 2 years ago

Some specific numbers: we're timing out somewhere between 3,000 and 5,000 rows. However, I can't actually tell what's taking so long: when I bring the row count down so that the map screen still takes ~20-30 seconds to load, the "show variables" button (which now just looks like </>) claims the page took less than a second to build. From the network tab, it doesn't look like Bokeh is what's making everything slow, since everything hangs on the initial GET call to the DA server. I'm going to need to get back into the guts of the server and start adding log messages everywhere, which will take more time than I expected.
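Rough plan for the instrumentation: just timing wrappers that write to the docassemble log. `get_stats` here is a stand-in for whatever turns out to be slow, not a real function in this package:

```python
# Minimal timing instrumentation: wrap suspect functions and write the
# elapsed time to the docassemble log. `get_stats` is a placeholder for
# the code under suspicion.
import time
from functools import wraps

from docassemble.base.util import log

def timed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            log(f"{fn.__name__} took {time.perf_counter() - start:.2f}s")
    return wrapper

@timed
def get_stats(filename):
    ...  # the slow part goes here
```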

nonprofittechy commented 2 years ago

I added some Pandas stuff to the help area, which might be to blame for at least some of the speed issues.
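For context, the worry is the usual pattern where the help text rebuilds a table from every stored row on each screen render, something like this (purely illustrative, not the actual code):

```python
# Hypothetical illustration of why help-area Pandas code can dominate load
# time: if the help text builds a table from every stored row, the full
# DataFrame construction and HTML rendering re-run on every page load.
import pandas as pd

def render_help_table(rows):
    # rows: one dict per stored snapshot; O(n) work on each screen render
    return pd.DataFrame(rows).to_html()
```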

BryceStevenWilley commented 2 years ago

I took out that help part, and we are extremely snappy on 47k rows now! Scaling up to see how much we can handle before timing out again.

Thanks for the tip @nonprofittechy, you saved me at least a day of deep diving.