civiform / civiform

CiviForm simplifies the application process for government benefits programs by re-using applicant data for multiple benefits applications. It's being developed by Google.org and Exygy, in collaboration with the City of Seattle and community contributors.
https://civiform.us
Apache License 2.0
98 stars 61 forks source link

TDD CSV/JSON Export Needs Significant Perf and Error Improvement #8147

Closed gwendolyngoetz closed 6 days ago

gwendolyngoetz commented 2 months ago

Update: This TDD compiles potential performance improvements. Not all ideas in the TDD have corresponding issues: https://docs.google.com/document/d/1jsGEv1m9Ky2mbEbvTyZUbplFaPEiwNxopWPjm_XaRXs/edit?tab=t.0

The exports (CSV / JSON) take a long time to run when programs have large amounts of data. This results in those program exports failing, typically with an HTTP 504 cause by the Fargate Load Balancer reaching it's 60 second timeout.

As a temporary stop gap, I've increased Seattle's from 60 seconds to 120 seconds. This is not ideal, nor a long term solution.

This problem has been kicked down the road for a few years now. One of Bion's suggested approaches to go with an async approch. Allow admins to submit a request for an export and they'd get emailed a link to download the file upon generation.

Perhaps that approach. Perhaps we need to rewrite the code generating the export. No matter the approach we have got to do something to deal with the growing outage of the export feature.

Non-exhaustive search of past timeout issues

Non-exhaustive search of past breakages of the export

PRs not attached to other issues

Issues to come out of this TDD and research:

nb1701 commented 2 months ago

See https://github.com/civiform/civiform/issues/8145 for details on how it manifested in Seattle.

I have a local reproduction, using the program dump from DB, and then generating 2000 applications to the program noted in the other issue.

avaleske commented 2 months ago

2000 is... not a lot. The other issue mentioned v2.4.2 specifically, but this is an issue with any program that has a lot of applications to export, right?

gwendolyngoetz commented 2 months ago

Maybe applications + many columns.

nb1701 commented 2 months ago

Yeah, and maybe some visibility conditions/repeated question complexity. Not sure how much that plays a part here, but I think it is doing those calculations for every application.

avaleske commented 2 months ago

Does this still happen if you use the paginated export?

There's a lot of work we're doing and then throwing away, just cause it was easier to leverage existing access methods and data structures instead of making more specific ones.

For example, in addDataToBuilder(), called by getSummaryDataAllQuestions(), https://github.com/civiform/civiform/blob/85c1885d6756daffae8f7dd8c8d67dfd70170276/server/app/services/applicant/ReadOnlyApplicantProgramServiceImpl.java#L332 we're creating a whole AnswerData, when we don't use all of it.

I have theories about what's slow, but it'd be worth doing some profiling to see where we should focus our optimization effort and if it's the same for JSON vs CSV

nb1701 commented 2 months ago

I have a DB dump with a reproduction, and JFR traces, but GitHub won't let me attach them. I'll share them over slack.

Parallelizing the getSummaryDataOnlyActive function helps, but doesn't really address the root cause.

gwendolyngoetz commented 2 months ago

I think zipping them or changing them to a .txt extension will let you attach them. Unless it's a file size issue.

avaleske commented 1 month ago

I was able to repro this with 2000 applications to Comprehensive Sample Program:

So it's likely something in ReadOnlyApplicantProgramService.getSummaryDataAllQuestions(), but I haven't timed it separately from the logic that builds the final JSON to export, yet.

nb1701 commented 1 month ago

When I was looking at this before, both the getBlocks and addDataToBuilder inside that function were slow. However, getBlocks was maybe slow because I was looking at getSummaryDataOnlyActive and that one includes calculating visibility on each block.

In the latter function, I believe it is eligibility calculation and perhaps getting the last updated time that were slow, but I don't remember for sure.

avaleske commented 1 month ago

From where I was before, commenting out everything in JsonExporterService that actually builds the json to export (see this gist) cuts the time in half again to ~5-6 seconds.

Roughly, then, it takes:

I wonder if it's a combination of the GC thrashing cause we have a lot of stuff in memory and jayway/jsonpath being slow? I noticed a lot of time in your profiler output is spent there, which makes sense but could also be a source of slowness: image

Also need to look into the synchronization stuff in that screenshot. This code isn't parallelized so I'm not sure what it's doing there.

nb1701 commented 1 month ago

It does, but I tried setting the heap to 8GB and it does far less thrashing (but still some), but is still fairly slow.

nb1701 commented 1 month ago

With 8GB heap

image
nb1701 commented 1 month ago

Versus our usual default 1GB heap

image
avaleske commented 6 days ago

There is still performance work to do but I've created new issues to track it. Now that the TDD is done I'm closing this.