fecgov / fecfile-web-api

Back-end API for FECfile application

SPIKE - Investigate cpu spiking on restart of fecfile-web-api and fecfile-web-services #861

Open dheitzer opened 2 months ago

dheitzer commented 2 months ago

While investigating ticket #807 (Celery CPU usage), we noticed that CPU usage for both the fecfile-web-api and fecfile-web-services containers spikes on startup/restart. The [Kibana 'App Metrics' dashboard](https://logs.fr.cloud.gov/app/dashboards#/view/App-Metrics) can be used to view resource utilization and can be filtered to specific environments (by default it aggregates all environments).

`cf app fecfile-web-services` shows a snapshot of resource utilization on the VM.

`cf ssh fecfile-web-services -c "ps -eo pcpu,pid,user,args" | tail -n +2 | sort -k1 -r -n | head -10` sorts processes by CPU utilization.

`cf cpu-entitlement fecfile-web-services` shows our entitlement utilization (based on our memory allocation from cloud.gov).
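For anyone who wants to post-process saved `ps -eo pcpu,pid,user,args` output instead of piping it through `tail`, `sort`, and `head`, here is a minimal Python sketch of the same sort. The function name `top_cpu_processes` and the sample text are made up for illustration. The sample is not output captured from our containers.

```python
from typing import List, Tuple

ProcessRow = Tuple[float, int, str, str]  # (%CPU, PID, user, command line)


def top_cpu_processes(ps_output: str, limit: int = 10) -> List[ProcessRow]:
    """Parse `ps -eo pcpu,pid,user,args` text and return the busiest processes first."""
    rows: List[ProcessRow] = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)  # the args column may itself contain spaces
        if len(parts) < 4:
            continue  # ignore malformed or empty lines
        pcpu, pid, user, args = parts
        rows.append((float(pcpu), int(pid), user, args.strip()))
    rows.sort(key=lambda row: row[0], reverse=True)  # highest %CPU first
    return rows[:limit]


# Made-up sample for illustration only; not captured from a real container.
SAMPLE = """\
%CPU   PID USER     COMMAND
 0.3     7 vcap     example-shell run.sh
97.5    42 vcap     example-task --rebuild
 1.2    13 vcap     example-server
"""

for pcpu, pid, user, args in top_cpu_processes(SAMPLE, limit=2):
    print(f"{pcpu:5.1f} {pid:6d} {user} {args}")
```

Saving a few of these snapshots during a restart would make it easier to compare which process is at the top while the spike is happening.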

This story is to investigate the cause of the spikes and determine whether they are a problem (e.g.):

[Image: image-20240605-174512.png (see image in Jira)]

DEV NOTES

One possible cause may be the creation of the committee views on startup being expensive: https://github.com/fecgov/fecfile-web-api/blob/3b3d581f76dc21c632239e5fc7c64d4608bad418/bin/run-api.sh#L5

The result of this ticket should be a list of action items covering the cause of the CPU spike and potential remedies.

QA Notes

null

DEV Notes

null

Design

null

FECFILE-172

exalate-issue-sync[bot] commented 2 weeks ago

Sasha Dresden commented: After a lot of digging and research, I found that the issue lies with the create_committee_views.py script, which runs every time the API is started or restarted.

Specifically, connecting to and manipulating the database during the spin-up process causes a CPU spike. I ran a few tests on my local Docker setup: rebuilding the committee view for a single committee with 0 transactions, rebuilding it for a single committee with 10,000 transactions, and skipping the rebuild when the committee view already existed. All of them led to a spike in CPU usage. The only thing that worked was to not initiate the connection at all, either by removing the script from the startup process or by adding a flag that tells it not to run.

When the CPU spike happened, CPU utilization was 100% before settling back down to sub-10%. When the script was skipped, the CPU utilization never went over 10%.

Looking into the [script itself](https://github.com/fecgov/fecfile-web-api/blob/develop/django-backend/fecfiler/committee_accounts/management/commands/create_committee_views.py), it recreates all of the committee views on startup. This seems unnecessary. It makes sense for the local setup, because we create a new committee as part of spinning up the local Docker environment, but dev, stage, and production would already have all of their committee views created and up to date. A future ticket could determine whether this script can simply be removed for non-local environments or, if it is deemed necessary, find a way to run it only conditionally, perhaps only when the API is first started but not when it is restarted, which could be accomplished by setting a flag.
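To make the flag idea concrete, here is a rough Python sketch of one way the startup path could decide whether to run the view creation at all. It is not code from the repository: the environment variable name `SKIP_COMMITTEE_VIEWS` and the helper `should_create_committee_views` are hypothetical, and how such a check would be wired into the startup script or the management command is left open.

```python
import os
from typing import Mapping, Optional

# Hypothetical flag name; nothing in the repository defines this variable.
SKIP_FLAG = "SKIP_COMMITTEE_VIEWS"
TRUTHY = {"1", "true", "yes"}


def should_create_committee_views(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True unless the hypothetical skip flag is set to a truthy value."""
    env = os.environ if environ is None else environ
    value = env.get(SKIP_FLAG, "")
    return value.strip().lower() not in TRUTHY


if __name__ == "__main__":
    if should_create_committee_views():
        print("would run create_committee_views")
    else:
        print("skipping create_committee_views")
```

Because the check happens before any database connection is opened, it matches the one approach that avoided the spike in local testing. Defaulting to "run" when the flag is unset keeps the local setup working unchanged.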

exalate-issue-sync[bot] commented 1 week ago

Matt Travers commented: No code to review. Sending directly to QA.

Follow-on ticket created from the findings of this ticket: https://fecgov.atlassian.net/browse/FECFILE-1462

exalate-issue-sync[bot] commented 1 week ago

Shelly Wise commented: Per DEV, there is no code to review, so no QA review is needed for this ticket.

Moved to Stage Ready.