CDLUC3 / dmptool

DMPTool version of the DMPRoadmap codebase
https://dmptool.org
MIT License

Maxing out memory and swap space #419

Open briri opened 1 year ago

briri commented 1 year ago

The instances have been periodically experiencing high memory usage which eventually results in maxing out our swap space.

This screenshot shows memory/swap usage and IOPS during a recent incident:

[Screenshot: memory/swap usage and IOPS, captured 2023-01-06 at 7:34:59 AM]

Swap begins escalating dramatically between 10 AM - 11 AM on 12/29. It then maxes out around 8 PM on 12/30.

Apache and Rails server logs show no unusual traffic. The Apache access logs show only 152 requests between 8 AM and 1 PM on 12/29.
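For anyone wanting to reproduce a count like this, here is a minimal Ruby sketch that counts access-log lines inside a time window. It assumes the standard Apache common/combined log timestamp format. The sample lines, the year, and the timezone offset are made up for illustration, and `count_requests` is a hypothetical helper, not something in the codebase.

```ruby
require "time"

# Timestamp as written by Apache's common/combined log formats,
# e.g. [29/Dec/2022:08:15:02 -0800]
TIMESTAMP = /\[(\d{2}\/\w{3}\/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]/
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

# Counts the log lines whose timestamp falls in the half-open window [from, to).
# Lines without a recognizable timestamp are ignored.
def count_requests(lines, from:, to:)
  lines.count do |line|
    match = TIMESTAMP.match(line)
    next false unless match

    at = Time.strptime(match[1], LOG_TIME_FORMAT)
    at >= from && at < to
  end
end

# Made-up sample lines; in practice pass File.foreach("path/to/access.log").
sample = [
  '203.0.113.5 - - [29/Dec/2022:08:15:02 -0800] "GET /plans HTTP/1.1" 200 512',
  '203.0.113.9 - - [29/Dec/2022:12:59:59 -0800] "GET /public_plans HTTP/1.1" 200 2048',
  '203.0.113.9 - - [29/Dec/2022:13:00:00 -0800] "GET / HTTP/1.1" 200 128'
]

window_start = Time.strptime("29/Dec/2022:08:00:00 -0800", LOG_TIME_FORMAT)
window_end   = Time.strptime("29/Dec/2022:13:00:00 -0800", LOG_TIME_FORMAT)

puts count_requests(sample, from: window_start, to: window_end)
```

The window is half-open, so a request logged at exactly 1 PM is not counted.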

We suspect an issue with the rack_attack gem, which we use for rate limiting and throttling malicious activity. These issues coincide with the introduction of the gem, but the correlation may be coincidental.

briri commented 1 year ago

Our plan:

briri commented 1 year ago

Removed the rack_attack gem from the stage environment and we are still seeing the same behavior. Memory usage steadily increases, so we suspect there is a memory leak somewhere.
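One way to confirm the "steadily increases" pattern is to sample the resident set size of the Puma process over time. The Ruby sketch below assumes a Linux host where `/proc/<pid>/status` is available. `rss_kb` and `steadily_increasing?` are hypothetical helpers, and the sample text and numbers are made up.

```ruby
# Extracts the resident set size (in kB) from the text of /proc/<pid>/status.
# Returns nil when the VmRSS line is missing.
def rss_kb(status_text)
  match = status_text.match(/^VmRSS:\s+(\d+)\s+kB/)
  match && match[1].to_i
end

# True when there are at least two samples and each one is larger than
# the one before it. A crude signal, not proof of a leak.
def steadily_increasing?(samples)
  samples.size >= 2 && samples.each_cons(2).all? { |earlier, later| later > earlier }
end

# Made-up excerpt of a /proc/<pid>/status file.
sample_status = <<~STATUS
  Name:   ruby
  VmPeak:   812340 kB
  VmRSS:    123456 kB
  Threads:  12
STATUS

puts rss_kb(sample_status)
puts steadily_increasing?([123_456, 130_000, 141_250])
```

On a real host the text would come from something like `File.read("/proc/#{pid}/status")`, sampled on a timer and written to a log so the trend can be compared against the swap graph.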

We're using the default Rails cache store, which is 'FileStore', so it should be using disk IO to read/write its cache under [project_root]/tmp rather than holding it in memory.

I am going to inspect the Apache logs in the stage env (since the traffic there is low) to see what actual requests it's handling and see if we can drill in from there.
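A rough Ruby sketch of that kind of log inspection: it tallies requests by HTTP verb and path (query strings stripped) so the most frequent endpoints surface first. It assumes the Apache common/combined log format. The sample lines are made up and `tally_requests` is a hypothetical helper.

```ruby
# Request line as it appears in Apache common/combined logs,
# e.g. "GET /plans?page=2 HTTP/1.1"
REQUEST = /"(?<verb>[A-Z]+) (?<path>\S+) HTTP\/[\d.]+"/

# Returns [["VERB /path", count], ...] sorted by count, highest first.
# Query strings are dropped so variants of one endpoint group together.
def tally_requests(lines)
  keys = lines.filter_map do |line|
    match = REQUEST.match(line)
    "#{match[:verb]} #{match[:path].sub(/\?.*/, '')}" if match
  end
  keys.tally.sort_by { |_key, count| -count }
end

# Made-up sample lines; in practice pass File.foreach("path/to/access.log").
sample_log = [
  '203.0.113.5 - - [29/Dec/2022:08:15:02 -0800] "GET /plans?page=2 HTTP/1.1" 200 512',
  '203.0.113.5 - - [29/Dec/2022:08:16:40 -0800] "GET /plans HTTP/1.1" 200 512',
  '203.0.113.9 - - [29/Dec/2022:09:01:13 -0800] "GET /public_plans HTTP/1.1" 200 2048'
]

tally_requests(sample_log).each { |key, count| puts "#{count}\t#{key}" }
```

Requires Ruby 2.7 or newer for `filter_map` and `tally`.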

I'll also do a diff of our Gemfile and package.json against what's in the core DMPRoadmap codebase, since the other installations are not seeing this type of behavior (although they are not yet running the Rails 6 version).
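A quick Ruby sketch of the Gemfile half of that comparison. It lists gem names that appear in only one of the two files, using a plain pattern match rather than Bundler, so it ignores versions, groups, and anything declared dynamically. The Gemfile contents below are made up, and `gem_names` and `gemfile_diff` are hypothetical helpers.

```ruby
# Pulls gem names out of Gemfile text with a simple pattern match.
# Commented-out gem lines are skipped because they do not start with `gem`.
def gem_names(gemfile_text)
  gemfile_text.scan(/^\s*gem\s+["']([^"']+)["']/).flatten.uniq.sort
end

# Returns the gem names that appear in only one of the two Gemfiles.
def gemfile_diff(ours, upstream)
  our_gems = gem_names(ours)
  upstream_gems = gem_names(upstream)
  { only_ours: our_gems - upstream_gems, only_upstream: upstream_gems - our_gems }
end

# Made-up contents; in practice use File.read on each Gemfile.
ours_text = <<~GEMFILE
  source "https://rubygems.org"
  gem "rails"
  gem "puma"
  gem "rack-attack"
  # gem "commented-out"
GEMFILE

upstream_text = <<~GEMFILE
  source "https://rubygems.org"
  gem 'rails'
  gem 'puma'
  gem 'example-upstream-only'
GEMFILE

diff = gemfile_diff(ours_text, upstream_text)
puts "Only in ours:     #{diff[:only_ours].join(', ')}"
puts "Only in upstream: #{diff[:only_upstream].join(', ')}"
```

Comparing the two Gemfile.lock files would also catch version differences in shared gems, which this name-only sketch does not.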

briri commented 1 year ago

We are going to introduce ActiveStorage and DelayedJob in early November, which will auto-generate narrative PDFs for public plans in the background. This should mitigate some of the 500-level errors we see when bots harvest these PDF files.

We will also be offloading all communication with the DMPHub to delayed_job to let things process in the background. While implementing this, we discovered a small loop in the callback logic that was causing DMPTool to send updates to the DMPHub 4 times instead of once. Not sure if this is contributing to the memory issues, but it should at least help.

briri commented 7 months ago

We put a cron job in place to restart Puma on a schedule as a band-aid for this.

briri commented 6 months ago

1) Take 02 out from behind the ELB and monitor it to see whether the leak is traffic related. Also turn off delayed_job on 01 and restart both.
2) Create a branch that removes elements like the wkhtmltopdf gem and run it on a single instance to see whether the leak persists.
3) Send logs to OpenSearch.