Open briri opened 1 year ago
Our plan:
Removed the rack_attack gem from the stage environment and we are still seeing the same behavior. Memory usage steadily increases, so we suspect there is a memory leak somewhere.
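One way to check whether the growth comes from Ruby-level objects is to compare object counts at two points in time. The following is a minimal sketch using only Ruby's built-in ObjectSpace and GC modules. The helper names snapshot and top_growth are hypothetical, and this approach would not reveal leaks inside native extensions.

```ruby
# Hypothetical helpers for spotting object growth between two points in time.

# Take a snapshot of object counts per internal type (T_STRING, T_HASH, ...).
def snapshot
  GC.start # collect garbage first so the counts mostly reflect retained objects
  ObjectSpace.count_objects
end

# Return the `limit` types whose counts grew the most, largest first.
# The :TOTAL and :FREE bookkeeping entries are ignored.
def top_growth(before, after, limit = 5)
  after
    .reject { |type, _count| %i[TOTAL FREE].include?(type) }
    .map { |type, count| [type, count - before.fetch(type, 0)] }
    .select { |_type, delta| delta.positive? }
    .sort_by { |_type, delta| -delta }
    .first(limit)
end

# Usage: take one snapshot, exercise the app, take another, then compare.
before = snapshot
retained = Array.new(10_000) { |i| "leak-#{i}" } # stand-in for leaked objects
after = snapshot
top_growth(before, after).each { |type, delta| puts "#{type}: +#{delta}" }
```

In a Rails app the two snapshots would be taken some minutes apart on a running worker rather than around a single statement as shown here.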
We're using the default Rails cache store, which is 'FileStore', so it should be using disk IO to read/write from [project_root]/tmp for its cache.
I am going to inspect the Apache logs in the stage env (since the traffic there is low) to see what actual requests it's handling and see if we can drill in from there.
I'll also do a diff of our Gemfile and package.json against what's in the core DMPRoadmap codebase, since the other installations are not seeing this type of behavior (although they are not yet running on the Rails 6 version).
We are going to introduce ActiveStorage and DelayedJob in early November, which will auto-generate narrative PDFs for public plans in the background. This should mitigate some of the 500-level errors we see when bots harvest these PDF files.
We will also be offloading all communication with the DMPHub to delayed_job to let things process in the background. While implementing this, we discovered a small loop in the callback logic that was causing DMPTool to send updates to the DMPHub 4 times instead of once. Not sure if this is contributing to the memory issues, but it should at least help.
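As an illustration of how repeated callback firings can be collapsed into a single background update, here is a minimal plain-Ruby sketch. It is not the DMPTool code or the delayed_job API: the UpdateQueue class and its methods are hypothetical, and it only shows the general idea of ignoring duplicate requests while one is still pending.

```ruby
require "set"

# Hypothetical illustration: a queue that ignores duplicate update requests
# for the same plan until the pending one has been processed.
class UpdateQueue
  def initialize
    @pending = Set.new # plan ids that already have a queued update
    @jobs = []         # queued plan ids, in arrival order
  end

  # Queue an update for plan_id unless one is already pending.
  # Returns true if a job was added, false if it was a duplicate.
  def enqueue(plan_id)
    return false unless @pending.add?(plan_id)

    @jobs << plan_id
    true
  end

  # Process all queued jobs, yielding each plan id once.
  def drain
    until @jobs.empty?
      plan_id = @jobs.shift
      @pending.delete(plan_id)
      yield plan_id
    end
  end
end

queue = UpdateQueue.new
4.times { queue.enqueue(42) } # the callback fires four times
queue.drain { |plan_id| puts "sending update for plan #{plan_id}" }
```

Fixing the loop in the callbacks themselves is still the real remedy; a guard like this only limits the damage if a callback fires more often than intended.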
We put a cron job in place to restart Puma on a schedule as a band-aid for this.
1) Take 02 out from behind the ELB (monitor to see if the leak is traffic related). Also turn off delayed_job on 01 and restart both.
2) Plan is to create a branch, remove elements like the wkhtmltopdf gem, and run it on a single instance to see whether the behavior changes.
3) Send logs to OpenSearch.
The instances have been periodically experiencing high memory usage which eventually results in maxing out our swap space.
This screenshot shows memory/swap usage and IOPS during a recent incident:
Swap begins escalating dramatically between 10 AM and 11 AM on 12/29. It then maxes out around 8 PM on 12/30.
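To record swap usage alongside the application logs, something like the following sketch could be run periodically. It assumes a Linux host that exposes /proc/meminfo, and the swap_usage helper name is hypothetical.

```ruby
# Parse the contents of /proc/meminfo (Linux) and return swap usage.
# Values in /proc/meminfo are reported in kB.
def swap_usage(meminfo_text)
  values = meminfo_text.scan(/^(\w+):\s+(\d+)/).to_h { |key, kb| [key, kb.to_i] }
  total = values.fetch("SwapTotal")
  free = values.fetch("SwapFree")
  used = total - free
  percent = total.zero? ? 0.0 : (used * 100.0 / total).round(1)
  { total_kb: total, used_kb: used, percent_used: percent }
end

# Only attempt a live reading where the file exists (i.e. on Linux).
if File.readable?("/proc/meminfo")
  puts swap_usage(File.read("/proc/meminfo"))
end
```

Logging this hash with a timestamp every few minutes would give a timeline that can be lined up against the request logs.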
Apache and Rails server logs show no unusual traffic. The Apache access logs show only 152 requests from 08 AM - 01 PM on 12/29.
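A request count like the one above can be reproduced with a short script. This is a minimal sketch that assumes the access log uses Apache's Common or Combined Log Format; the count_requests helper, the sample lines, the year, and the UTC offset are all made up for illustration.

```ruby
require "time"

# Matches the bracketed timestamp in Common/Combined Log Format,
# e.g. [29/Dec/2022:08:15:02 -0800]
TIMESTAMP = %r{\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]}

# Count log lines whose timestamp falls in the half-open window [from, to).
# Lines without a recognizable timestamp are skipped.
def count_requests(lines, from, to)
  lines.count do |line|
    match = TIMESTAMP.match(line)
    next false unless match

    at = Time.strptime(match[1], "%d/%b/%Y:%H:%M:%S %z")
    at >= from && at < to
  end
end

# Hypothetical sample data; the year and offset are assumptions.
sample = [
  '203.0.113.7 - - [29/Dec/2022:07:59:59 -0800] "GET / HTTP/1.1" 200 512',
  '203.0.113.7 - - [29/Dec/2022:08:00:00 -0800] "GET /plans HTTP/1.1" 200 1024',
  '203.0.113.9 - - [29/Dec/2022:12:59:59 -0800] "GET /plans/1 HTTP/1.1" 200 2048',
  '203.0.113.9 - - [29/Dec/2022:13:00:00 -0800] "GET / HTTP/1.1" 200 512'
]
from = Time.new(2022, 12, 29, 8, 0, 0, "-08:00")
to = Time.new(2022, 12, 29, 13, 0, 0, "-08:00")
puts count_requests(sample, from, to) # prints 2
```

For a real log, the sample array would be replaced with File.foreach on the access log path, which streams the file line by line instead of loading it all into memory.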
We suspect an issue with the rack_attack gem we use for rate limiting and throttling malicious activity. These issues coincide with the introduction of the gem, but that may be an invalid correlation.