Closed goetzrrGit closed 5 months ago
Our finding and fix:
We updated the aerie-ts-user-code-runner plugin with a two-line code change that fixes the memory leak. (PR: https://github.com/NASA-AMMOS/aerie-ts-user-code-runner/pull/37)
We’re also adding a new configuration option (“knob”) to the Docker Compose file for the sequencing server that lets you set the maximum number of workers. (PR: https://github.com/NASA-AMMOS/aerie/pull/1476)
Originally, I thought the worker pool was capped at 8 workers. However, we discovered that it actually starts with 8 and scales up as the workload increases, causing a spike in memory usage (RSS). In tests, we saw up to 20 workers spawn, leading to a significant memory increase.
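To illustrate why a maximum matters, here is a minimal sketch of a pool-sizing rule that starts at a minimum and grows with the queue until it reaches a cap. The names `PoolLimits` and `targetWorkerCount` are hypothetical and are not taken from the sequencing server or the PRs above; the actual pool's scaling logic may differ.

```typescript
// Hypothetical sizing rule for illustration only; not the sequencing server's code.
interface PoolLimits {
  minWorkers: number; // workers started up front (8 in the report above)
  maxWorkers: number; // hard cap supplied by configuration
}

// Returns how many workers a pool would run for a given number of queued tasks:
// never fewer than the minimum, never more than the cap.
function targetWorkerCount(queuedTasks: number, limits: PoolLimits): number {
  const wanted = Math.max(limits.minWorkers, queuedTasks);
  return Math.min(wanted, limits.maxWorkers);
}

// With no effective cap, 20 queued tasks lead to 20 workers.
const uncappedWorkers = targetWorkerCount(20, {
  minWorkers: 8,
  maxWorkers: Number.POSITIVE_INFINITY,
});

// With a cap of 8, the pool never grows past its starting size.
const cappedWorkers = targetWorkerCount(20, { minWorkers: 8, maxWorkers: 8 });
```

Since each worker holds its own memory, capping the worker count also bounds the RSS growth caused by scaling.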
Checked for duplicates
Yes - I've already checked
Is this a regression?
No - This is a new bug
Version
Describe the bug
@cartermak reported that an expansion run for TT-8 is failing. Upon further investigation, we have found this:
We identified a memory leak: memory is not released after an expansion run. Consider this scenario:
Expansion A uses 1 GB of memory and finishes. Expansion B then starts, but instead of reusing the memory that A should have released, it allocates an additional 1 GB, bringing total usage to 2 GB. This pattern repeats with each subsequent expansion, so memory usage grows cumulatively. Eventually, the server runs out of memory and crashes.
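The arithmetic of this scenario can be sketched as a small simulation. This is a toy model under the assumption that every expansion needs the same amount of memory; `memoryAfterRuns` is a hypothetical helper, not part of Aerie.

```typescript
// Toy model: total memory held (in GB) while each expansion run is active.
// When releasesMemory is false, each run's memory stays allocated after it finishes.
function memoryAfterRuns(
  runs: number,
  perRunGb: number,
  releasesMemory: boolean,
): number[] {
  const totals: number[] = [];
  let retainedGb = 0; // memory left over from earlier runs
  for (let i = 0; i < runs; i++) {
    const inUseGb = retainedGb + perRunGb;
    totals.push(inUseGb);
    if (!releasesMemory) {
      retainedGb = inUseGb; // leak: nothing is given back
    }
  }
  return totals;
}

const leaking = memoryAfterRuns(3, 1, false); // [1, 2, 3]
const healthy = memoryAfterRuns(3, 1, true); // [1, 1, 1]
```

In the leaking case the total grows linearly with the number of runs, so the server eventually hits its memory limit no matter how large that limit is.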
During testing with the problematic setup, we observed a significant spike in Resident Set Size (RSS) memory (~11 GB). RSS is the amount of physical memory (RAM) that a process currently occupies. This spike indicates inefficient memory utilization, even beyond the leak.
Reproduction
Run a sequence expansion and measure the memory usage.
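One possible way to turn this measurement into a check is to record RSS after each expansion run and look at the growth between runs. The sketch below assumes the samples are collected separately (in Node.js, each could come from `process.memoryUsage().rss`); `rssGrowthMb` and `looksLikeLeak` are hypothetical helpers, and the threshold is an arbitrary choice.

```typescript
const BYTES_PER_MB = 1024 * 1024;

// Takes RSS samples in bytes, one recorded after each expansion run,
// and returns the growth in MB between consecutive samples.
function rssGrowthMb(samplesBytes: number[]): number[] {
  const growth: number[] = [];
  let previous: number | undefined;
  for (const sample of samplesBytes) {
    if (previous !== undefined) {
      growth.push((sample - previous) / BYTES_PER_MB);
    }
    previous = sample;
  }
  return growth;
}

// Heuristic: flags a leak when every run left RSS higher than the previous
// run by at least thresholdMb. Needs at least two samples to say anything.
function looksLikeLeak(samplesBytes: number[], thresholdMb: number): boolean {
  const growth = rssGrowthMb(samplesBytes);
  return growth.length > 0 && growth.every((mb) => mb >= thresholdMb);
}
```

A steadily rising series suggests memory is not being reclaimed between runs, while a series that returns to roughly the same level after each run does not. RSS can also grow for reasons other than a leak (for example, additional workers being spawned), so this is a first signal rather than proof.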
Logs
No response
System Info
Severity
Critical