NASA-AMMOS / aerie

A software framework for modeling spacecraft.
https://nasa-ammos.github.io/aerie-docs/
MIT License

Mitigate Sequencing Server memory leak by limiting max number of workers #1476

Closed: goetzrrGit closed this 5 months ago

goetzrrGit commented 5 months ago

Description

@cartermak reported that an expansion run for TT-8 is failing. Further investigation turned up two problems:

  1. Memory Leak:

We identified a memory leak issue where memory is not being released properly after an expansion run. Imagine this scenario:

Expansion A finishes having used 1 GB of memory. Expansion B then starts, but instead of reusing the memory that A should have released, it allocates another 1 GB, bringing the total to 2 GB. The same thing happens with every subsequent expansion, so memory use grows cumulatively until the server runs out of memory and crashes.

  2. High Memory Usage (RSS):

During testing with the problematic setup, we observed a significant spike in Resident Set Size (RSS) memory (~11 GB). RSS is the amount of physical memory (RAM) a process currently occupies. A spike of this size points to inefficient memory use on top of the leak itself.

rss: '5602.07 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '2267.14 MB -> total size of the allocated heap',
heapUsed: '2243.09 MB -> actual memory used during the execution',
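For readers who want to collect comparable numbers, the sketch below formats a memory sample in the same shape as the log lines above. It is an illustration rather than the server's actual logging code: `MemorySample`, `toMb`, and `formatMemoryUsage` are hypothetical names, and the conversion assumes 1 MB = 1024 * 1024 bytes. In Node.js, `process.memoryUsage()` returns an object with these three fields in bytes, so it can be passed in directly.

```typescript
// Hypothetical helper: formats byte counts like the log lines above.
// Assumes 1 MB = 1024 * 1024 bytes; the original log may round differently.
interface MemorySample {
  rss: number;       // resident set size, in bytes
  heapTotal: number; // total allocated heap, in bytes
  heapUsed: number;  // heap actually in use, in bytes
}

function toMb(bytes: number): string {
  return `${(bytes / 1024 / 1024).toFixed(2)} MB`;
}

function formatMemoryUsage(sample: MemorySample): Record<keyof MemorySample, string> {
  return {
    rss: `${toMb(sample.rss)} -> Resident Set Size - total memory allocated for the process execution`,
    heapTotal: `${toMb(sample.heapTotal)} -> total size of the allocated heap`,
    heapUsed: `${toMb(sample.heapUsed)} -> actual memory used during the execution`,
  };
}

// In Node.js, a sample can be taken before and after an expansion run:
//   console.log(formatMemoryUsage(process.memoryUsage()));
```

Logging a sample after each expansion makes both symptoms visible: a leak shows up as `heapUsed` that never falls back between runs, while worker growth shows up as RSS rising much faster than the heap.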

Solution:

Originally, I thought the worker pool was capped at 8 workers. It actually starts with 8 and scales up as the workload increases, which is what drives the spike in RSS. In tests, we saw up to 20 workers spawn, with a correspondingly large increase in memory. This PR mitigates the problem by limiting the maximum number of workers the pool can create.
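The effect of a hard cap can be illustrated with a small bounded-concurrency runner. This is a minimal sketch of the general technique, not the Sequencing Server's worker pool or its configuration: `runWithWorkerCap` and `peakActive` are hypothetical names, and the tasks are plain async functions rather than worker threads.

```typescript
// Hypothetical helper, not Aerie's implementation: runs tasks with at most
// `maxWorkers` in flight and reports the peak concurrency that was reached.
type Task<T> = () => Promise<T>;

async function runWithWorkerCap<T>(
  tasks: Task<T>[],
  maxWorkers: number,
): Promise<{ results: T[]; peakActive: number }> {
  if (!Number.isInteger(maxWorkers) || maxWorkers < 1) {
    throw new RangeError("maxWorkers must be a positive integer");
  }

  const results: T[] = new Array(tasks.length);
  let next = 0;       // index of the next task to hand out
  let active = 0;     // tasks currently running
  let peakActive = 0; // highest value `active` has reached

  // Each worker pulls the next pending task until none are left.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      active++;
      peakActive = Math.max(peakActive, active);
      try {
        results[index] = await tasks[index]();
      } finally {
        active--;
      }
    }
  }

  // Never start more workers than the cap, however many tasks are queued.
  const workerCount = Math.min(maxWorkers, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return { results, peakActive };
}
```

With 20 queued tasks and a cap of 8, the runner never has more than 8 tasks in flight; the remaining tasks wait for a free worker instead of causing new ones to spawn. The trade-off is throughput under heavy load in exchange for a bounded memory footprint.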

Verification

After implementing these two fixes, Resident Set Size (RSS) memory usage stabilized at around 3 GB with a 32-day Clipper plan, and the heap stayed in the 100 MB range. Memory usage now drops when expansions are rerun, which indicates that garbage collection is reclaiming memory between runs.