NASA-AMMOS / aerie

A software framework for modeling spacecraft.
https://nasa-ammos.github.io/aerie-docs/
MIT License

Sequencing Server Memory Leak #1475

Closed goetzrrGit closed 5 months ago

goetzrrGit commented 5 months ago

Checked for duplicates

Yes - I've already checked

Is this a regression?

No - This is a new bug

Version

=2.11

Describe the bug

@cartermak reported that an expansion run for TT-8 is failing. On further investigation, we found two problems:

  1. Memory Leak:

We identified a memory leak: memory is not released after an expansion run completes. Consider this scenario:

Expansion A finishes and uses 1 GB of memory. Expansion B starts, but instead of reclaiming the memory used by A, it allocates an additional 1 GB, bringing the total used memory to 2 GB. This pattern continues with each subsequent expansion, causing a cumulative memory increase. Eventually, the server runs out of memory and crashes.

  2. High Memory Usage (RSS):

During testing with the problematic setup, we observed a significant spike in Resident Set Size (RSS) memory (~11 GB). RSS is the amount of physical memory (RAM) currently held by a process. This spike indicates inefficient memory utilization, separate from the leak itself.

rss: '5602.07 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '2267.14 MB -> total size of the allocated heap',
heapUsed: '2243.09 MB -> actual memory used during the execution',
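
Readings in this shape can be produced with Node's built-in `process.memoryUsage()`. The following is a minimal sketch, assuming the server runs on Node.js; the helper names `formatMB` and `memoryReport` are illustrative and are not taken from the Aerie code base.

```typescript
// Hypothetical helpers: the names are illustrative, not from the Aerie code base.

// Convert a byte count to megabytes with two decimal places.
const formatMB = (bytes: number): string =>
  `${(bytes / 1024 / 1024).toFixed(2)} MB`;

// Build a labelled report from process.memoryUsage(), which reports bytes.
function memoryReport(): Record<"rss" | "heapTotal" | "heapUsed", string> {
  const usage = process.memoryUsage();
  return {
    rss: `${formatMB(usage.rss)} -> Resident Set Size - total memory allocated for the process execution`,
    heapTotal: `${formatMB(usage.heapTotal)} -> total size of the allocated heap`,
    heapUsed: `${formatMB(usage.heapUsed)} -> actual memory used during the execution`,
  };
}

console.log(memoryReport());
```

Logging this before and after each expansion makes it possible to compare RSS against heap usage, as in the reading above where RSS is more than twice `heapTotal`.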

Reproduction

Run sequence expansion and measure the memory
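
A rough way to check for the cumulative growth described above is to sample heap usage after each of several runs and see whether it rises every time. This is only a sketch under assumptions: `runExpansion` is a hypothetical stand-in for whatever triggers one expansion, it is modelled as synchronous for brevity, and the growth threshold is arbitrary.

```typescript
// Hypothetical harness: `runExpansion` stands in for one expansion run.
function sampleHeapAfterRuns(runExpansion: () => void, runs: number): number[] {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    runExpansion();
    // gc() is only defined when Node is started with --expose-gc;
    // without it, samples include garbage that has not been collected yet.
    (globalThis as { gc?: () => void }).gc?.();
    samples.push(process.memoryUsage().heapUsed);
  }
  return samples;
}

// True when every sample exceeds the previous one by at least minDeltaBytes.
function growsEveryRun(samples: number[], minDeltaBytes: number): boolean {
  return (
    samples.length > 1 &&
    samples.slice(1).every((s, i) => s - samples[i] >= minDeltaBytes)
  );
}

// Usage with a deliberately leaky stand-in that retains an array per run.
const retained: number[][] = [];
const samples = sampleHeapAfterRuns(() => {
  retained.push(new Array(100_000).fill(0));
}, 5);
console.log(samples, growsEveryRun(samples, 1));
```

Steady growth across runs with forced garbage collection points toward retained references rather than delayed collection.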

Logs

No response

System Info

Sequencing Server

Severity

Critical

goetzrrGit commented 5 months ago

Our finding and fix:

Originally, I thought the worker pool was capped at 8 workers. However, we discovered that it actually starts with 8 and scales up as the workload increases, causing a spike in memory usage (RSS). In tests, we saw up to 20 workers spawn, leading to a significant memory increase.
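
The difference between a pool that starts at 8 and one that is capped at 8 can be illustrated with a small model. This is a sketch only, not the sequencing server's pool: `PoolLimits` and `activeWorkers` are hypothetical names, and the upper limit of 20 is taken from the observed worker count rather than from any known configuration value.

```typescript
// Hypothetical model of an elastic worker pool; not the server's actual pool.
interface PoolLimits {
  minWorkers: number; // workers created at startup
  maxWorkers: number; // hard upper bound on concurrent workers
}

// Workers alive for a given backlog: never below min, never above max.
function activeWorkers(
  pendingTasks: number,
  { minWorkers, maxWorkers }: PoolLimits
): number {
  return Math.min(maxWorkers, Math.max(minWorkers, pendingTasks));
}

// Assumed limits: starts at 8 and may grow to the 20 workers seen in testing.
const elastic: PoolLimits = { minWorkers: 8, maxWorkers: 20 };
// Setting max equal to min gives the fixed cap that was originally assumed.
const capped: PoolLimits = { minWorkers: 8, maxWorkers: 8 };

console.log(activeWorkers(20, elastic)); // 20
console.log(activeWorkers(20, capped)); // 8
```

Because each worker carries its own memory overhead, bounding the worker count also bounds the RSS spike under load.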