Mitigate Sequencing Server memory leak by limiting max number of workers

Tickets addressed: Closes #1475
Review: By commit
Merge strategy: Merge (no squash)

Description

@cartermak reported that an expansion run for TT-8 is failing. Upon further investigation, we found the following:

Memory Leak:

We identified a memory leak: memory is not released properly after an expansion run. Consider this scenario:

Expansion A finishes and uses 1 GB of memory. Expansion B starts, but instead of reclaiming the memory used by A, it allocates an additional 1 GB, bringing the total used memory to 2 GB. This pattern continues with each subsequent expansion, causing a cumulative memory increase. Eventually, the server runs out of memory and crashes.

High Memory Usage (RSS):

During testing with the problematic setup, we observed a significant spike in Resident Set Size (RSS) memory (~11 GB). RSS is the portion of a process's memory that is held in physical RAM. This spike indicates inefficient memory utilization, even beyond the leak issue.

rss: '5602.07 MB -> Resident Set Size - total memory allocated for the process execution',
heapTotal: '2267.14 MB -> total size of the allocated heap',
heapUsed: '2243.09 MB -> actual memory used during the execution',
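For reference, a report in this shape can be produced from the byte counts that Node's `process.memoryUsage()` returns. This is a minimal sketch and not the code used in the Sequencing Server; the names `MemorySnapshot`, `toMB`, and `describeMemory` are hypothetical.

```typescript
// Subset of the fields returned by Node's process.memoryUsage() (values are in bytes).
interface MemorySnapshot {
  rss: number;
  heapTotal: number;
  heapUsed: number;
}

// Convert a byte count to megabytes, formatted with two decimals.
function toMB(bytes: number): string {
  return `${(bytes / 1024 / 1024).toFixed(2)} MB`;
}

// Build a human-readable report with the same labels as the log above.
function describeMemory(m: MemorySnapshot): { rss: string; heapTotal: string; heapUsed: string } {
  return {
    rss: `${toMB(m.rss)} -> Resident Set Size - total memory allocated for the process execution`,
    heapTotal: `${toMB(m.heapTotal)} -> total size of the allocated heap`,
    heapUsed: `${toMB(m.heapUsed)} -> actual memory used during the execution`,
  };
}
```

In a Node process this could be called as `console.log(describeMemory(process.memoryUsage()))` after each expansion run to watch how RSS and heap usage evolve.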

Solution:

We updated the aerie-user-ts-code-runner plugin with a two-line code change to fix the memory leak.
We’re also adding a new configuration option (“knob”) to the sequencing server's Docker Compose configuration that sets the maximum number of workers.

Originally, I thought the worker pool was capped at 8 workers. However, we discovered that it actually starts with 8 and scales up as the workload increases, causing a spike in memory usage (RSS). In tests, we saw up to 20 workers spawn, leading to a significant memory increase.
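To illustrate the idea behind the cap, here is a minimal sketch of how a configured maximum could bound a pool that otherwise grows with the workload. It is not the actual pool logic in the Sequencing Server: the function names and the scaling rule (one worker per pending task) are assumptions made for illustration, and the raw value would come from whatever environment variable the Docker Compose file exposes.

```typescript
// The description above says the pool starts with 8 workers.
const INITIAL_WORKERS = 8;

// Parse the configured cap from a raw environment value (hypothetical helper).
// Missing, non-numeric, or non-positive values fall back to the given default.
function resolveMaxWorkers(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Assumed scaling rule: the pool wants one worker per pending task,
// never fewer than the initial size, and never more than the cap.
function targetWorkerCount(pendingTasks: number, maxWorkers: number): number {
  const wanted = Math.max(INITIAL_WORKERS, pendingTasks);
  return Math.min(wanted, maxWorkers);
}
```

With a cap of 8, a backlog of 20 tasks would still run on 8 workers under this model, whereas an uncapped pool would grow to 20, which matches the growth we saw in testing.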

Verification

After implementing these two fixes, Resident Set Size (RSS) memory usage stabilized at around 3 GB with a 32-day Clipper plan, and the heap stayed in the 100 MB range. This indicates that garbage collection is now functioning effectively, as memory usage drops when expansions are rerun.
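A check like this can be automated by sampling RSS after each expansion run and flagging strictly cumulative growth. The sketch below is one possible way to do that, not what was used for this PR; `looksLikeLeak` and its tolerance parameter are hypothetical.

```typescript
// Returns true when every sample exceeds the previous one by more than
// toleranceMB, i.e. memory only ever grows from run to run.
// A single drop (or plateau within the tolerance) counts as "memory was reclaimed".
function looksLikeLeak(rssSamplesMB: number[], toleranceMB: number): boolean {
  let previous: number | undefined;
  for (const sample of rssSamplesMB) {
    if (previous !== undefined && sample - previous <= toleranceMB) {
      return false;
    }
    previous = sample;
  }
  // Fewer than two samples cannot show a trend.
  return rssSamplesMB.length >= 2;
}
```

Samples that rise by roughly the size of one expansion on every run would be flagged, while samples that plateau or drop between runs would not.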

NASA-AMMOS / aerie

Mitigate Sequencing Server memory leak by limiting max number of workers #1476