martyvona opened 9 months ago
When I first worked on this about 6-8 weeks ago I got it to simulate the full 5-year plan in something like 24 GB peak heap. However, more recent tests are failing again; I was looking into that at the point where I was asked to stop working on it. One of the things that changed recently is that the eurc model now does a better job of validating and running GNC turn activities, where previous versions might have neither errored out nor actually simulated them because of things like exceeding acceleration limits. So that same 5-year plan we have been working with now requires some tweaks to simulate at all, and once we make those, it uses a lot more memory than before because it is actually simulating more of the turns than it was before.
My latest result with that work was that the 5-year plan simulates about halfway through and then fails. Not because it ran out of heap, but because it hits some combination of activities at that point that causes a concurrent modification error. That said, I believe peak heap usage was already in the mid-20 GB range by that point, so even if we fix the concurrent modification issue I would be surprised if it got much further without running out of memory.
Simulation failed. Response:
{'status': 'failed', 'reason': {'data': {'elapsedTime': '21307:04:36.000000', 'utcTimeDoy': '2027-122T07:04:36'}, 'message': '', 'timestamp': '2024-03-13T07:57:44.498576429Z.498576', 'trace': 'java.lang.UnsupportedOperationException: Conflicting concurrent effects on the same cell. Please disambiguate model to remove conflicting operations: SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(OFF)]] and SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(ON)]]
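For anyone reading along, here is a hypothetical sketch (made-up types, not the actual Aerie cell or effect classes) of what this error means: two tasks that ran concurrently, with neither causally ordered before the other, each emitted a set-effect on the same cell, and there is no well-defined way to merge a set-to-OFF with a set-to-ON, so the engine refuses to guess.

```java
// Hypothetical illustration only: not the actual Aerie cell/effect classes.
final class ConcurrentEffectSketch {
  enum Discrete { ON, OFF }
  record SetEffect(Discrete newValue) {}

  // Merging effects emitted on two concurrent branches: identical sets can be
  // collapsed harmlessly, but conflicting sets have no well-defined result,
  // which mirrors the UnsupportedOperationException in the trace above.
  static SetEffect mergeConcurrent(SetEffect left, SetEffect right) {
    if (left.newValue() == right.newValue()) return left;
    throw new UnsupportedOperationException(
        "Conflicting concurrent effects on the same cell: " + left + " and " + right);
  }
}
```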
I was still working to understand what is still using a lot of memory, but it was perhaps looking like about equal parts (a) resource timelines (even a single segment of a scalar-valued numeric resource timeline consumes something like 70 bytes, IIRC) and (b) simulation engine task frame histories.
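For a rough sense of where (a) comes from, here is a back-of-envelope size estimate for one segment of a scalar numeric profile on a 64-bit JVM with compressed oops. The `Segment` record below is an illustrative stand-in, not the actual Aerie segment type, and the numbers are approximate:

```java
import java.time.Duration;

// Illustrative stand-in for a profile segment: a start time plus a boxed value.
record Segment(Duration start, Double value) {}

// Rough per-object accounting with compressed oops (approximate, JOL-style):
//   Segment:  12-byte header + two 4-byte refs, padded to 8       ~24 bytes
//   Duration: 12-byte header + long seconds + int nanos           ~24 bytes
//   Double:   12-byte header + 8-byte payload, padded             ~24 bytes
//   --------------------------------------------------------------------
//   total per segment, before any containing-collection overhead  ~72 bytes
// which lands in the same ballpark as the ~70 bytes per segment noted above.
```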
This PR contains several adjustments to reduce memory usage for long plans. These were developed by analyzing memory dumps acquired while simulating a 5-year Clipper plan with about 7.5k activities. This plan was running out of memory in the default simulation configuration with the standard 32GB docker memory limit.
With the changes in this PR, combined with changes in a corresponding clipper-aerie PR, that plan now successfully simulates with a 32GB docker memory limit, I believe with at least a few GB to spare.
The changes in these PRs address several memory bottlenecks. In this specific Aerie PR those are:
- `ExtendedModelActions.spawn(RepeatingTask)` appears to be a form of tail recursion, but without tail call optimization this was using a lot of memory. There are several possible ways to address this; here I'm proposing a relatively minimal one that I feel is practical, though perhaps not as elegant as some other more complex possibilities. (A sketch of the pattern follows this list.)
- `SerializedValue` implementations intended to help avoid using a large number of `BigDecimal` instances in this codepath. This PR also adds an opt-in feature that allows resource profiles to use a form of run-length compression (sketched at the end of this description).
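To illustrate the first bottleneck, the sketch below uses made-up `spawn`/`delay` stand-ins rather than the real Aerie modeling actions; it only shows why re-spawning a task each period behaves like tail recursion without tail-call optimization, and what a loop-based alternative looks like (not necessarily the exact approach taken in this PR).

```java
import java.time.Duration;

// Hypothetical sketch only; names and signatures here are assumptions, not the real API.
final class RepeatingTaskSketch {
  // Stand-ins for the modeling actions; the real engine would schedule/suspend tasks here.
  static void spawn(Runnable task) { task.run(); }
  static void delay(Duration d) { /* the real engine would suspend the calling task */ }

  // The pattern this PR targets: each period ends by spawning a fresh copy of the task.
  // Every spawn leaves another task (and its frame history) behind in the engine, so a
  // multi-year plan accumulates memory roughly linearly with the number of repetitions:
  // tail recursion with no tail-call optimization.
  static void repeatRecursively(Duration period, Runnable step, int remaining) {
    if (remaining == 0) return;
    step.run();
    delay(period);
    spawn(() -> repeatRecursively(period, step, remaining - 1));
  }

  // A minimal alternative: one long-lived task that loops, so only a single frame is ever
  // live for the repetition.
  static void repeatIteratively(Duration period, Runnable step, int count) {
    for (int i = 0; i < count; i++) {
      step.run();
      delay(period);
    }
  }

  public static void main(String[] args) {
    repeatIteratively(Duration.ofMinutes(10), () -> System.out.println("tick"), 3);
  }
}
```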
This is still a draft PR for now for the following reasons:

- `./gradlew e2eTest`: looks like there are some regressions I will need to fix there.
- `compare_resources.py`: I still need to run this to make sure that there are no regressions in the simulation results. I suppose I will do this by running only some prefix of the test plan to stay within memory limits in the baseline configuration.
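For the opt-in run-length compression of resource profiles mentioned above, here is a minimal sketch of the general idea, using made-up types rather than the actual Aerie profile classes: consecutive segments with identical values collapse into a single stored segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of run-length compression for a resource profile; the
// types and method names here are illustrative, not the actual Aerie classes.
final class RleProfileSketch {
  record Segment(long startMicros, double value) {}

  private final List<Segment> segments = new ArrayList<>();

  // Append a sample, but drop it when the value is unchanged, so long runs of
  // identical values collapse into a single stored segment.
  void append(long startMicros, double value) {
    if (!segments.isEmpty()
        && Double.compare(segments.get(segments.size() - 1).value(), value) == 0) {
      return; // value unchanged: the previous segment already covers this time
    }
    segments.add(new Segment(startMicros, value));
  }

  int segmentCount() { return segments.size(); }
}
```

For a profile dominated by long stretches of a constant value, this trades one comparison per append for a potentially large reduction in stored segments, which is why it helps most on multi-year plans; making it opt-in keeps the default behavior unchanged.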