NASA-AMMOS / aerie

A software framework for modeling spacecraft.
https://nasa-ammos.github.io/aerie-docs/
MIT License

reduce simulation memory usage for long plans #1337

Open martyvona opened 9 months ago

martyvona commented 9 months ago

This PR contains several adjustments to reduce memory usage for long plans. These were developed by analyzing memory dumps acquired while simulating a 5-year Clipper plan with about 7.5k activities. This plan was running out of memory in the default simulation configuration with the standard 32GB docker memory limit.
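(For reference, a heap dump of a running simulation worker can be captured with `jcmd <pid> GC.heap_dump <file>.hprof`, or programmatically as in the sketch below; this is standard JVM tooling rather than anything specific to this PR, and not necessarily exactly how these particular dumps were acquired.)

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

/** Illustrative helper (not part of this PR): dump the current JVM heap to a file for offline analysis. */
public final class HeapDumper {
  public static void dumpHeap(final String outputFile, final boolean liveObjectsOnly) throws IOException {
    final var bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    // liveObjectsOnly = true forces a GC first and dumps only reachable objects,
    // which is usually what you want when measuring retained memory.
    bean.dumpHeap(outputFile, liveObjectsOnly);
  }

  public static void main(final String[] args) throws IOException {
    dumpHeap("/tmp/aerie-sim.hprof", true);
  }
}
```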

With the changes in this PR, combined with changes in a corresponding clipper-aerie PR, that plan now simulates successfully within a 32GB docker memory limit, with (I believe) at least a few GB to spare.

The changes in these PRs address several memory bottlenecks. In this specific Aerie PR those are:

This is still a draft PR, for the following reasons:

  1. I would still like to run some additional tests with docker configured with memory limits that more closely mimic actual deployments
  2. DONE ~~I didn't know about `./gradlew e2eTest`; looks like there are some regressions I will need to fix there.~~
  3. I still need to run @DavidLegg's compare_resources.py to make sure there are no regressions in the simulation results (roughly the kind of value-by-value check sketched below). I suppose I will do this by running only a prefix of the test plan, to stay within memory limits in the baseline configuration.
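For illustration only (this is my own sketch of the kind of check involved, not the actual compare_resources.py): a regression check of this sort boils down to comparing the two runs' resource profiles sample by sample within some tolerance.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: compare two simulated resource profiles, sampled at matching times. */
public final class ProfileCompare {
  /** One sample of a numeric resource: elapsed time in microseconds and the resource value. */
  public record Sample(long timeMicros, double value) {}

  /** Returns true if every baseline sample has a matching candidate sample within `tolerance`. */
  public static boolean matches(final Map<String, List<Sample>> baseline,
                                final Map<String, List<Sample>> candidate,
                                final double tolerance) {
    for (final var entry : baseline.entrySet()) {
      final var base = entry.getValue();
      final var cand = candidate.get(entry.getKey());
      if (cand == null || cand.size() != base.size()) return false;
      for (int i = 0; i < base.size(); i++) {
        if (base.get(i).timeMicros() != cand.get(i).timeMicros()) return false;
        if (Math.abs(base.get(i).value() - cand.get(i).value()) > tolerance) return false;
      }
    }
    return true;
  }
}
```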
martyvona commented 8 months ago

When I first worked on this about 6-8 weeks ago, I got it to simulate the full 5y plan in something like 24G peak heap. However, more recent tests are failing again; I was looking into that at the point where I was asked to stop working on this. One of the things that changed recently is that the eurc model now does a better job of validating and running GNC turn activities, where in previous versions a turn might have neither errored out nor actually simulated because of things like exceeding acceleration limits. So that same 5y plan we have been working with now requires some tweaks to simulate at all, and once we make those, it uses a lot more memory than before because it is actually simulating more of the turns than it was before.

My latest result with that work was that the 5y plan simulates about halfway through and then fails. Not because it ran out of heap, but because it hits some combination of activities at that point that causes a concurrent modification error. That said, I believe heap usage was already in the mid-20GB range by that point, so even if we fix the concurrent modification issue I would be surprised if the simulation got much further without running out of memory.

Simulation failed. Response:

```
{'status': 'failed', 'reason': {'data': {'elapsedTime': '21307:04:36.000000', 'utcTimeDoy': '2027-122T07:04:36'}, 'message': '', 'timestamp': '2024-03-13T07:57:44.498576429Z.498576', 'trace': 'java.lang.UnsupportedOperationException: Conflicting concurrent effects on the same cell. Please disambiguate model to remove conflicting operations: SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(OFF)]] and SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(ON)]]
```
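(Aside, for anyone who hasn't hit this error before: it means two tasks emitted non-commuting effects on the same cell at the same simulation instant, so there is no well-defined merged effect. The toy sketch below is plain Java, not Aerie's actual effect-algebra API; it just illustrates why two concurrent sets conflict while, say, two concurrent increments merge cleanly.)

```java
import java.util.Optional;
import java.util.function.UnaryOperator;

/** Toy illustration (not Aerie's API) of why concurrent SetEffects on one cell conflict. */
public final class ConcurrentEffects {
  /** Try to merge two concurrent effects on a cell; empty if they fail to commute at the probed state. */
  static <T> Optional<UnaryOperator<T>> concurrently(final UnaryOperator<T> left,
                                                     final UnaryOperator<T> right,
                                                     final T probe) {
    // Two effects may be applied concurrently only if the combined result is order-independent.
    final T leftThenRight = right.apply(left.apply(probe));
    final T rightThenLeft = left.apply(right.apply(probe));
    if (!leftThenRight.equals(rightThenLeft)) return Optional.empty(); // conflicting, like set(ON) vs set(OFF)
    return Optional.of(state -> right.apply(left.apply(state)));
  }

  public static void main(final String[] args) {
    // Non-commuting: concurrent set(true) and set(false) on the same boolean cell -> conflict.
    System.out.println(concurrently((Boolean b) -> true, (Boolean b) -> false, false).isPresent()); // false
    // Commuting: concurrent increments on the same integer cell merge fine.
    System.out.println(concurrently((Integer n) -> n + 1, (Integer n) -> n + 2, 0).isPresent());    // true
  }
}
```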

I was still working to understand what is using so much memory, but it looked like roughly equal parts (a) resource timelines (even a single segment of a scalar-valued numeric resource timeline consumes something like 70 bytes, IIRC) and (b) simulation engine task frame histories.
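(To give a sense of the headroom on the resource-timeline side: storing each segment as its own heap object with boxed fields easily reaches tens of bytes per segment once object headers and references are counted. A struct-of-arrays layout along the lines of the sketch below, which is just an illustration and not necessarily what any follow-up change would do, gets a piecewise-constant scalar numeric profile down to roughly 16 bytes per segment.)

```java
import java.util.Arrays;

/** Illustrative sketch (not Aerie code): store a scalar numeric resource profile as parallel
 *  primitive arrays (~16 bytes per segment) instead of one heap object per segment. */
public final class CompactNumericProfile {
  private long[] startMicros = new long[16];  // segment start offsets, microseconds since plan start
  private double[] values = new double[16];   // constant value over each segment
  private int size = 0;

  /** Append a segment; segments must be added in increasing time order. */
  public void append(final long startMicro, final double value) {
    if (size == startMicros.length) {
      startMicros = Arrays.copyOf(startMicros, size * 2);
      values = Arrays.copyOf(values, size * 2);
    }
    startMicros[size] = startMicro;
    values[size] = value;
    size++;
  }

  /** Value at the given time, assuming piecewise-constant dynamics (toy linear-scan lookup). */
  public double valueAt(final long timeMicros) {
    for (int i = size - 1; i >= 0; i--) {
      if (startMicros[i] <= timeMicros) return values[i];
    }
    throw new IllegalArgumentException("time precedes first segment");
  }

  public int segmentCount() { return size; }
}
```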