martyvona opened 9 months ago
When I first worked on this about 6-8 weeks ago I got it to simulate the full 5-year plan in something like 24 GB peak heap. However, more recent tests are failing again; I was looking into that at the point where I was asked to stop working on it. One of the things that changed recently is that the eurc model now does a better job of validating and running GNC turn activities, where previous versions might have neither errored out nor actually simulated them because of things like exceeding acceleration limits. So that same 5-year plan we have been working with now requires some tweaks to simulate at all, and once we make those, it uses a lot more memory than before because it is actually simulating more of the turns than it was before.
My latest result with that work was that the 5-year plan simulates about halfway through and then fails. Not because it ran out of heap, but because it hits some combination of activities at that point that causes a concurrent modification error. That said, I believe peak heap usage was already in the mid-20 GB range by that point, so even if we fix the concurrent modification issue I would be surprised if it got much further without running out of memory.
Simulation failed. Response:
{'status': 'failed', 'reason': {'data': {'elapsedTime': '21307:04:36.000000', 'utcTimeDoy': '2027-122T07:04:36'}, 'message': '', 'timestamp': '2024-03-13T07:57:44.498576429Z.498576', 'trace': 'java.lang.UnsupportedOperationException: Conflicting concurrent effects on the same cell. Please disambiguate model to remove conflicting operations: SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(OFF)]] and SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(ON)]]
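For anyone reading along, here is a hypothetical sketch (made-up types, not the actual Aerie cell or effect classes) of what this error means: two tasks that ran concurrently, with neither causally ordered before the other, each emitted a set-effect on the same cell, and there is no well-defined way to merge a set-to-OFF with a set-to-ON, so the engine refuses to guess.

```java
// Hypothetical illustration only: not the actual Aerie cell/effect classes.
final class ConcurrentEffectSketch {
  enum Discrete { ON, OFF }
  record SetEffect(Discrete newValue) {}

  // Merging effects emitted on two concurrent branches: identical sets can be
  // collapsed harmlessly, but conflicting sets have no well-defined result,
  // which mirrors the UnsupportedOperationException in the trace above.
  static SetEffect mergeConcurrent(SetEffect left, SetEffect right) {
    if (left.newValue() == right.newValue()) return left;
    throw new UnsupportedOperationException(
        "Conflicting concurrent effects on the same cell: " + left + " and " + right);
  }
}
```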
I was still working to understand what is still using a lot of memory, but it was perhaps looking like about equal parts (a) resource timelines (even a single segment of a scalar-valued numeric resource timeline consumes something like 70 bytes, IIRC) and (b) simulation engine task frame histories.
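For a rough sense of where (a) comes from, here is a back-of-envelope size estimate for one segment of a scalar numeric profile on a 64-bit JVM with compressed oops. The `Segment` record below is an illustrative stand-in, not the actual Aerie segment type, and the numbers are approximate:

```java
import java.time.Duration;

// Illustrative stand-in for a profile segment: a start time plus a boxed value.
record Segment(Duration start, Double value) {}

// Rough per-object accounting with compressed oops (approximate, JOL-style):
//   Segment:  12-byte header + two 4-byte refs, padded to 8       ~24 bytes
//   Duration: 12-byte header + long seconds + int nanos           ~24 bytes
//   Double:   12-byte header + 8-byte payload, padded             ~24 bytes
//   --------------------------------------------------------------------
//   total per segment, before any containing-collection overhead  ~72 bytes
// which lands in the same ballpark as the ~70 bytes per segment noted above.
```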
This PR contains several adjustments to reduce memory usage for long plans. These were developed by analyzing memory dumps acquired while simulating a 5-year Clipper plan with about 7.5k activities. This plan was running out of memory in the default simulation configuration with the standard 32GB docker memory limit.
With the changes in this PR, combined with changes in a corresponding clipper-aerie PR, that plan now successfully simulates with a 32GB docker memory limit, I believe with at least a few GB to spare.
The changes in these PRs address several memory bottlenecks. In this specific Aerie PR those are:
- `ExtendedModelActions.spawn(RepeatingTask)` appears to be a form of tail recursion, but without tail call optimization this was using a lot of memory. There are several possible ways to address this; here I'm proposing a relatively minimal one that I feel is practical, though perhaps not as elegant as some other more complex possibilities. (A sketch of the pattern follows this list.)
- `SerializedValue` implementations intended to help avoid using a large number of `BigDecimal` instances in this codepath. This PR also adds an opt-in feature that allows resource profiles to use a form of run-length compression (sketched at the end of this description).
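To illustrate the first bottleneck, the sketch below uses made-up `spawn`/`delay` stand-ins rather than the real Aerie modeling actions; it only shows why re-spawning a task each period behaves like tail recursion without tail-call optimization, and what a loop-based alternative looks like (not necessarily the exact approach taken in this PR).

```java
import java.time.Duration;

// Hypothetical sketch only; names and signatures here are assumptions, not the real API.
final class RepeatingTaskSketch {
  // Stand-ins for the modeling actions; the real engine would schedule/suspend tasks here.
  static void spawn(Runnable task) { task.run(); }
  static void delay(Duration d) { /* the real engine would suspend the calling task */ }

  // The pattern this PR targets: each period ends by spawning a fresh copy of the task.
  // Every spawn leaves another task (and its frame history) behind in the engine, so a
  // multi-year plan accumulates memory roughly linearly with the number of repetitions:
  // tail recursion with no tail-call optimization.
  static void repeatRecursively(Duration period, Runnable step, int remaining) {
    if (remaining == 0) return;
    step.run();
    delay(period);
    spawn(() -> repeatRecursively(period, step, remaining - 1));
  }

  // A minimal alternative: one long-lived task that loops, so only a single frame is ever
  // live for the repetition.
  static void repeatIteratively(Duration period, Runnable step, int count) {
    for (int i = 0; i < count; i++) {
      step.run();
      delay(period);
    }
  }

  public static void main(String[] args) {
    repeatIteratively(Duration.ofMinutes(10), () -> System.out.println("tick"), 3);
  }
}
```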
This is still a draft PR for now for the following reasons:

- `./gradlew e2eTest`: looks like there are some regressions I will need to fix there.
- `compare_resources.py`: I still need to run this to make sure that there are no regressions in the simulation results. I suppose I will do this by running only some prefix of the test plan to stay within memory limits in the baseline configuration.
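For the opt-in run-length compression of resource profiles mentioned above, here is a minimal sketch of the general idea, using made-up types rather than the actual Aerie profile classes: consecutive segments with identical values collapse into a single stored segment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of run-length compression for a resource profile; the
// types and method names here are illustrative, not the actual Aerie classes.
final class RleProfileSketch {
  record Segment(long startMicros, double value) {}

  private final List<Segment> segments = new ArrayList<>();

  // Append a sample, but drop it when the value is unchanged, so long runs of
  // identical values collapse into a single stored segment.
  void append(long startMicros, double value) {
    if (!segments.isEmpty()
        && Double.compare(segments.get(segments.size() - 1).value(), value) == 0) {
      return; // value unchanged: the previous segment already covers this time
    }
    segments.add(new Segment(startMicros, value));
  }

  int segmentCount() { return segments.size(); }
}
```

For a profile dominated by long stretches of a constant value, this trades one comparison per append for a potentially large reduction in stored segments, which is why it helps most on multi-year plans; making it opt-in keeps the default behavior unchanged.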