DARMA-tasking / vt

DARMA/vt => Virtual Transport

Avoid large memory footprint of subphase load data #1242

Open nlslatt opened 3 years ago

nlslatt commented 3 years ago

What Needs to be Done?

If the app doesn't reset the subphase number on every rank at each timestep, storing subphase loads in dense vectors causes problems. A dense vector is indexed by subphase number, so its length tracks the largest subphase seen, and the stats data structures keep growing even when only a fixed number of phases' worth of data is retained. This increases both the memory footprint and the migration cost. It also causes many subphases' worth of zero data to be written to the stats files for every object on every timestep, making the files unmanageably large.
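To make the growth concrete, here is a minimal sketch of the failure mode. `DenseSubphaseLoads` and `finalPhaseVectorLength` are hypothetical names invented for this illustration, not vt's actual types, and the sketch assumes the dense vector is indexed directly by subphase number.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one object's subphase loads in one phase.
// Assumption: the vector index is the subphase number.
struct DenseSubphaseLoads {
  std::vector<double> loads;

  // Record a load, growing the vector with zeros up to `subphase`.
  void add(std::size_t subphase, double load) {
    assert(load >= 0.0);  // sketch precondition: loads are non-negative
    if (subphase >= loads.size()) {
      loads.resize(subphase + 1, 0.0);
    }
    loads[subphase] += load;
  }
};

// Simulate `num_phases` timesteps with `subphases_per_phase` subphases each.
// Only one phase of data is retained at a time. Returns the length of the
// dense vector held for the final phase.
std::size_t finalPhaseVectorLength(
  std::size_t num_phases, std::size_t subphases_per_phase,
  bool reset_each_phase
) {
  std::size_t subphase = 0;
  DenseSubphaseLoads last;
  for (std::size_t p = 0; p < num_phases; ++p) {
    if (reset_each_phase) {
      subphase = 0;
    }
    DenseSubphaseLoads current;
    for (std::size_t s = 0; s < subphases_per_phase; ++s) {
      current.add(subphase++, 1.0);
    }
    last = current;
  }
  return last.loads.size();
}
```

With 100 phases of 4 subphases each, `finalPhaseVectorLength(100, 4, true)` gives 4, while `finalPhaseVectorLength(100, 4, false)` gives 400, of which 396 entries are zeros that would still be stored, migrated, and written out.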

This problem occurred in empire and went unnoticed for a very long time: the subphase number was only being reset on rank 0. A mismatch like this would also break load models that consider subphase loads. I'm not sure if there's a good way to flag that the subphase numbers don't match between rank 0 and the other ranks.

We should consider changing the type of data structure used for storing subphase loads and/or forcing the subphase number to reset when the phase increments so that other apps don't experience this same problem.
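One possible shape for a different data structure is a sparse map keyed by subphase number, sketched below. `SparseSubphaseLoads` is a hypothetical class written for this comment, not an existing vt type, and the choice of `std::unordered_map` is only an assumption about what might fit.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical sparse alternative: only subphases that were actually
// recorded take up space, regardless of how large the subphase number is.
class SparseSubphaseLoads {
public:
  // Accumulate a load for the given subphase.
  void add(std::size_t subphase, double load) {
    assert(load >= 0.0);  // sketch precondition: loads are non-negative
    loads_[subphase] += load;  // a missing key starts at 0.0
  }

  // Load for a subphase; subphases never recorded read as zero.
  double get(std::size_t subphase) const {
    auto it = loads_.find(subphase);
    return it == loads_.end() ? 0.0 : it->second;
  }

  // Number of entries stored, and thus migrated or written out.
  std::size_t storedEntries() const { return loads_.size(); }

private:
  std::unordered_map<std::size_t, double> loads_;
};
```

The trade-off is that lookups are no longer a plain index, and any load model that expects a contiguous range of subphases has to treat missing entries as zero.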

@lifflander @PhilMiller @ppebay @bathmatt

nlslatt commented 3 years ago

vt will automatically reset the subphase to zero when incrementing the phase.

nlslatt commented 3 years ago

@lifflander I tested this change on the release branch, and it did not fix the empire problem: empire manages the subphase on its own, only writing the subphase to vt and never reading it back from vt. Are there any downsides to having apps call into ElementStats to increase the subphase number by n instead of performing that operation locally and then overwriting the vt subphase?
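A rough sketch of the difference between the two calling styles follows. `SubphaseCounter`, `advanceSubphase`, `setSubphase`, and `nextPhase` are hypothetical names made up for this illustration; they are not the actual ElementStats interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical counter owned by the runtime side.
class SubphaseCounter {
public:
  // Relative update: advance from whatever value the runtime holds now.
  void advanceSubphase(std::size_t n) { subphase_ += n; }

  // Absolute update: overwrite with a value computed by the app.
  void setSubphase(std::size_t s) { subphase_ = s; }

  // Moving to the next phase resets the subphase to zero.
  void nextPhase() { ++phase_; subphase_ = 0; }

  std::size_t phase() const { return phase_; }
  std::size_t subphase() const { return subphase_; }

private:
  std::size_t phase_ = 0;
  std::size_t subphase_ = 0;
};

// Run two phases of three subphase steps each and return the final
// subphase number, using either the relative or the overwrite style.
std::size_t subphaseAfterTwoPhases(bool use_relative_api) {
  SubphaseCounter counter;
  std::size_t app_local = 0;  // app-managed copy that is never reset
  for (int p = 0; p < 2; ++p) {
    for (int s = 0; s < 3; ++s) {
      if (use_relative_api) {
        counter.advanceSubphase(1);
      } else {
        counter.setSubphase(++app_local);
      }
    }
    if (p == 0) {
      counter.nextPhase();
      assert(counter.subphase() == 0);  // the reset did happen
    }
  }
  return counter.subphase();
}
```

Here `subphaseAfterTwoPhases(true)` ends at 3 because the reset survives, while `subphaseAfterTwoPhases(false)` ends at 6 because the app's local counter overwrites the reset value on the next write.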

nlslatt commented 3 years ago

The problem that inspired opening this issue has been addressed on the empire end, so this isn't time sensitive.