fmihpc / vlasiator

Vlasiator - ten letters you can count on
https://www.helsinki.fi/en/researchgroups/vlasiator

Load balance at restart is memory-intensive #1013

Open ykempf opened 1 month ago

ykempf commented 1 month ago

Our restart-reading scheme is inefficient, at least in terms of the memory high-water mark (HWM).

We read the block counts and try to spread those evenly, but the load balance then reshuffles things based on the LB_WEIGHT that is read in during the second stage. That leads to massive rejigging of MPI domains and a significant peak in HWM. I assume this grew organically, but it would seem more logical to simply read in the LB_WEIGHT first, balance according to that, and only then read in the rest of the data. That should reduce the initial memory peak seen in current investigations.
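To illustrate why the two stages disagree, here is a minimal toy sketch. It assumes cells are laid out in a 1D order and split into contiguous chunks by cumulative weight; this is not how Vlasiator's actual partitioner works, and `partitionByWeight` and `cellsToMigrate` are hypothetical helpers made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not Vlasiator code): assign each cell to a rank so that
// contiguous runs of cells carry roughly total/nRanks weight each.
std::vector<int> partitionByWeight(const std::vector<double>& weights, int nRanks) {
    assert(nRanks > 0);
    const double total = std::accumulate(weights.begin(), weights.end(), 0.0);
    std::vector<int> owner(weights.size(), 0);
    double prefix = 0.0;  // summed weight of all cells before cell i
    for (std::size_t i = 0; i < weights.size(); ++i) {
        // Place the cell according to the midpoint of its weight interval.
        const double mid = prefix + 0.5 * weights[i];
        int rank = total > 0.0 ? static_cast<int>(mid / total * nRanks) : 0;
        if (rank >= nRanks) rank = nRanks - 1;  // guard against rounding at the end
        owner[i] = rank;
        prefix += weights[i];
    }
    return owner;
}

// Hypothetical helper: number of cells whose owning rank differs between
// two decompositions, i.e. cells that would have to be sent over MPI.
std::size_t cellsToMigrate(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::size_t moved = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++moved;
    }
    return moved;
}
```

Calling `partitionByWeight` once with block counts and once with the LB_WEIGHT values, then passing both results to `cellsToMigrate`, gives the number of cells that the second-stage balance would move after their data is already in memory. Partitioning on LB_WEIGHT before reading the bulk data would make that number zero in this toy model.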

It's not impossible that I will file a patch for this soonish, but I have a certain manuscript waiting for me... If anyone picks this up, I'll be grateful. :)

ykempf commented 1 month ago

Now, as @markusbattarbee pointed out, reading LB_WEIGHT first would turn the current sequential read into a tricky mesh of small reads, so it may not be worth bothering with right now. If a run can't reshuffle at restart, it probably won't fit well in memory at runtime either.