fmihpc / vlasiator

Vlasiator - ten letters you can count on
https://www.helsinki.fi/en/researchgroups/vlasiator

Restart writing upon bailout not always working #53

Closed ykempf closed 9 years ago

ykempf commented 9 years ago

This issue is to record cases where bailout does not do its job properly. One example follows below; simpler tests did not reproduce the bug. If anyone observes anomalous behaviour of the bailout mechanism, please report it here.

First example: BBD Magnetosphere run on 80 nodes on voima. Restarted at 300 s, became unstable, NaNs were created, and bailout happened at the next file writeout because 2 processes detected NaNs. The bailout was clean but no restart was written out, although the flags were not set and the default is 1/true, i.e. a restart should be written on a basic bailout.

ykempf commented 9 years ago

/univ_2/ws3/ws/ipryakem-BCC-0/run just failed upon STOP: flags set but restart not written. One note: the restart file flags were invalid because I used the queue wizard cfg with a regularly submitted job.

ykempf commented 9 years ago

I think I spotted the problem at least for the latter case. BBD is not on record any more so I cannot check there.

In the Hornet run I had invalid restart flags so that it defaulted to -1. Thus the logic in https://github.com/fmihpc/vlasiator/blob/master/vlasiator.cpp#L487 performed flawlessly. I'll test and propose a patch shortly.
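
To illustrate the kind of failure described above, here is a minimal sketch, not the actual Vlasiator code. It assumes a tri-state flag where -1 marks a missing or invalid value, and all names (`parseRestartFlag`, `shouldWriteRestartStrict`, `shouldWriteRestartWithDefault`, `kRestartFlagInvalid`) are hypothetical. It shows how a plain `== 1` check would silently skip the restart when the flag falls back to -1, and one possible way to fall back to the documented default instead.

```cpp
#include <string>

// Hypothetical tri-state flag (assumption, not taken from the Vlasiator source):
//   1 = write a restart on bailout, 0 = do not, -1 = flag missing or invalid.
constexpr int kRestartFlagInvalid = -1;

// Hypothetical parser: anything other than "0" or "1" yields the sentinel.
int parseRestartFlag(const std::string& raw) {
    if (raw == "1") return 1;
    if (raw == "0") return 0;
    return kRestartFlagInvalid;
}

// Fragile variant: the sentinel is treated like "false", so an invalid
// flag silently suppresses the restart.
bool shouldWriteRestartStrict(int flag) {
    return flag == 1;
}

// Defensive variant: an invalid flag falls back to a default
// (true here, matching the "default is 1/true" mentioned in this issue).
bool shouldWriteRestartWithDefault(int flag, bool defaultValue = true) {
    if (flag == kRestartFlagInvalid) return defaultValue;
    return flag == 1;
}
```

With an invalid flag string, `shouldWriteRestartStrict(parseRestartFlag("garbage"))` is false while `shouldWriteRestartWithDefault(parseRestartFlag("garbage"))` is true. Whether the real fix takes this form is not stated in this thread.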

ykempf commented 9 years ago

Fixed in #126. If there are any new occurrences, we can reopen this.