ascot4fusion / ascot5

ASCOT5 is a high-performance orbit-following code for fusion plasma physics and engineering
https://ascot4fusion.github.io/ascot5/
GNU Lesser General Public License v3.0

Restart option for very time consuming jobs #114

Open rui-coelho opened 4 months ago

rui-coelho commented 4 months ago

Is there an option (seems not...) to have restart jobs when running ASCOT5 with MPI=1 for very demanding runs, e.g. fine 5D grids + 10M markers? On a typical axisymmetric tokamak case I'm running with 10M markers and grid resolution

"DIST_NBIN_R":100,"DIST_NBIN_Z":180,"DIST_NBIN_PPA":150,"DIST_NBIN_PPE":100

I am getting an estimated runtime from the stdout of ~170h on 22 nodes (1056 cores). Even if the stdout estimate might not be fully accurate, in my experience it isn't that far from the actual value either.

I would envision a sequenced, checkpoint+restart approach where one could indicate the "progress" stopping value in the data.options. During the code execution the stdout reads something like:

Progress: 649/61653, 1.05 %. Time spent: 0.33 h, estimated time to finish: 31.33 h

Even if not "precise", ASCOT knows how far it needs to step the marker evolution, since in this particular case 61653*(8 nodes) = 493224, which is close to the 500k markers I set in that run. In the longer run I am (trying) to simulate, the stdout reads

Progress: 20240/448495, 4.51 %. Time spent: 7.68 h, estimated time to finish: 162.57 h

and this makes sense again, since 448495*(22 nodes) = 9866890, which is close to the 10M markers I set for the simulation. So, in principle it should be doable to know when to "checkpoint" the run.......as for the restart.....good question...

miekkasarki commented 4 months ago

I support this feature. We already have a "worker" thread that monitors the simulation and whose only job right now is to print that progress file. In principle, we could signal that thread that "please stop my simulation ASAP" and that worker thread then sets MAX_CPU_TIME end condition for all markers that are currently being simulated or whose simulation has not yet started. Then the whole particle queue would be flushed within a minute or so, and you would get your intermediate results stored in the HDF5 file.
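A minimal sketch of the kind of monitor thread described above, in Python for illustration (ASCOT5 itself is C). The "stop" file name and the `stop_requested` flag are assumptions for this sketch, not ASCOT5's actual internals; in the real code the worker thread would instead set the MAX_CPU_TIME end condition for the active markers:

```python
import os
import threading
import time

# Flag the simulation loop would check; in ASCOT5 the equivalent action
# would be setting the MAX_CPU_TIME end condition for all queued markers.
stop_requested = threading.Event()

def monitor(stop_file="stop", poll_interval=0.2):
    """Poll for a stop file and request a graceful shutdown."""
    while not stop_requested.is_set():
        if os.path.exists(stop_file):
            # Simulation loop sees the flag, flushes the marker queue,
            # and writes intermediate results to the HDF5 file.
            stop_requested.set()
            break
        time.sleep(poll_interval)

t = threading.Thread(target=monitor, daemon=True)
t.start()
```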

As for the signal, what you suggest would probably work in your case, where the progress meter can be trusted. However, I would prefer that the job be terminated gracefully in two cases:

  1. SLURM signals that the job is approaching the time limit at which it would be forcefully terminated
  2. User wants to terminate the run earlier and sends the signal him/herself

These would work for you, right? The signal could be something as simple as creating a file called "stop" in the same folder where the job was launched. However, there are two open questions:

  1. How can we make SLURM generate such a file, or could we somehow pass the time limit from SLURM to ascot at the beginning of the simulation?
  2. What if the user is running multiple simulations in the same folder? In this case the file should be something like "stop_<JOBID>", and again we would have to communicate the JOBID from SLURM to ascot somehow.
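On question 1, SLURM can already deliver a signal ahead of the time limit: `sbatch --signal=B:USR1@300` sends SIGUSR1 to the batch script 5 minutes before the limit, and `SLURM_JOB_ID` is exported to the job's environment, which also addresses question 2. A hedged Python sketch of the receiving end; the per-job stop-file convention is the one proposed above, not an existing ASCOT5 feature:

```python
import os
import signal

# Per-job stop file so several simulations can share a folder;
# SLURM exports SLURM_JOB_ID to the job's environment.
jobid = os.environ.get("SLURM_JOB_ID", "local")
stop_file = f"stop_{jobid}"

def request_stop(signum, frame):
    """On SIGUSR1 (e.g. from sbatch --signal=B:USR1@300), create the stop
    file that the monitor thread is polling for."""
    open(stop_file, "w").close()

signal.signal(signal.SIGUSR1, request_stop)
```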

rui-coelho commented 4 months ago

I was thinking of something much more basic. Imagine we have an initial-value code to simulate the time evolution of an instability. We know we are going overboard in terms of maximum runtime, and on MARCONI this typically means 24h. I first need a rough estimate of how many time steps this translates to, and then I can set the number of time steps accordingly: the first run does 1M time steps (1-1,000,000), the second run goes from 1,000,000 to 2,000,000, and so on and so forth. Now, if ASCOT runs the markers "sequentially", i.e. dispatching, let's say, 1000 markers until the end condition is met, then the next 1000 and so on......one could "trivially" instruct the code to dispatch only the "first" 1,000,000 markers and then store the result in the HDF5 file. The next call to ASCOT, however, would have to know which set of 1,000,000 markers was already dealt with, and then dispatch the next set of 1,000,000 markers. Very likely, for this to work, one should have an extra OPTIONS key to specify the "sequence number" of the multi-stage run, so that the stupid code knows which set of 1,000,000 markers to launch in the run.....and of course the number of markers to "push" in each "sequence" should also be an OPTIONS key in the dictionary......
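The batching arithmetic above can be sketched as follows; the option keys `SEQUENCE_NUMBER` and `MARKERS_PER_SEQUENCE` are invented for illustration and do not exist in ASCOT5's options:

```python
def marker_batch(n_markers, options):
    """Return the (start, stop) marker-index range for one stage of a
    multi-stage run.

    Hypothetical options keys, per the scheme proposed above:
      SEQUENCE_NUMBER      -- which stage of the run this is (0, 1, 2, ...)
      MARKERS_PER_SEQUENCE -- how many markers each stage dispatches
    """
    seq = options["SEQUENCE_NUMBER"]
    per = options["MARKERS_PER_SEQUENCE"]
    start = seq * per
    stop = min(start + per, n_markers)  # last stage may be shorter
    return start, stop

# Example: 10M markers split into stages of 1M each; stage 3 covers
# markers 3,000,000 .. 4,000,000.
start, stop = marker_batch(10_000_000, {"SEQUENCE_NUMBER": 3,
                                        "MARKERS_PER_SEQUENCE": 1_000_000})
```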

rui-coelho commented 4 months ago

....since ASCOT does not do beam-beam reactions, it should be doable to implement, since in reality once a given marker meets its end (poor guy....) it can R.I.P., right?

So....we could potentially break up a run that has 10M markers into 10 runs of 1M each, or 20 runs of 500k each (the number of markers I have been using most frequently...), in sequence, and update the hdf5 file as the sequences evolve....and since the markers are all "tagged" with metadata, we could even check which ones have met their fate and which ones are still waiting to go to the slaughter.....(too much jambon hanging and eating while at Salamanca....apologies for the analogies...)
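Since each marker carries an ID and an end-condition tag in the output, restart bookkeeping could in principle reduce to a set difference. A generic sketch; in practice `finished` would come from the endstate in the HDF5 output, whose layout is assumed here rather than read from ASCOT5:

```python
def markers_to_restart(all_ids, finished):
    """Markers still waiting for the slaughter: those whose ID has no
    recorded end condition yet. `finished` stands in for the IDs with an
    endstate in the HDF5 output (hypothetical bookkeeping, not ASCOT5's
    real layout)."""
    return sorted(set(all_ids) - set(finished))

# Markers 1..10, of which 1, 2, 5 and 9 have already met their fate:
remaining = markers_to_restart(range(1, 11), finished=[1, 2, 5, 9])
```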