aiidateam / aiida-quantumespresso

The official AiiDA plugin for Quantum ESPRESSO
https://aiida-quantumespresso.readthedocs.io

RelaxWorkChain should use restart_mode restart when max_wall #968


AndresOrtegaGuerrero commented 10 months ago

The RelaxWorkChain should set restart_mode = 'restart' when the max walltime is reached; instead, it currently starts the simulation again from scratch.
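At the pw.x input level, the request amounts to something like the following (a minimal sketch with illustrative values; the actual change would live in the work chain's error handler):

```python
# Sketch of the relevant pw.x &CONTROL keywords (values are illustrative):
parameters = {
    'CONTROL': {
        'calculation': 'vc-relax',
        'restart_mode': 'restart',  # continue from the .save checkpoint
        'max_seconds': 3420,        # soft walltime so QE can shut down cleanly
    },
}
```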

mbercx commented 10 months ago

Thanks @AndresOrtegaGuerrero! Could you explain why you think the restart_mode should be set to 'restart'? I remember I proposed switching to the charge density in a comment on https://github.com/aiidateam/aiida-quantumespresso/pull/722, but we weren't sure restarting from the charge density/wave functions was a good idea (which sort of depends on where QE does the soft kill). So we left the current approach:

https://github.com/aiidateam/aiida-quantumespresso/blob/5db3b28ca5e067e63e59fdfdf6be8362efc7d223/src/aiida_quantumespresso/workflows/pw/base.py#L432-L449

This does update the input structure for the next calculation, but restarts "from scratch", i.e. it sets the following inputs:

https://github.com/aiidateam/aiida-quantumespresso/blob/5db3b28ca5e067e63e59fdfdf6be8362efc7d223/src/aiida_quantumespresso/workflows/pw/base.py#L281-L285
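Roughly, those lines set the following (paraphrased from memory, not the verbatim source):

```python
# Paraphrased sketch of the linked "from scratch" restart settings
# (inside PwBaseWorkChain, not a standalone snippet):
self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'from_scratch'
self.ctx.inputs.parameters['ELECTRONS'].pop('startingpot', None)
self.ctx.inputs.parameters['ELECTRONS'].pop('startingwfc', None)
self.ctx.inputs.pop('parent_folder', None)  # no files carried over from the previous run
```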

I suppose it might be more efficient to restart from the charge density/wave functions, but at the time I wasn't sure why the restart was set up this way, and hence felt it was better to sacrifice some efficiency instead of potentially breaking things or (even worse) producing incorrect results.

AndresOrtegaGuerrero commented 10 months ago

@mbercx So in case you are doing a long geometry optimization and QE saved its state when hitting max_walltime, QE can do a proper restart and continue the geometry optimization, or even the SCF. However, this only works if QE did the save correctly (so it is important to check in the output that this was indeed done). I have been doing optimizations with DFT+U (using the work chain) and I noticed that without a proper restart the simulations can consume considerably more time and resources. It can also turn a simulation that was close to converging into a whole new one that takes a long time to reach convergence again. If the system has more than 100 atoms, a proper restart is quite important, especially if the simulation was converging.

mbercx commented 10 months ago

Yeah, I can see how for larger structures it can make a significant difference.

However, this only works if QE did the save correctly (so it is important to check in the output that this was indeed done).

The error handler above only handles cases where Maximum CPU time exceeded was in the output:

https://github.com/aiidateam/aiida-quantumespresso/blob/17e173f11c75142755bc9f3c9a71160d5aba778c/src/aiida_quantumespresso/parsers/parse_raw/pw.py#L244

So in principle this should indicate a clean shutdown. I'd be inclined to accept switching to a full restart here with restart_mode = 'restart'. Do you see any issues with this @pietrodelugas?
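A hypothetical sketch of what that handler change could look like (names follow the PwBaseWorkChain handler linked above, paraphrased from memory, not verbatim):

```python
# Inside the out-of-walltime handler: restart QE from its checkpoint
# instead of from scratch (hypothetical modification).
# Assumes: from aiida.common import exceptions
try:
    self.ctx.inputs.structure = calculation.outputs.output_structure
except exceptions.NotExistent:
    pass  # no output structure yet; keep the input structure
self.ctx.inputs.parameters['CONTROL']['restart_mode'] = 'restart'
self.ctx.inputs.parent_folder = calculation.outputs.remote_folder  # reuse the .save dir
```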

AndresOrtegaGuerrero commented 10 months ago

@mbercx, it is important to guarantee that QE indeed did the clean shutdown; sometimes this shutdown doesn't take place (maybe because there is not enough time left, or because it is an expensive calculation like a hybrid).

unkcpz commented 5 months ago

@AndresOrtegaGuerrero, do you think more discussion is required? I don't see a big problem with setting restart_mode = 'restart'. The only concern I have is that if the calculation is just hard to converge, hitting the walltime will always make it restart and the work chain will never finish. Can we set a max_restart to avoid that case?

AndresOrtegaGuerrero commented 5 months ago

I will try to use this restart only for the Relax work chain, and only when there is a MAX_WALLTIME error.

mbercx commented 5 months ago

If QE did a clean shutdown, I'm also fine with doing a full restart. @AndresOrtegaGuerrero I can quickly open a PR for this if you're not already doing so?

Can we set a max_restart to avoid that case?

The BaseRestartWorkChain already has a limit on how many times the work chain will restart (max_iterations, defaults to 5):

https://github.com/aiidateam/aiida-core/blob/06ea130df8854f621e25853af6ac723c37397ed0/src/aiida/engine/processes/workchains/restart.py#L143-L148
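Usage sketch for capping the restarts (max_iterations is a standard input of any BaseRestartWorkChain subclass, so it applies to PwBaseWorkChain as well):

```python
from aiida import orm
from aiida.plugins import WorkflowFactory

# Cap the number of restart iterations of the base work chain:
PwBaseWorkChain = WorkflowFactory('quantumespresso.pw.base')
builder = PwBaseWorkChain.get_builder()
builder.max_iterations = orm.Int(3)  # give up after three failed iterations
```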

sphuber commented 5 months ago

@mbercx, it is important to guarantee that QE indeed did the clean shutdown; sometimes this shutdown doesn't take place (maybe because there is not enough time left, or because it is an expensive calculation like a hybrid).

I want to highlight this. It is possible, and I have seen it often in practice, that QE initiates a soft shutdown because the walltime was exceeded (and so it prints Maximum CPU time exceeded) but then doesn't manage to write everything to disk in time, so the scheduler still kills the job, leaving the restart files corrupt. So we cannot just check the message written by QE; we also have to verify that the scheduler didn't kill the job. By default I think we take 95% of the allocated walltime of the scheduler job and set that as the soft walltime for QE. If the scenario above happens a lot, we should reduce this percentage to give QE more time to complete the shutdown.
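In input terms, the scheme is roughly the following (the factor value is from memory, so treat it as an assumption):

```python
# QE's soft walltime is set to a fraction of the scheduler's hard walltime,
# so QE has time to write its restart files before the scheduler kills the job:
parameters = {'CONTROL': {}}  # pw.x input parameters being prepared
max_wallclock_seconds = 3600  # requested from the scheduler (illustrative)
safety_factor = 0.95          # lower this if soft shutdowns keep getting cut off
parameters['CONTROL']['max_seconds'] = int(max_wallclock_seconds * safety_factor)
```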

mbercx commented 5 months ago

Thanks @sphuber, I see your point. But we already distinguish between ERROR_OUT_OF_WALLTIME (clean shutdown) and ERROR_OUT_OF_WALLTIME_INTERRUPTED, no?

https://github.com/aiidateam/aiida-quantumespresso/blob/b517686df8cd33765ef12cbf430620ea8e94f137/src/aiida_quantumespresso/parsers/pw.py#L159-L162

I suppose it's safe to assume that, if Maximum CPU time exceeded is in the stdout and JOB DONE is printed, the shutdown was a clean one?

sphuber commented 5 months ago

Thanks @sphuber, I see your point. But we already distinguish between ERROR_OUT_OF_WALLTIME (clean shutdown) and ERROR_OUT_OF_WALLTIME_INTERRUPTED, no?

https://github.com/aiidateam/aiida-quantumespresso/blob/b517686df8cd33765ef12cbf430620ea8e94f137/src/aiida_quantumespresso/parsers/pw.py#L159-L162

I suppose it's safe to assume that, if Maximum CPU time exceeded is in the stdout and JOB DONE is printed, the shutdown was a clean one?

That depends, if JOB DONE is only printed after QE has finished writing all state to disk, then yes. But I don't remember for sure if that is the case.

AndresOrtegaGuerrero commented 5 months ago

It would be nice to include a check whether "JOB DONE" is printed; in case you have a big system and a hybrid functional, you can then guarantee you have the wave functions to do a restart (even in an SCF calculation).

mbercx commented 5 months ago

It would be nice to include a check whether "JOB DONE" is printed

This is already done, and in case it is missing ERROR_OUTPUT_STDOUT_INCOMPLETE is added to the error logs.

https://github.com/aiidateam/aiida-quantumespresso/blob/b517686df8cd33765ef12cbf430620ea8e94f137/src/aiida_quantumespresso/parsers/parse_raw/pw.py#L319-L325
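Roughly, that check does the following (paraphrased sketch, not the verbatim source; variable names are assumptions):

```python
# Only flag an incomplete stdout when QE did not write a CRASH file:
if crash_file is None and 'JOB DONE' not in stdout:
    logs.error.append('ERROR_OUTPUT_STDOUT_INCOMPLETE')
```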

I now notice that this is not done in case there is a CRASH file. I suppose the idea is that the failure can then be retrieved from this file and a more specific exit code can be returned? @bastonero do you remember why this was added in https://github.com/aiidateam/aiida-quantumespresso/commit/7f53c96bf744ed622fe7b1d13b0f7e9198ed5656?

But I don't remember for sure if that is the case.

I'm not 100% sure either, but it seems pretty reasonable, since the following line is printed before this:

Writing all to output data dir ./out/aiida.save/

So if the calculation were interrupted during writing, I think the final line in the stdout would be this one, and hence JOB DONE would not be printed.
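If that holds, a conservative check for whether a restart is safe could look like this (a hypothetical helper, not in the codebase):

```python
def shutdown_was_clean(stdout: str) -> bool:
    """Treat the shutdown as clean only if QE both announced the soft kill
    and reached the final JOB DONE marker, which is printed after the
    .save directory has been written."""
    return 'Maximum CPU time exceeded' in stdout and 'JOB DONE' in stdout
```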

bastonero commented 5 months ago

I now notice that this is not done in case there is a CRASH file. I suppose the idea is that the failure can then be retrieved from this file and a more specific exit code can be returned? @bastonero do you remember why this was added in 7f53c96?

Indeed, otherwise, say there is a recoverable error (e.g. Cholesky), the stdout incomplete error would be exposed, hence making it unrecoverable. Nevertheless, if the run is interrupted while writing, there will be no CRASH file, as it is written by QE only when it encounters an error.

I'm not 100% sure either, but it seems pretty reasonable, since the following line is printed before this:

Writing all to output data dir ./out/aiida.save/

So if the calculation were interrupted during writing, I think the final line in the stdout would be this one, and hence JOB DONE would not be printed.

Indeed, so it would be unrecoverable. Although one might try to restart using e.g. the charge density only as a compromise, as it is probably the first and fastest thing that QE would dump. But nevertheless, wouldn't it be worth trying to restart either way? Or would bad things happen?
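The charge-density-only compromise would use QE's startingpot keyword (a real pw.x keyword; the wiring into the work chain is hypothetical):

```python
# Compromise restart: read only the charge density, recompute the wave functions.
parameters = {
    'CONTROL': {'restart_mode': 'from_scratch'},
    'ELECTRONS': {'startingpot': 'file'},  # read rho from the parent .save dir
}
```

Note that startingpot = 'file' also requires passing the previous remote folder as parent_folder, so that the charge density is actually available.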

mbercx commented 5 months ago

Indeed, otherwise, say there is a recoverable error (e.g. Cholesky), the stdout incomplete error would be exposed, hence making it unrecoverable.

This of course depends on how the PwParser deals with the errors in the logs and which one it decides to return. This logic is now a bit convoluted and sometimes hard to follow, but I'd say we can simply add ERROR_OUTPUT_STDOUT_INCOMPLETE to the logs in all cases and choose to return a more specific exit code instead. But that's a matter to deal with when refactoring the parser.
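As a sketch, the refactor could always record the generic error but prefer a more specific one at return time (the exit-code names here are purely illustrative):

```python
# Hypothetical parser refactor: always record the generic error...
logs.error.append('ERROR_OUTPUT_STDOUT_INCOMPLETE')

# ...then return the most specific exit code present (names illustrative):
priority = ['ERROR_COMPUTING_CHOLESKY', 'ERROR_OUT_OF_WALLTIME', 'ERROR_OUTPUT_STDOUT_INCOMPLETE']
for name in priority:
    if name in logs.error:
        return getattr(self.exit_codes, name)
```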

But nevertheless, wouldn't it be worth trying to restart either way? Or would bad things happen?

I was also wondering about this. If a file is corrupted, will QE crash when trying to read it, or simply ignore the file?

bastonero commented 5 months ago

That's true. I think in order to avoid issues we decided to opt for this solution, so that we were certain it wouldn't return ERROR_OUTPUT_STDOUT_INCOMPLETE. But that should be testable anyway.