Implement `base_restart` for KKRcalculation to fix known / common issues

Use the base_restart functionality of aiida-core v1.1.0 to overcome common technical problems and common problems with the input that can be fixed easily and automatically.

A collection of the necessary 'expert knowledge' can be found here: https://iffmd.fz-juelich.de/dRKTeQwmS8uThlwkUJowyw?view

@PhilippRue I have started the implementation of this for the host code and for the voronoi code. I think that many of the features in kkr_scr and voro_start can be added handled in this way as they would basically make use of the process_handler to handle these states.

However, looking at the way in which you have designed the kkr_scf it is my impression that that is not the plan you have on mind. E.g. normally I would handle the lack of convergence of the SCF state by using the process_handler to increase/decrease the mixing, number of steps, etc. But the kkr_scf takes care of this basically, so I wonder how much of the kkr_scf should be in the BaseRestartWorkchain and how much should be left outside?

A similar question happens for the voro_start where you have a lot of the error handling in the voro_start, so the question is, how much of that should be handled via the BaseRestartWorkchain and how much should be done manually?

When these workflows were developed there was no base restart functionality. That's why all error handling that is done so far is in there. Maybe it still makes sense to separate technical issues (out of memory errors or dimension errors if array dimensions are too small) from convergence issues (cluster radius). We could for example inherit from base restart in the kkr_scf and extend the process_handler functionality to handle the mixing parameter etc. However this is probably a major refactoring of the kkr_scf and voro_start workflows.

In that way we could simply use the base restart instead of run a KKR calculation in the workflows like dos where all the convergence issues etc are not present.

To sum up I suggest this structure:

BaseRestartWorkchain: only technical issues that are helpful for both KKR and voronoi (i.e. not code specific)
RestartKKR: inherits from BaseRestartWorkChain and add KKR specific things (array dimensions etc). In the same way RestartVoro could deal with the voronoi code.
kkr_scf inherits from RestartKKR and extend the pricess_handler to deal with the convergence issues. In the beginning this could also just call RestartKKR instead of a KkrCalculation until the refactoring process is complete

Let me know what you think.

When these workflows were developed there was no base restart functionality. That's why all error handling that is done so far is in there.

Sure I understand, it is what made sense to do in that time.

Maybe it still makes sense to separate technical issues (out of memory errors or dimension errors if array dimensions are too small) from convergence issues (cluster radius). We could for example inherit from base restart in the kkr_scf and extend the process_handler functionality to handle the mixing parameter etc. However this is probably a major refactoring of the kkr_scf and voro_start workflows.

That could be, however how we detect the out of memory errors? Is that something that is detected by the parser? or that it can be calculated on the fly? Normally one can get these ones on the stdout or stderr, which I do not know are parsed right now.

In that way we could simply use the base restart instead of run a KKR calculation in the workflows like dos where all the convergence issues etc are not present.

Yes I agree, the base_kkr should in principle be "dumb" i.e. just handle running a single kkr calculation in the simplest way possible.

To sum up I suggest this structure:

BaseRestartWorkchain: only technical issues that are helpful for both KKR and voronoi (i.e. not code specific)

RestartKKR: inherits from BaseRestartWorkChain and add KKR specific things (array dimensions etc). In the same way RestartVoro could deal with the voronoi code.

kkr_scf inherits from RestartKKR and extend the pricess_handler to deal with the convergence issues. In the beginning this could also just call RestartKKR instead of a KkrCalculation until the refactoring process is complete

Let me know what you think.

I think it is something that might work, however, I wonder about the complexity, if there are three workchains that in principle handle how to do a KKR calculation (90% of the time you want to do a SCF calculation) I wonder if this would lead to confusion in their usage and or expansion. I think that the approach of using the BaseRestartWorkchain as done by the QE guys to be the simplest , I think that if you put that to be a RestartKKR(BaseRestartWorkchain) which inside calls as a process the base_kkr workchain, things can get very complicated quite fast.

I'm open to help with any solution, but I would like to coordinate with you to ensure that things are well planned out.

That could be, however how we detect the out of memory errors? Is that something that is detected by the parser? or that it can be calculated on the fly? Normally one can get these ones on the stdout or stderr, which I do not know are parsed right now.

Parsing the stderr is an option but at least for slurm there exist some parsing in aiida-core already. However I don't know about the other schedulers.

I think it is something that might work, however, I wonder about the complexity, if there are three workchains that in principle handle how to do a KKR calculation (90% of the time you want to do a SCF calculation) I wonder if this would lead to confusion in their usage and or expansion. I think that the approach of using the BaseRestartWorkchain as done by the QE guys to be the simplest , I think that if you put that to be a RestartKKR(BaseRestartWorkchain) which inside calls as a process the base_kkr workchain, things can get very complicated quite fast.

I'm open to help with any solution, but I would like to coordinate with you to ensure that things are well planned out.

Yes, maybe that structure is clearer. So the idea would be to use BaseRestartWorkchain from aiida-core in BaseKKR(BaseRestartWorkchain) which replaces the kkr_scf workchain, right?

Then whenever a KkrCalculation is used now we just use BaseKKR instead with some settings to not do the scf functionality?

That could be, however how we detect the out of memory errors? Is that something that is detected by the parser? or that it can be calculated on the fly? Normally one can get these ones on the stdout or stderr, which I do not know are parsed right now.

Parsing the stderr is an option but at least for slurm there exist some parsing in aiida-core already. However I don't know about the other schedulers.

Right had forgotten about that one, need to check if it works for other schedulers.

I think it is something that might work, however, I wonder about the complexity, if there are three workchains that in principle handle how to do a KKR calculation (90% of the time you want to do a SCF calculation) I wonder if this would lead to confusion in their usage and or expansion. I think that the approach of using the BaseRestartWorkchain as done by the QE guys to be the simplest , I think that if you put that to be a RestartKKR(BaseRestartWorkchain) which inside calls as a process the base_kkr workchain, things can get very complicated quite fast. I'm open to help with any solution, but I would like to coordinate with you to ensure that things are well planned out.

Yes, maybe that structure is clearer. So the idea would be to use BaseRestartWorkchain from aiida-core in BaseKKR(BaseRestartWorkchain) which replaces the kkr_scf workchain, right?

Then whenever a KkrCalculation is used now we just use BaseKKR instead with some settings to not do the scf functionality?

Yes that is what I was thinking, one needs to be clever on how to ensure that if you want to use the single KKR calculation that does not require scf, you do not end up in a situation in which the process handler is activated, so this is something that one must handle in the parser state.

I think the tricky thing is that in other codes, such as QE and VASP, the base workchain is used in all the other workchains (such as dos, band structure, etc) to create a turn key solution, thing which is not the case here. So one needs to be quite careful on how to ensure that those functionalities are not broken, if one wishes to set the main process to be the BaseKKR

Yes that is what I was thinking, one needs to be clever on how to ensure that if you want to use the single KKR calculation that does not require scf, you do not end up in a situation in which the process handler is activated, so this is something that one must handle in the parser state.

I think the tricky thing is that in other codes, such as QE and VASP, the base workchain is used in all the other workchains (such as dos, band structure, etc) to create a turn key solution, thing which is not the case here. So one needs to be quite careful on how to ensure that those functionalities are not broken, if one wishes to set the main process to be the BaseKKR

Yes, but there is a limited number of things that can be seen when parsing the output. In most cases the shape of the energy contour (e.g. NPOL=0 is always special) or the number of iterations (non-scf runs usually have NSTEPS=1) are special for the non-scf runs. Thus we can come up with a certain set of rules when the scf functionality is by default deactivated + we should provide the possibility to overwrite this in the inputs to the base workchain.

But on the bright side we would get closer to having turn-key workflows ;)

Yes that is what I was thinking, one needs to be clever on how to ensure that if you want to use the single KKR calculation that does not require scf, you do not end up in a situation in which the process handler is activated, so this is something that one must handle in the parser state. I think the tricky thing is that in other codes, such as QE and VASP, the base workchain is used in all the other workchains (such as dos, band structure, etc) to create a turn key solution, thing which is not the case here. So one needs to be quite careful on how to ensure that those functionalities are not broken, if one wishes to set the main process to be the BaseKKR

Yes, but there is a limited number of things that can be seen when parsing the output. In most cases the shape of the energy contour (e.g. NPOL=0 is always special) or the number of iterations (non-scf runs usually have NSTEPS=1) are special for the non-scf runs. Thus we can come up with a certain set of rules when the scf functionality is by default deactivated + we should provide the possibility to overwrite this in the inputs to the base workchain.

But on the bright side we would get closer to having turn-key workflows ;)

True ;) I'll be working on the refactoring a bit to see what can be done. I'll keep you posted on this.

JuDFTteam / aiida-kkr

Implement `base_restart` for KKRcalculation to fix known / common issues #34