aiidateam / aiida-quantumespresso

The official AiiDA plugin for Quantum ESPRESSO
https://aiida-quantumespresso.readthedocs.io

New handler for 520 pw errors (history already reset at previous step: stopping) #637

Closed: giovannipizzi closed this issue 3 months ago

giovannipizzi commented 3 years ago

While running many PwRelax workflows, I encountered a number of 520 errors from PwCalculation (handle_relax_recoverable_ionic_convergence_error_bfgs), which arise when the output contains history already reset at previous step: stopping from the BFGS algorithm.

From my understanding, this happens when we are already quite close to the energy minimum and the "noise" on the gradient is too large, so the algorithm doesn't know where to move and stops. In practice, in most cases we have essentially already reached the minimum.

A practical solution in these cases is to switch to damped dynamics, which, while generally slower, is more robust and works fine in this case.

I've therefore implemented the following handler, whose diff is this:

     @process_handler(priority=560, exit_codes=[
         PwCalculation.exit_codes.ERROR_IONIC_CONVERGENCE_NOT_REACHED,
         PwCalculation.exit_codes.ERROR_IONIC_CYCLE_EXCEEDED_NSTEP,
+        ])
+    def handle_relax_recoverable_ionic_convergence_error(self, calculation):
+        """Handle various exit codes for recoverable `relax` calculations with failed ionic convergence.
+
+        These exit codes signify that the ionic convergence thresholds were not met, but the output structure is usable,
+        so the solution is to simply restart from scratch but from the output structure.
+        """
+        self.ctx.restart_calc = None
+        self.ctx.inputs.structure = calculation.outputs.output_structure
+        action = 'no ionic convergence but clean shutdown: restarting from scratch but using output structure.'
+        self.report_error_handled(calculation, action)
+        return ProcessHandlerReport(True)
+
+    @process_handler(priority=560, exit_codes=[
         PwCalculation.exit_codes.ERROR_IONIC_CYCLE_BFGS_HISTORY_FAILURE,
         PwCalculation.exit_codes.ERROR_IONIC_CYCLE_BFGS_HISTORY_AND_FINAL_SCF_FAILURE,
         ])
-    def handle_relax_recoverable_ionic_convergence_error(self, calculation):
+    def handle_relax_recoverable_ionic_convergence_error_bfgs(self, calculation):
         """Handle various exit codes for recoverable `relax` calculations with failed ionic convergence.

         These exit codes signify that the ionic convergence thresholds were not met, but the output structure is usable,
         so the solution is to simply restart from scratch but from the output structure.
+        In this case, as the BFGS history did not work, we restart but we set the ion_dynamics (and cell_dynamics) to 'damp'
         """
         self.ctx.restart_calc = None
+        self.ctx.inputs.parameters.setdefault('IONS', {})['ion_dynamics'] = 'damp'
+        if self.ctx.inputs.parameters['CONTROL']['calculation'] == 'vc-relax':
+            self.ctx.inputs.parameters.setdefault('CELL', {})['cell_dynamics'] = 'damp-w'            
         self.ctx.inputs.structure = calculation.outputs.output_structure
         action = 'no ionic convergence but clean shutdown: restarting from scratch but using output structure.'
         self.report_error_handled(calculation, action)

(the diff is a bit messed up; in practice, I added a new handler).
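For readability, this is the new handler on its own, reconstructed from the diff above (the final return ProcessHandlerReport(True), presumably unchanged from the original handler, is cut off in the diff):

    @process_handler(priority=560, exit_codes=[
        PwCalculation.exit_codes.ERROR_IONIC_CYCLE_BFGS_HISTORY_FAILURE,
        PwCalculation.exit_codes.ERROR_IONIC_CYCLE_BFGS_HISTORY_AND_FINAL_SCF_FAILURE,
        ])
    def handle_relax_recoverable_ionic_convergence_error_bfgs(self, calculation):
        """Handle recoverable relax calculations whose ionic convergence failed because of the BFGS history.

        The ionic convergence thresholds were not met, but the output structure is usable, so restart from
        scratch using the output structure; since the BFGS history failed, set ion_dynamics (and, for
        vc-relax, cell_dynamics) to damped dynamics.
        """
        self.ctx.restart_calc = None
        self.ctx.inputs.parameters.setdefault('IONS', {})['ion_dynamics'] = 'damp'
        if self.ctx.inputs.parameters['CONTROL']['calculation'] == 'vc-relax':
            self.ctx.inputs.parameters.setdefault('CELL', {})['cell_dynamics'] = 'damp-w'
        self.ctx.inputs.structure = calculation.outputs.output_structure
        action = 'no ionic convergence but clean shutdown: restarting from scratch but using output structure.'
        self.report_error_handled(calculation, action)
        return ProcessHandlerReport(True)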

I have run ~14,000 relax calculations (on very similar systems at fixed volume, i.e. not vc-relax). I added this handler only after having run a few of them, which I then re-ran with the handler. If I check my whole DB with this code:

from collections import defaultdict
from aiida import orm

filters = {}
# RELAX_GROUP_LABEL="XXX"
#filters = {'label': RELAX_GROUP_LABEL}

qb = orm.QueryBuilder()
qb.append(orm.Group, filters=filters, tag='group')
qb.append(orm.WorkflowNode, tag='relax',  with_group='group')
qb.append(orm.WorkflowNode, tag='base',  with_incoming='relax', project='id') # It's enough without edge filters because of the type constraints
qb.append(orm.CalcJobNode, tag='calc',  with_incoming='base', filters={'attributes.exit_status': 520}, project='id') # It's enough without edge filters because of the type constraints
db_extras = qb.all()
print(db_extras)

res = defaultdict(list)
for parent_pk, child_pk in db_extras:
    res[parent_pk].append(child_pk)

res_multiple = {k: v for k, v in res.items() if len(v) > 1}

print(f"{len(res)} workflows")
print(f"{len(res_multiple)} workflows with more than 1 child with state 520: {res_multiple}")

I get 172 workflows, of which 96 have more than one child with exit status 520. The status of one of them looks like this:

$ verdi process status 191012
PwBaseWorkChain<191012> Finished [401] [7:results]
    ├── PwCalculation<191070> Finished [520]
    ├── PwCalculation<195895> Finished [520]
    ├── PwCalculation<198573> Finished [520]
    ├── PwCalculation<200630> Finished [520]
    └── PwCalculation<202111> Finished [520]

and indeed I can confirm that the problem keeps recurring at each restart, "forever", until the maximum number of attempts (5) kicks in:

(aiida-prod) pizzi@theospc36:~$ for i in 191070 195895 198573 200630 202111 ; do echo $i ; verdi calcjob outputcat $i | grep stop ; echo ; done
191070
     history already reset at previous step: stopping
     history already reset at previous step: stopping

195895
     history already reset at previous step: stopping
     history already reset at previous step: stopping

198573
     history already reset at previous step: stopping
     history already reset at previous step: stopping

200630
     history already reset at previous step: stopping
     history already reset at previous step: stopping

202111
     history already reset at previous step: stopping
     history already reset at previous step: stopping

(aiida-prod) pizzi@theospc36:~$ for i in 191070 195895 198573 200630 202111 ; do echo $i ; verdi calcjob outputcat $i | grep PWSCF | grep WALL ; echo ; done
191070
     PWSCF        :   8m25.44s CPU   9m 0.66s WALL

195895
     PWSCF        :   2m 2.43s CPU   2m11.57s WALL

198573
     PWSCF        :   2m 1.66s CPU   2m12.69s WALL

200630
     PWSCF        :   2m 4.31s CPU   2m14.40s WALL

202111
     PWSCF        :   1m58.34s CPU   2m 6.97s WALL

After adding the handler, I no longer get workflows with more than one subprocess with exit status 520. Indeed, in the group with the "final" calculations, I can run

from pprint import pprint

from aiida import orm
RELAX_GROUP_LABEL="XXX"
qb = orm.QueryBuilder()
qb.append(orm.Group, filters={'label': RELAX_GROUP_LABEL}, tag='group')
qb.append(orm.WorkflowNode, tag='relax',  with_group='group')
qb.append(orm.WorkflowNode, tag='base',  with_incoming='relax', project=['id', 'extras.considered_handlers']) # It's enough without edge filters because of the type constraints
db_extras = qb.all()

# Collect, per PwBaseWorkChain, the (handler_name, action) entries recorded for the new handler.
handled = []
for pk, considered in db_extras:
    if considered is not None:
        for this_cons in considered:  # one entry per work chain iteration
            for name, action in this_cons:
                if action is not None and name == 'handle_relax_recoverable_ionic_convergence_error_bfgs':
                    handled.append((pk, action))

pprint(handled)

and I get this:

[(203121, {'do_break': True, 'exit_status': 0}),
 (371557, {'do_break': True, 'exit_status': 0}),
 (372211, {'do_break': True, 'exit_status': 0}),
 (437317, {'do_break': True, 'exit_status': 0}),
 (536811, {'do_break': True, 'exit_status': 0}),
 (539983, {'do_break': True, 'exit_status': 0}),
 (540181, {'do_break': True, 'exit_status': 0}),
 (560856, {'do_break': True, 'exit_status': 0}),
 (553735, {'do_break': True, 'exit_status': 0}),
 (553863, {'do_break': True, 'exit_status': 0}),
 (553887, {'do_break': True, 'exit_status': 0}),
 (553872, {'do_break': True, 'exit_status': 0}),
 (635113, {'do_break': True, 'exit_status': 0}),
 (644572, {'do_break': True, 'exit_status': 0}),
 (814375, {'do_break': True, 'exit_status': 0}),
 (814378, {'do_break': True, 'exit_status': 0}),
 (916157, {'do_break': True, 'exit_status': 0}),
 (916163, {'do_break': True, 'exit_status': 0}),
 (916178, {'do_break': True, 'exit_status': 0}),
 (934783, {'do_break': True, 'exit_status': 0}),
 (934764, {'do_break': True, 'exit_status': 0}),
 (934780, {'do_break': True, 'exit_status': 0}),
 (952685, {'do_break': True, 'exit_status': 0}),
 (952673, {'do_break': True, 'exit_status': 0})]

They all look the same: the 520 failure is recovered with a 0 exit status at the next step.

For instance, in one case, verdi process report 203121 and the analysis of that workflow confirm that, as expected, the forces were reduced (in just one step) to below the threshold. There is a message that conv_thr might be too high, but I think this is because I'm using a very tight force threshold; this, in fact, might be why the new handler is needed in the first place.

@mbercx could you please reuse the code snippets above, and check if the same handler would work for you as well in a few examples? In particular, it would be good if you could test vc-relax cases: I only tested relaxes, and for vc-relax there are two possible damp algorithms, so it would be good to test if the damp-w is a good safe choice.

If you manage to test a few tens of systems and it works, could you also please open a PR to add the handler? Thanks!

bastonero commented 3 years ago

Hi Giovanni,

I finally found this issue! This is indeed what I experienced a lot when running vc-relax calculations. In my runs I saw that damped dynamics is indeed more robust, but also slower.

I think a solution to the BFGS issue could be to lower trust_radius_min, adding it to the input when restarting with, e.g., one (or half an) order of magnitude less. This is needed because, as you wrote, the algorithm continues to "pinball" back and forth around the minimum between the same points, which is likely because the steps are constrained to be too large. One could also set a lower limit below which this value should not go, and then switch to damped dynamics.

Moreover, the message "SCF correction compared to forces is large: reduce conv_thr to get better values" should also be dealt with by lowering conv_thr. Maybe with another handler? Even though it should not be critical.
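For illustration, "lowering conv_thr" amounts to tightening it in the ELECTRONS namelist of the parameters passed to the work chain (the values below are purely indicative; the pw.x defaults are conv_thr = 1.0e-6 Ry and forc_conv_thr = 1.0e-3 Ry/Bohr):

# Purely indicative values: with a very tight force threshold, tighten conv_thr as well,
# so that the SCF correction to the forces stays well below forc_conv_thr.
parameters = {
    'CONTROL': {
        'calculation': 'vc-relax',
        'forc_conv_thr': 1.0e-4,  # Ry/Bohr (pw.x default: 1.0e-3)
    },
    'ELECTRONS': {
        'conv_thr': 1.0e-12,  # Ry (pw.x default: 1.0e-6)
    },
}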

The options seem to be the following:

  1. lowering trust_radius_min

pros: maintains the BFGS algorithm, should be faster

cons: could remain stuck in the wrong minimum

  2. changing to damped dynamics

pros: easy and does not require changing any values

cons: could be too slow for the vc-relax case?

  3. mixed scheme, i.e. first lower the minimum trust radius and, once it gets too low, switch to damped dynamics (a rough sketch is given after this list)

pros: probably the most robust (?)

cons: could take too many steps (?), exceeding the default maximum of 5 iterations
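To make option 3 concrete, here is a rough sketch of what such a handler could look like (the handler name, the /3 shrinking factor and the 1.0e-5 floor are made up for illustration; the pw.x default for trust_radius_min is, if I remember correctly, 1.0e-3 Bohr):

    @process_handler(priority=560, exit_codes=[
        PwCalculation.exit_codes.ERROR_IONIC_CYCLE_BFGS_HISTORY_FAILURE,
        PwCalculation.exit_codes.ERROR_IONIC_CYCLE_BFGS_HISTORY_AND_FINAL_SCF_FAILURE,
        ])
    def handle_relax_bfgs_history_failure_mixed(self, calculation):
        """Hypothetical mixed scheme: first shrink trust_radius_min, then fall back to damped dynamics."""
        ions = self.ctx.inputs.parameters.setdefault('IONS', {})
        # Start from the current value, or from the pw.x default of 1.0e-3 Bohr if not set explicitly.
        current = ions.get('trust_radius_min', 1.0e-3)
        new_value = current / 3.0  # roughly half an order of magnitude per restart
        if new_value < 1.0e-5:
            # Illustrative floor: below it, give up on BFGS and switch to damped dynamics.
            ions['ion_dynamics'] = 'damp'
            if self.ctx.inputs.parameters['CONTROL']['calculation'] == 'vc-relax':
                self.ctx.inputs.parameters.setdefault('CELL', {})['cell_dynamics'] = 'damp-w'
            action = 'BFGS keeps resetting its history: switching to damped dynamics and restarting from the output structure.'
        else:
            ions['trust_radius_min'] = new_value
            action = f'BFGS keeps resetting its history: lowering trust_radius_min to {new_value:.1e} and restarting from the output structure.'
        self.ctx.restart_calc = None
        self.ctx.inputs.structure = calculation.outputs.output_structure
        self.report_error_handled(calculation, action)
        return ProcessHandlerReport(True)

The floor guarantees the trust radius is not shrunk indefinitely and that the work chain eventually falls back to damped dynamics.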

In any case, the conv_thr should be dealt with in some way, and some tests are needed as well.

mbercx commented 3 months ago

I believe this is fixed by https://github.com/aiidateam/aiida-quantumespresso/pull/985. Feel free to reopen in case I've missed something.