Closed giovannipizzi closed 3 months ago
Hi Giovanni,
I finally found this issue! This is indeed what I experienced a lot when vc-relaxing. Running my calculations, I saw that damped dynamics is indeed more robust, but slow as well.
I think that a solution to the bfgs issue could be to lower trust_radius_min, adding it to the input when restarting with a value e.g. one (or half an) order of magnitude smaller. This is necessary since, as you wrote, the algorithm continues to "pinball" back and forth around the minimum at the same points, but this should be due to the fact that the steps are constrained to be "too large". One could also set a lower limit below which this value should not go, and then switch to damped dynamics.
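To make the idea concrete, here is a minimal sketch in plain Python (not an actual handler). The function name, the floor value and the assumed default of 1.0e-3 for trust_radius_min are my own assumptions for illustration and should be checked against the pw.x documentation; `parameters` is just a nested dict mirroring the pw.x namelists.

```python
import copy

# Assumed values for illustration only; check them against the pw.x documentation.
DEFAULT_TRUST_RADIUS_MIN = 1.0e-3
TRUST_RADIUS_FLOOR = 1.0e-5  # hypothetical lower limit


def lower_trust_radius_min(parameters, factor=0.1, floor=TRUST_RADIUS_FLOOR):
    """Return new parameters for a restart after the BFGS history-reset failure.

    While the reduced ``trust_radius_min`` stays at or above ``floor``, keep
    BFGS and only shrink the minimum step. Once it would drop below the floor,
    switch the ionic minimiser to damped dynamics instead.
    """
    new = copy.deepcopy(parameters)  # never modify the caller's inputs
    ions = new.setdefault('IONS', {})
    current = ions.get('trust_radius_min', DEFAULT_TRUST_RADIUS_MIN)
    reduced = current * factor

    if reduced >= floor:
        ions['trust_radius_min'] = reduced
    else:
        ions.pop('trust_radius_min', None)  # keyword only used by BFGS
        ions['ion_dynamics'] = 'damp'
    return new
```

With `factor=0.1` each restart lowers the value by one order of magnitude; something like `factor=0.3` would correspond roughly to the "half an order of magnitude" variant.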
Moreover, the message "SCF correction compared to forces is large: reduce conv_thr to get better values" should also be dealt with by lowering conv_thr. Maybe with another handler? Even though it should not be critical.
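As a rough sketch of what such a separate handler could do (plain Python, not tied to the plugin API): the function names, the reduction factor and the assumed default conv_thr of 1.0e-6 are illustrative assumptions, not existing code.

```python
import copy

# Substring of the pw.x warning quoted above.
SCF_WARNING = 'SCF correction compared to forces is large'


def needs_tighter_conv_thr(stdout):
    """Return True if the pw.x output contains the SCF-correction warning."""
    return SCF_WARNING in stdout


def tighten_conv_thr(parameters, factor=0.1, default=1.0e-6):
    """Return a copy of ``parameters`` with ELECTRONS.conv_thr scaled by ``factor``.

    ``default`` is the value assumed when conv_thr is not set explicitly
    (an assumption; check the pw.x documentation).
    """
    new = copy.deepcopy(parameters)
    electrons = new.setdefault('ELECTRONS', {})
    electrons['conv_thr'] = electrons.get('conv_thr', default) * factor
    return new
```

Since the warning is not critical, a handler like this would probably only adjust the inputs for a restart that is happening anyway, rather than trigger a restart by itself.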
The options seem to be the following:
pros: maintains bfgs algorithm, should be faster
cons: could remain stuck in the wrong minimum
pros: easy and does not require changing any values
cons: could be too slow for the vc-relax case?
pros: probably the most robust (?)
cons: could take too many steps (?), exceeding the default 5 iterations max
Anyway, conv_thr should be dealt with in some way, and some tests are needed as well.
I believe this is fixed by https://github.com/aiidateam/aiida-quantumespresso/pull/985. Feel free to reopen in case I've missed something.
While running many PwRelax workflows, I encountered a number of 520 errors from the PwCalculation (handle_relax_recoverable_ionic_convergence_error_bfgs, which arises when the output contains "history already reset at previous step: stopping" from the BFGS algorithm).
From my understanding, this happens when we are already quite close to the energy minimum and the "noise" on the gradient is too large, so the algorithm doesn't know where to move and stops. In practice, in most cases we have essentially already reached the minimum.
A practical solution in these cases would be to switch to damped dynamics, which, while slower in general, is more robust and works fine in this case.
I've therefore implemented the following handler, whose diff is this:
(the diff is a bit messed up, in practice I added a new handler).
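Independently of the actual diff, the logic of such a handler can be sketched in plain Python. This is not the aiida-quantumespresso implementation: the function name and the nested-dict layout of `parameters` are assumptions for illustration, and the use of 'damp-w' for vc-relax is only a placeholder choice that still needs testing.

```python
import copy

# Message printed by the BFGS algorithm when it gives up.
BFGS_HISTORY_MESSAGE = 'history already reset at previous step: stopping'


def switch_to_damped_dynamics(parameters, stdout):
    """Return updated parameters for a restart, or None if nothing applies.

    The switch is made only if the BFGS history-reset message is present in
    ``stdout`` and the calculation is a relax or vc-relax.
    """
    if BFGS_HISTORY_MESSAGE not in stdout:
        return None

    calculation = parameters.get('CONTROL', {}).get('calculation', 'scf')
    if calculation not in ('relax', 'vc-relax'):
        return None

    new = copy.deepcopy(parameters)  # leave the original inputs untouched
    new.setdefault('IONS', {})['ion_dynamics'] = 'damp'
    if calculation == 'vc-relax':
        # Assumption: 'damp-w' is a placeholder; whether it is a safe
        # choice for vc-relax has not been tested.
        new.setdefault('CELL', {})['cell_dynamics'] = 'damp-w'
    return new
```

Returning None when the message is absent keeps the decision "does this handler apply?" separate from the decision "what should the restart look like?".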
I have run ~14'000 relaxes (on very similar systems with fixed volume, i.e. not a vc-relax). I added this handler only after having run a few of them, which I then re-ran with the handler. If I check in my whole DB with this code:
I get 172 workflows, 96 workflows with more than 1 child with state 520. If I get the status of one of them this looks like this:
and indeed I can confirm the problem continues to occur at each restart, "forever", until the max number of attempts (5) kicks in:
After adding the handler, I no longer get workflows with more than one subprocess with exit status 520. Indeed, in the group with the "final" calculations, I can run
and I get this:
They all look the same, where the 520 failure is recovered with a 0 exit status at the next step.
For instance in one case, from
!verdi process report 203121
I get that:
And the analysis of this workflow reveals that, as expected:
the restarted calculation has ion_dynamics = 'damp' in the input
the BFGS run ended with "history already reset at previous step: stopping", P= 1496.54, Total force = 0.000673, Final energy = -2197.4265210269 Ry (and it ran for 10 minutes). Note that it's OK to have a large pressure since these are runs at constrained volumes in my case
the damped run ended with "Damped Dynamics: convergence achieved in 1 steps", P= 1496.55, Total force = 0.000064 (it ran for 1m20s).
So in conclusion, the forces were reduced (in just one step) to go below the threshold. There is a message that the conv_thr might be high, but I think this is because I'm using a very tight force threshold; this might actually be the reason why this new handler is needed.
@mbercx could you please reuse the code snippets above, and check if the same handler would work for you as well in a few examples? In particular, it would be good if you could test vc-relax cases: I only tested relaxes, and for vc-relax there are two possible damp algorithms, so it would be good to test whether damp-w is a good, safe choice.
If you manage to test some tens of systems and it works, could you also please make a PR to add the handler? Thanks!