Open AndresOrtegaGuerrero opened 1 year ago
Thanks for raising the issue @AndresOrtegaGuerrero! Note that in case the electronic convergence fails, the PwBaseWorkChain reduces the mixing_beta:
However, we have already discussed in the past that:
[...] mixing_beta is too high: [...]
Increasing the number of electronic steps (electron_maxstep) might also be sensible, but perhaps only in case we see that the SCF is converging?
@mbercx I like the idea that it checks the output and takes a decision based on the SCF convergence and/or its change. But definitively to not do 4 times a restart that will end up in failure.
Hi all! @mbercx Just out of curiosity, have you also discussed to include mixing_ndim in the past? I included this parameter in my own workchains on top of the plugin. It is increased in steps of 4 (up to a given maximum) in case mixing_beta is already quite small. I think I remember some calculations where this actually seemed to help. Of course, increasing mixing_ndim also increases the memory usage, so this might be the reason why it is not included by default. If I remember correctly, increasing mixing_ndim and decreasing mixing_beta might increase the number of steps needed to converge, and for that reason I also increase electron_maxstep whenever mixing_ndim is increased.
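The escalation described above could be sketched roughly as follows. This is only an illustration: the helper name `bump_mixing_ndim`, the thresholds `beta_small` and `ndim_max`, and the factor applied to `electron_maxstep` are assumptions, not what the plugin or the workchains mentioned above actually use. The fallback values are the Quantum ESPRESSO defaults for the ELECTRONS namelist.

```python
def bump_mixing_ndim(electrons, beta_small=0.1, ndim_step=4, ndim_max=20, maxstep_factor=1.5):
    """Return an updated copy of the ELECTRONS namelist (hypothetical helper).

    `mixing_ndim` is only increased once `mixing_beta` is already small,
    and `electron_maxstep` is raised at the same time.
    """
    updated = dict(electrons)
    beta = updated.get('mixing_beta', 0.7)  # QE default
    ndim = updated.get('mixing_ndim', 8)    # QE default

    # Nothing to do while mixing_beta can still be reduced, or once the cap is reached.
    if beta > beta_small or ndim >= ndim_max:
        return updated

    updated['mixing_ndim'] = min(ndim + ndim_step, ndim_max)
    maxstep = updated.get('electron_maxstep', 100)  # QE default
    updated['electron_maxstep'] = int(round(maxstep * maxstep_factor))
    return updated


new_electrons = bump_mixing_ndim({'mixing_beta': 0.1, 'mixing_ndim': 8, 'electron_maxstep': 80})
```

The cap on `mixing_ndim` is there because of the memory cost mentioned above; the right value depends on the system and the machine.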
But definitively to not do 4 times a restart that will end up in failure.
Yeah, I've also done some tests for this. It's rare that it suddenly works after e.g. 2 restarts.
Just out of curiosity, have you also discussed to include mixing_ndim in the past?
Thanks @t-reents! I haven't played around with this setting much, to be honest, as the default (8) already seemed quite high? But it may be worth testing.
Another approach which we still have to fully integrate here is the "direct minimization" method implemented in SIRIUS by @simonpintarelli:
https://github.com/simonpintarelli/nlcglib
I've tested the robustness of this alternative approach of finding the electronic ground state quite extensively in the past, and it is very promising.
@cpignedoli Could you share also your experience on these restarts ?
Hello everyone, I am also interested in this discussion as I would like to improve the handler too.
As I have collected some data over ~3 years, I performed some "scientific" analysis on the PwCalculations that finished correctly (exit status 0) and on the ones that had issues with the electronic convergence (exit status 410).
In the following plot (convergence_statistics.pdf) I report the counts for the slope of the convergence, computed using a linear fit (steps vs. log(estimated scf convergence)). You can reproduce it with the snippet below. It would be good to collect some more statistics, especially on the failed ones. My statistics tell that the failed runs have slopes of about -0.1 or greater, whereas -0.2 starts to get borderline, as you can see, while lower values mean convergence will be achieved.
TAKE-HOME MESSAGE: as a rule of thumb one can compute the slope, and if it is lower than -0.1/-0.2, then it is probably worth it to increase the steps, otherwise don't and try a different strategy (e.g. change mixing mode, mixing beta, or something else).
from aiida import load_profile
from aiida.orm import CalcJobNode, QueryBuilder
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

load_profile()

MAX_NODE_COUNT = 10000


def linear(x, *args):
    """Straight line: args[0] is the intercept, args[1] the slope."""
    return args[0] + args[1] * x


def fit_value(kn, array):
    """Fit `array` vs. `kn` with a line, guessing the slope from the end points."""
    guess = [1, (array[0] - array[-1]) / (kn[0] - kn[-1])]
    params, cov = curve_fit(linear, kn, array, p0=guess)
    return params, cov


def collect_slopes(exit_status):
    """Return the fitted convergence slopes of the PwCalculations with this exit status."""
    q = QueryBuilder()
    q.append(CalcJobNode, filters={
        'attributes.exit_status': {'in': [exit_status]},
        'attributes.process_state': 'finished',
        'attributes.process_label': 'PwCalculation',
    })
    print(f"Tot. PwCalculation (exit status {exit_status}): ", q.count())
    slopes = []
    count = 0
    for n in q.iterall():
        try:
            y = n[0].tools.get_scf_accuracy(0)
            x = np.arange(len(y)) + 1
            params, cov = fit_value(x, np.log(y))
            slopes.append(params[1])
        except Exception:
            # Skip nodes without usable SCF accuracy data or with a failed fit.
            pass
        count += 1
        if count > MAX_NODE_COUNT:
            break
    return slopes


slopes_ok = collect_slopes(0)      # finished correctly
slopes_fail = collect_slopes(410)  # electronic convergence not reached

# Histogram of the slopes: the slope is on the x axis, the counts on the y axis.
plt.hist(slopes_ok, bins=120, label='Converged')
plt.hist(slopes_fail, bins=120, label='Failed')
plt.xlabel('Convergence slope')
plt.ylabel('Counts')
plt.legend()
plt.savefig('./convergence_statistics.pdf', dpi=300, transparent=True, pad_inches=.1, bbox_inches='tight')
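For trying the rule of thumb without an AiiDA profile, here is a minimal sketch on synthetic data. It swaps `curve_fit` for `numpy.polyfit` (an ordinary least-squares line, so the slope should be equivalent for this purpose), and the names `scf_slope` and `worth_more_steps` as well as the two artificial accuracy traces are made up for illustration. The -0.1 threshold is the one suggested above.

```python
import numpy as np


def scf_slope(scf_accuracy):
    """Slope of a linear fit of ln(estimated SCF accuracy) vs. SCF step number."""
    y = np.log(np.asarray(scf_accuracy, dtype=float))
    x = np.arange(1, len(y) + 1)
    slope, _intercept = np.polyfit(x, y, 1)  # coefficients come highest degree first
    return slope


def worth_more_steps(scf_accuracy, threshold=-0.1):
    """True if the slope is steep enough that more SCF steps are likely to pay off."""
    return scf_slope(scf_accuracy) < threshold


# Synthetic traces (not real data): one converging quickly, one barely moving.
steps = np.arange(1, 51)
fast = np.exp(2.0 - 0.5 * steps)
slow = np.exp(2.0 - 0.02 * steps)

print(scf_slope(fast), worth_more_steps(fast))
print(scf_slope(slow), worth_more_steps(slow))
```

On real runs the trace is rarely this clean, so fitting only the last part of the SCF cycle may be more representative; that choice is left open here.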
Nice, thanks @bastonero! Once I get to my computer I'll check for the MC3D runs, that should give us some more statistics ^^
Statistics on the MC3D rSCF runs (first iterations (mixing_beta = 0.4), structures obtained directly from the databases, lanthanides have been avoided here):
I was also curious about the number of SCF steps required for convergence for the successful runs:
And had a closer look at the slope for those that needed more than 50 SCF steps
The vast majority of the structures that converge do so within 50 SCF steps. For those that don't, looking at the image above, they are very likely to still converge within the next 30 SCF steps if the slope is smaller than -0.1. Note that of course there may still be others that would have converged if we had run with more steps.
So I would:
[...] (electron_maxstep) to 50.
[...] mixing_beta.
For (3), I would only try this maybe once or twice. We now start with 0.4, so maybe we can set delta_factor_mixing_beta to 0.5 and not let mixing_beta go below 0.1.
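To see what these numbers imply, here is a small sketch of the resulting schedule. The function name `next_mixing_beta` and the `beta_min` argument are hypothetical; only `delta_factor_mixing_beta` = 0.5, the starting value 0.4 and the floor 0.1 come from the suggestion above.

```python
def next_mixing_beta(mixing_beta, delta_factor=0.5, beta_min=0.1):
    """Return the reduced mixing_beta, or None once it would drop below beta_min."""
    new_beta = mixing_beta * delta_factor
    if new_beta < beta_min - 1e-12:  # small tolerance against floating-point noise
        return None
    return new_beta


# Walk through the schedule starting from the current default of 0.4.
schedule = []
beta = 0.4
while beta is not None:
    schedule.append(beta)
    beta = next_mixing_beta(beta)

print(schedule)
```

With these values the schedule contains three entries (0.4, 0.2, 0.1), i.e. exactly two reductions before another strategy is needed, which matches "once or twice".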
Thanks @mbercx, this is a great analysis! Indeed, -0.1 seems to set a good threshold.
One thing to consider is that one can also extrapolate the maximum number of steps needed to reach a certain conv_thr.
From the fit, one has:
log(conv_thr) = a + b*nsteps
This means:
nsteps = [log(conv_thr) - a] / b
In practice, usually a ~ 2 and conv_thr = 1.0e-P (P = 8~12). Hence:
nsteps = -(P+2)/b ==> P=12: nsteps ~ -14 / b ==> b=-1: nsteps ~ 14
which seems to cover the vast majority of your cases. From here, one can still extrapolate from the slope the remaining steps needed, and set a maximum number of steps after which we don't think it's worth it, for example max_steps ~ 300.
For b ~ -0.2 ==> nsteps ~ 70
So one can still in principle accept slopes up to b ~ -0.05, for which nsteps ~ 300. But then it's probably more convenient to use another strategy.
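The extrapolation can be written down in a few lines. This is a sketch under stated assumptions: the function name `extrapolate_nsteps` and the `max_steps` cut-off argument are illustrative, and the fit is taken in base 10, which is what the shortcut nsteps = -(P+2)/b implies. The snippet earlier in this thread fits the natural logarithm, so slopes from it would have to be divided by ln(10) ~ 2.303 before being used here.

```python
import math


def extrapolate_nsteps(intercept, slope, conv_thr, max_steps=300):
    """Estimate the total SCF steps needed to reach conv_thr.

    Assumes the fit log10(accuracy) = intercept + slope * nsteps.
    Returns None if the SCF is not converging or would need more than max_steps.
    """
    if slope >= 0:
        return None
    raw = (math.log10(conv_thr) - intercept) / slope
    nsteps = math.ceil(raw - 1e-9)  # small tolerance against floating-point noise
    return nsteps if nsteps <= max_steps else None


for b in (-1.0, -0.2, -0.05, -0.04):
    print(b, extrapolate_nsteps(2.0, b, 1e-12))
```

With a = 2 and conv_thr = 1e-12 this reproduces the estimates above (14 steps for b = -1, 70 for b = -0.2) and gives up for slopes that would need more than 300 steps.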
For example:
- mixing_mode: local-TF, or TF (the latter not sure when to use it)
- mixing_ndim: default is 8, but maybe increase to 12 or 20 (more memory usage, but nowadays this won't really be a problem for actual calculations on HPCs)
I think these two are the only resort in QE, apart from changing the mixing_beta. I experienced sometimes that increasing e.g. cutoff and kpoints helped really a lot. But shall we authorize ourself to touch these parameters?
Thanks @bastonero! Adding some extrapolation is not a bad idea, will do so in https://github.com/aiidateam/aiida-quantumespresso/pull/987.
Re the next strategy, currently we only adapt mixing_beta. As mentioned above, I've never really played with mixing_ndim. Is one preferable over the other in your experience?
But shall we authorize ourself to touch these parameters?
To this I would say no, since they could substantially influence the results (and resource consumption ^^). Moreover, if someone is testing convergence etc., one would definitely not want that.
As a final note, I've recently added a bunch of failures to the following repository:
https://github.com/mbercx/qe-issues
The plan was to give these to Baroni & co and have them make suggestions.
mixing_ndim sometimes helped in conjunction with changing mixing_beta but also mixing_mode. Possibly one can try in sequence:
1. mixing_beta
2. mixing_mode (to local-TF)
3. mixing_ndim and/or mixing_beta
This is at least what I think I would try to do. But proper testing on actual problematic cases is a more accurate solution for sure.
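As a rough illustration of that sequence, the sketch below picks the next thing to try from the current ELECTRONS parameters. Everything about it is an assumption made for the example: the function name `next_strategy`, the halving of `mixing_beta`, the floor of 0.1 and the cap of 20 on `mixing_ndim`. It is not how the plugin's handler works.

```python
def next_strategy(electrons, beta_min=0.1, ndim_max=20):
    """Return (label, updated ELECTRONS dict); label is None when nothing is left to try."""
    updated = dict(electrons)

    # 1. Reduce mixing_beta while it stays at or above the floor.
    beta = updated.get('mixing_beta', 0.7)  # QE default
    if beta * 0.5 >= beta_min:
        updated['mixing_beta'] = beta * 0.5
        return 'reduce mixing_beta', updated

    # 2. Switch mixing_mode to local-TF.
    if updated.get('mixing_mode', 'plain') != 'local-TF':  # 'plain' is the QE default
        updated['mixing_mode'] = 'local-TF'
        return 'switch mixing_mode', updated

    # 3. Increase mixing_ndim up to the cap.
    ndim = updated.get('mixing_ndim', 8)  # QE default
    if ndim < ndim_max:
        updated['mixing_ndim'] = min(ndim + 4, ndim_max)
        return 'increase mixing_ndim', updated

    return None, updated


# Walk through the whole sequence starting from mixing_beta = 0.4.
params = {'mixing_beta': 0.4}
labels = []
while True:
    label, params = next_strategy(params)
    if label is None:
        break
    labels.append(label)

print(labels)
```

Keeping the decision in one function makes it easy to reorder the steps when testing on actual problematic cases.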
Awesome re the issues! I saw a lot of magnetic cases there. I feel those are the trickiest for sure, but don't have a good solution (probably having a physical intuition for a good starting magnetization can definitely help; nevertheless, sometimes more steps are needed before the slope decreases noticeably - but sometimes it's just luck).
But proper testing on actual problematic cases is a more accurate solution for sure.
Indeed. I'm trying to gather more lower-dimensional cases in https://github.com/mbercx/qe-issues. I think there was also a difficult structure set somewhere we can look into, but I forget where. Will try and remember. ^^
I feel those are the trickiest for sure, but don't have a good solution
So all those did succeed with the direct minimization implemented in nlcglib for SIRIUS. Another thing on the TODO list is to finally properly integrate SIRIUS with the plugin, but there are still some minor things here and there to resolve before that, e.g.:
https://github.com/electronic-structure/q-e-sirius/issues/45
Yeah, this direct minimization approach sounds sweet !
FYI from QE user guide:
Self-consistency is slow or does not converge at all
Bad input data will often result in bad scf convergence. Please carefully check your structure first, e.g. using XCrySDen. Assuming that your input data is sensible:
- Verify if your system is metallic or is close to a metallic state, especially if you have few k-points. If the highest occupied and lowest unoccupied state(s) keep exchanging place during self-consistency, forget about reaching convergence. A typical sign of such behavior is that the self-consistency error goes down, down, down, than all of a sudden up again, and so on. Usually one can solve the problem by adding a few empty bands and a small broadening.
- Reduce mixing_beta to ∼ 0.3 ÷ 0.1 or smaller. Try the mixing_mode value that is more appropriate for your problem. For slab geometries used in surface problems or for elongated cells, mixing_mode='local-TF' should be the better choice, dampening "charge sloshing". You may also try to increase mixing_ndim to more than 8 (default value). Beware: this will increase the amount of memory you need.
- Specific to USPP: the presence of negative charge density regions due to either the pseudization procedure of the augmentation part or to truncation at finite cutoff may give convergence problems. Raising the ecutrho cutoff for charge density will usually help.
Hello all! Thanks a lot for your effort on this, it can be very helpful. Together with the already mentioned solutions of changing mixing_mode for anisotropic structures and increasing mixing_ndim, in my experience changing the diagonalization to 'cg' can also help a lot. In my opinion it would be helpful to understand what the problems with the 'davidson' algorithm are, in order to figure out in which cases it would be worth changing to the (slower) 'cg'.
When a calculation doesn't reach convergence within the number of steps used in the workchain (80 steps), it does a similar restart, repeating the same parameters; this can happen up to 4 times. Perhaps at a second stage (restart) the workchain should just be killed to avoid repeating unsuccessful calculations. On the other hand, should we consider whether the system might need some extra steps? Should the restart try to include 50% more steps, and if it doesn't succeed, should it kill the job?