PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation

AndresOrtegaGuerrero commented 1 year ago

When a calculation doesn't reach convergence within the number used in the workchain (80 steps) it does a similar restart. Repeating the same parameters , this could be up to 4 times. Perhaps at a second stage (restart) the workchain should just be kill to avoid repeating unsuccessful calculations. On the other hand, should we consider if the system might need some extra steps? Should the restart try to include 50% more steps and if it doesn't success should it kill the job?

mbercx commented 1 year ago

Thanks for raising the issue @AndresOrtegaGuerrero! Note that in case the electronic convergence fails, the PwBaseWorkChain reduces the mixing_beta :

https://github.com/aiidateam/aiida-quantumespresso/blob/5d4a7d9405b757e2ecf65a72c1ee92aa2fb36a39/src/aiida_quantumespresso/workflows/pw/base.py#L559-L576

However, we have already discussed in the past that:

The current factor for reducing the mixing_beta is too high:

https://github.com/aiidateam/aiida-quantumespresso/blob/5d4a7d9405b757e2ecf65a72c1ee92aa2fb36a39/src/aiida_quantumespresso/workflows/pw/base.py#L32

Perhaps it only makes sense to try this once.

Increasing the number of electronic steps electron_maxstep might also be sensible, but perhaps only in case we see that the SCF is converging?

AndresOrtegaGuerrero commented 1 year ago

@mbercx I like the idea that it checks the output and it takes a decision upon the scf convergence and/change. But definitively to not do 4 times a restart that will end up in failure.

t-reents commented 1 year ago

Hi together! @mbercx Just out of curiosity, have you also discussed to include mixing_ndim in the past? I included this parameter in my own workchains on top of the plugin. It will be also increased in steps of 4 (up to a given maximum) in case the mixing_beta is already quite small. I think that I remember some calculations where it seems that this actually helped. Of course, increasing mixing_ndim also increases the memory usage so this might be the reason why this is not included per default. If I remember correctly, increasing mixing_ndim and decreasing mixing_beta might increase the necessary steps to converge and for that reason I also increase electron_maxstep in case mixing_ndim is increased.

mbercx commented 1 year ago

But definitively to not do 4 times a restart that will end up in failure.

Yeah, I've also done some tests for this. It's rare that it suddenly works after e.g. 2 restarts.

Just out of curiosity, have you also discussed to include mixing_ndim in the past?

Thanks @t-reents! I haven't played around with this setting much, to be honest, as the default (8) already seemed quite high? But it may be worth testing.

Another approach which we still have to fully integrate here is the "direct minimization" method implemented in SIRIUS by @ simonpintarelli:

https://github.com/simonpintarelli/nlcglib

I've tested the robustness of this alternative approach of finding the electronic ground state quite extensively in the past, and it is very promising.

AndresOrtegaGuerrero commented 1 year ago

@cpignedoli Could you share also your experience on these restarts ?

bastonero commented 12 months ago

Hello everyone, I am also interested in this discussion as I would like to improve the handler too. As I collected over ~3 years some data, I performed some "scientific" analysis on PwCalculation that finished correctly (exit status 0) and the one having issues with the electronic convergence (exit status 410).

In the following plot I report the counts for the slope of the convergence, computed using a linear fit (steps vs. log(estimated scf convergence)). convergence_statistics.pdf You can reproduce with the snippet below. It would be good to collect some more statistics, especially on the failed ones. My statistics tells that the failed run have slopes ~ -0.1 or greater, whereas -0.2 starts to get borderline, as you can see, while lower values mean convergence will be achieved.

TAKE-HOME MESSAGE: as a rule of thumb one can compute the slope, and if it is lower than -0.1/-0.2, then it is probably worth it to increase the steps, otherwise don't and try a different strategy (e.g. change mixing mode, mixing beta, or something else).

from aiida import load_profile
from aiida.orm import *
import numpy as np
import matplotlib.pyplot as plt

load_profile()

MAX_NODE_COUNT = 10000

def linear(x,*args):
    return args[0] +args[1]*x

def fit_value(kn, array, guess):
    from scipy.optimize import curve_fit
    params, cov = curve_fit(linear, kn, array, p0=[1, (array[0]-array[-1])/(kn[0]-kn[-1])])

    return params, cov

q = QueryBuilder()
q.append(CalcJobNode, filters={
    'attributes.exit_status': {'in':[0]},
    'attributes.process_state': 'finished',
    'attributes.process_label': 'PwCalculation',
})
print("Tot. PwCalculation: ", q.count())

slopes_ok = []

count = 0
for n in q.iterall():
    try:
        y = n[0].tools.get_scf_accuracy(0)
        x = np.array(list(range(len(y))))+1
        params, cov = fit_value(x, np.log(y), 0)
        slopes_ok.append(params[1])
    except:
        pass
    count+=1
    if count > MAX_NODE_COUNT:
        break

q = QueryBuilder()
q.append(CalcJobNode, filters={
    'attributes.exit_status': {'in':[410]},
    'attributes.process_state': 'finished',
    'attributes.process_label': 'PwCalculation',
})
print("Tot. PwCalculation: ", q.count())

slopes_fail = []

count = 0
for n in q.iterall():
    try:
        y = n[0].tools.get_scf_accuracy(0)
        x = np.array(list(range(len(y))))+1
        params, cov = fit_value(x, np.log(y), 0)
        slopes_fail.append(params[1])
    except:
        pass
    count+=1
    if count > MAX_NODE_COUNT:
        break

plt.hist(slopes_ok, bins=120, label='Converged')
plt.hist(slopes_fail, bins=120, label='Failed')
plt.xlabel('Number of SCF')
plt.ylabel('Convergence slope')
plt.legend()
plt.savefig('./convergence_statistics.pdf', dpi=300, transparent = True, pad_inches = .1, bbox_inches = 'tight')

mbercx commented 12 months ago

Nice, thanks @bastonero! Once I get to my computer I'll check for the MC3D runs, that should give us some more statistics ^^

mbercx commented 12 months ago

Statistics on the MC3D rSCF runs (first iterations (mixing_beta = 0.4), structures obtained directly from the databases, lanthanides have been avoided here):

I was also curious about the number of SCF steps required for convergence for the successful runs:

And had a closer look at the slope for those that needed more than 50 SCF steps

The vast majority of the structures that converge do so within 50 SCF steps. For those that don't, looking at the image above, they are very likely to still converge within the next 30 SCF steps if the slope is smaller than -0.1. Note that of course there may still be others that converge if we would have run with more steps.

So I would:

Limit the default number of maximum steps (electron_maxstep) to 50.
In case of SCF convergence failure, check the slope. If < -0.1, do a full restart (i.e. from wave functions/charge density; not sure if we need more steps then?).
Else reduce mixing_beta.

For (3), I would only try this maybe once or twice. We now start with 0.4, so maybe we can set delta_factor_mixing_beta to 0.5 and not let mixing_beta go below 0.1.

Script to generate images

```python #!/usr/bin/env python # -*- coding: utf-8 -*- from typing import Optional import matplotlib.pyplot as plt import numpy as np import typer from aiida import load_profile, orm from aiida_quantumespresso.calculations.pw import PwCalculation from rich import print from rich.progress import track from scipy.optimize import curve_fit import numpy as np load_profile("prod") app = typer.Typer() def get_slope_scipy(kn, array): """Perform linear fit using scipy's curve_fit function""" def linear(x,*args): return args[0] +args[1]*x params, _ = curve_fit(linear, kn, array, p0=[1, (array[0]-array[-1])/(kn[0]-kn[-1])]) return params[1] def get_slope_numpy(x, y): """Perform linear fit using numpy's polyfit function""" return np.polyfit(x, y, 1)[0] @app.command() def slopes(pwbase_group, min_nstep: Optional[int] = None, limit: int = None, method: str = "numpy"): """Sort the slopes for successful and failed calculations.""" query = orm.QueryBuilder() query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append( orm.WorkChainNode, with_group="group", tag="workchain" ).append( PwCalculation, tag="pw", edge_filters={"label": "iteration_01"}, filters={"attributes.exit_status": {"in": [0, 410]}}, project='*' ) if min_nstep is not None: query.append( orm.Dict, with_incoming="pw", edge_filters={"label": "output_parameters"}, filters={ "attributes.convergence_info.scf_conv.n_scf_steps": {">": min_nstep} }, ) if limit is not None: query.limit(limit) number = query.count() print(f"[bold blue]Report:[/] Number of completed pw.x: {number}") slopes_ok = [] slopes_fail = [] errors = [] for [pw_calculation] in track( query.iterall(), total=number, description="Checking slopes" ): try: y = pw_calculation.tools.get_scf_accuracy(0) x = np.arange(len(y)) if method == "numpy": slope = get_slope_numpy(x, np.log(y)) elif method == "scipy": slope = get_slope_scipy(x, np.log(y)) if pw_calculation.exit_status == 0: slopes_ok.append(slope) else: slopes_fail.append(slope) except Exception as e: errors.append((pw_calculation.pk, e)) plt.hist(slopes_ok, bins=120, label="Converged", alpha=0.6) plt.hist(slopes_fail, bins=120, label="Failed", alpha=0.6) plt.xlabel("Convergence slope") plt.ylabel("Number of SCF") plt.legend() if min_nstep is not None: filename = f"slopes_{min_nstep}.png" else: filename = "slopes.png" plt.savefig( filename, dpi=300, pad_inches=0.1, bbox_inches="tight" ) @app.command() def steps(pwbase_group, max_nstep: int = 50, limit: int = None): """How many steps are needed to converge?""" query = orm.QueryBuilder() query.append(orm.Group, filters={"label": pwbase_group}, tag="group").append( orm.WorkChainNode, with_group="group", tag="workchain" ).append( PwCalculation, tag="pw", edge_filters={"label": "iteration_01"}, filters={"attributes.exit_status": 0}, project="attributes.exit_status", ).append( orm.Dict, with_incoming="pw", edge_filters={"label": "output_parameters"}, project="attributes.convergence_info.scf_conv.n_scf_steps", ) if limit is not None: query.limit(limit) number = query.count() print(f"[bold blue]Report:[/] Number of completed pw.x: {number}") n_scf_list = [] for exit_status, n_scf_steps in track( query.iterall(), total=number, description="Gathering steps" ): if exit_status == 0: n_scf_list.append(n_scf_steps) plt.hist(n_scf_list, bins=40) plt.xlabel("Number of SCF Steps") plt.ylabel("Frequency") plt.axvline(x=max_nstep, color="red") struct_perc = round( sum([1 for i in n_scf_list if i <= max_nstep]) / number * 100, 1 ) plt.text( max_nstep, plt.ylim()[1] + 0.05 * (plt.ylim()[1] - plt.ylim()[0]), f"{struct_perc}% of structures", color="red", verticalalignment="top", horizontalalignment="right", ) plt.savefig( "./steps.png", dpi=300, pad_inches=0.1, bbox_inches="tight" ) if __name__ == "__main__": app() ```

bastonero commented 12 months ago

Thanks @mbercx, this is a great analysis! Indeed, -0.1 seems to set a good threshold. One thing to consider is that one can also extrapolate the number of max steps to reahc a certain conv_thr. From the fit, one has: log(conv_thr) = a + b*nsteps This means: nsteps = [ log(conv_thr) - a] / b In pratice, usually a = ~2 and conv_thr = 1.0e-P (P = 8~12). Hence: nsteps = -(P+2)/b ==> P=12: nsteps ~ -14 / b ==> b=-1: nsteps ~ 14 Which seems the vast majority of you cases. From here, one can then still extrapolate from the slop the remaining steps neeeded, and set a maximum step after which we don't think it's worth it. For example max_steps ~ 300. For b~=0.2 ==> nsteps ~ 70 So one can in still in principle take up to b~-0.05, for which nsteps ~300. But probably then it's more convenient to use an other strategy. For example:

Change mixing_mode: local-TF, or TF (the latter not sure when to use it`
Increase mixing_ndim: default is 8, but maybe increase to 12 or 20 (more memory usage, but nowadays this won't really be a problem for actual calculations on HPCs)

I think these two are the only resort in QE, a part from changing the mixing_beta. I experienced some times that increasing e.g. cutoff and kpoints helped really a lot. But shall we authorize ourself to touch these parameters?

mbercx commented 12 months ago

Thanks @bastonero! Adding some extrapolation is not a bad idea, will do so in https://github.com/aiidateam/aiida-quantumespresso/pull/987.

Re the next strategy, currently we only adapt mixing_beta. As mentioned above, I've never really played with mixing_ndim. Is one preferable over the other in your experience?

But shall we authorize ourself to touch these parameters?

To this I would say no, since the could substantially influence the results (and resource consumption ^^). Moreover, if someone is testing convergence etc one would definitely not want that.

As a final note, I've recently added a bunch of failures to the following repository:

https://github.com/mbercx/qe-issues

The plan was to give these to Baroni & co and have them make suggestions.

I'll definitely unleach our new strategy on these before merging the PR.
If you have any problematic cases to add, feel free to open a PR to the issues repo. I just selected ~100 ones for smaller systems for testing.

bastonero commented 12 months ago

mixing_ndim sometimes helped in conjunction with changing mixing_beta but also mixing_mode. Possibly one can try in sequence:

Change mixing_beta
Change mixing_mode (to local-TF)
Change again mixing_ndim and/or mixing_beta
As point 3.

This is at least what I think I would try to do. But proper testing on actual problematic cases is a more accurate solution for sure.

Awesome re the issues! I saw there a lot of magnetic cases there. I feel those are the trickiest for sure, but don't have a good solution (probably having a physical intuition for a good starting magnetization can definitely help; nevertheless, sometimes larger steps are needed before the slope decreases sensibly - but sometime it's just luck).

mbercx commented 12 months ago

But proper testing on actual problematic cases is a more accurate solution for sure.

Indeed. I'm trying to gather more lower-dimensional cases in https://github.com/mbercx/qe-issues. I think there was also a difficult structure set somewhere we can look into, but I forget where. Will try and remember. ^^

I feel those are the trickiest for sure, but don't have a good solution

So all those did succeed with the direct minimization implemented in nlcglib for SIRIUS. Another thing on the TODO list is to finally properly integrate SIRIUS with the plugin, but there are still some minor things here and there to resolve before that, e.g.:

https://github.com/electronic-structure/q-e-sirius/issues/45

bastonero commented 12 months ago

Yeah, this direct minimization approach sounds sweet !

bastonero commented 12 months ago

FYI from QE user guide:

Self-consistency is slow or does not converge at all Bad input data will often result in bad scf convergence. Please carefully check your structure first, e.g. using XCrySDen. Assuming that your input data is sensible :

Verify if your system is metallic or is close to a metallic state, especially if you have few k-points. If the highest occupied and lowest unoccupied state(s) keep exchanging place during self-consistency, forget about reaching convergence. A typical sign of such behavior is that the self-consistency error goes down, down, down, than all of a sudden up again, and so on. Usually one can solve the problem by adding a few empty bands and a small broadening.

Reduce mixing beta to ∼ 0.3 ÷ 0.1 or smaller. Try the mixing mode value that is more appropriate for your problem. For slab geometries used in surface problems or for elongated cells, mixing mode=’local-TF’ should be the better choice, dampening ”charge sloshing”. You may also try to increase mixing ndim to more than 8 (default value). Beware: this will increase the amount of memory you need.

Specific to USPP: the presence of negative charge density regions due to either the pseudization procedure of the augmentation part or to truncation at finite cutoff may give convergence problems. Raising the ecutrho cutoff for charge density will usually help.

ccigna commented 12 months ago

Hello all! Thanks a lot for your effort in this, it can be very helpful. Together with the already mentioned solution of changing the mixing_mode for anisotropic structure, and ndim, in my experience also changing the diagonalization to 'cg' can help a lot. It would be helpful in my opinion to understand what are the problems in 'davidson' algorithm in order to figure out in which cases would be worth to change to (slower) 'cg' .

aiidateam / aiida-quantumespresso

PwBaseCalculations number of electron_maxstep is fixed and repeat the same simulation #961