Open ezpzbz opened 4 years ago
Other things that I have noticed help with this specific issue are increasing NELMIN or, indeed, decreasing EDIFF as suggested by the error message.
This is a nice situation @alexsquires.
As far as I remember from the AiiDA Hackathon and the concept of error handlers, we can have several solutions with different priorities for the very same issue. I think once we move to using the BaseRestartWorkChain of aiida-core, as mentioned in #6, we may arrange a meeting to brainstorm our strategy for including handlers.
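To make the "several solutions with different priorities" idea concrete, here is a minimal standalone sketch in plain Python. It is not the plugin's code: the `Handler` class, the two fix functions and the priority values are hypothetical and only illustrate the ordering logic. As far as I know, aiida-core's `process_handler` decorator takes a `priority` argument for the same purpose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Incar = Dict[str, object]


@dataclass
class Handler:
    """Hypothetical handler record: a name, a priority and a fix function."""
    name: str
    priority: int  # higher priority is tried first
    fix: Callable[[Incar], Optional[Incar]]  # returns None when not applicable


def decrease_ediff(incar: Incar) -> Optional[Incar]:
    # Tighten the electronic convergence criterion by one order of magnitude.
    return {**incar, "EDIFF": float(incar.get("EDIFF", 1e-4)) * 0.1}


def switch_to_ibrion_1(incar: Incar) -> Optional[Incar]:
    # Not applicable if the calculation already uses IBRION=1.
    if incar.get("IBRION") == 1:
        return None
    return {**incar, "IBRION": 1}


# Illustrative priorities only; which fix should come first is the open question.
HANDLERS: List[Handler] = [
    Handler("switch_to_ibrion_1", priority=100, fix=switch_to_ibrion_1),
    Handler("decrease_ediff", priority=200, fix=decrease_ediff),
]


def first_applicable_fix(handlers: List[Handler], incar: Incar) -> Tuple[Optional[str], Incar]:
    """Try handlers from highest to lowest priority and apply the first that fits."""
    for handler in sorted(handlers, key=lambda h: h.priority, reverse=True):
        new_incar = handler.fix(incar)
        if new_incar is not None:
            return handler.name, new_incar
    return None, incar
```

Swapping the two priority values is enough to change which fix is attempted first for the same error.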
I have added plenty of handlers from custodian to the BaseRestartWorkChain. It should be handled now or, if not, implementing the handler is not an issue anymore.
It falls in the category of errors that result in an incomplete vasprun.xml, as explained in #25.
I'm reopening until it is fixed!
Apparently, I hit this issue with proper logs.
The initial settings, i.e. IBRION=2 and EDIFF=1E-06, produced the error. Below is the convergence log of the first calculation:
1 Energy: -19.834370 Log|dE|: 1.297 SCF: 13 Avg|F|: 2.237 Max|F|: 4.472 Vol.: 70.5 Mag: 7.33 Time: 0.67m
2 Energy: -17.801348 Log|dE|: 0.308 SCF: 200 Avg|F|: 1.300 Max|F|: 2.490 Vol.: 109.2 Mag: 4.00 Time: 9.98m
3 Energy: -18.894443 Log|dE|: 0.039 SCF: 200 Avg|F|: 0.667 Max|F|: 1.207 Vol.: 94.1 Mag: 4.00 Time: 11.69m
4 Energy: -19.095347 Log|dE|: -0.697 SCF: 200 Avg|F|: 0.410 Max|F|: 0.578 Vol.: 88.9 Mag: 4.00 Time: 10.05m
5 Energy: -19.424230 Log|dE|: -0.483 SCF: 200 Avg|F|: 0.376 Max|F|: 0.440 Vol.: 85.0 Mag: 4.00 Time: 10.16m
6 Energy: -19.482202 Log|dE|: -1.237 SCF: 179 Avg|F|: 0.517 Max|F|: 0.717 Vol.: 85.9 Mag: 4.00 Time: 9.09m
7 Energy: -19.597861 Log|dE|: -0.937 SCF: 200 Avg|F|: 0.379 Max|F|: 0.541 Vol.: 85.3 Mag: 3.92 Time: 10.15m
8 Energy: -19.629745 Log|dE|: -1.496 SCF: 85 Avg|F|: 0.292 Max|F|: 0.469 Vol.: 85.1 Mag: 3.91 Time: 4.31m
9 Energy: -19.638833 Log|dE|: -2.042 SCF: 200 Avg|F|: 0.225 Max|F|: 0.424 Vol.: 85.0 Mag: 3.86 Time: 10.14m
10 Energy: -19.649774 Log|dE|: -1.961 SCF: 200 Avg|F|: 0.238 Max|F|: 0.460 Vol.: 85.0 Mag: 3.90 Time: 12.31m
11 Energy: -19.647889 Log|dE|: -2.725 SCF: 200 Avg|F|: 0.214 Max|F|: 0.383 Vol.: 85.0 Mag: 3.99 Time: 10.18m
12 Energy: -19.642445 Log|dE|: -2.264 SCF: 200 Avg|F|: 0.201 Max|F|: 0.342 Vol.: 85.0 Mag: 3.95 Time: 10.09m
13 Energy: -19.649912 Log|dE|: -2.127 SCF: 200 Avg|F|: 0.221 Max|F|: 0.394 Vol.: 85.0 Mag: 3.97 Time: 10.04m
14 Energy: -19.648383 Log|dE|: -2.815 SCF: 126 Avg|F|: 0.225 Max|F|: 0.405 Vol.: 85.0 Mag: 3.94 Time: 6.34m
15 Energy: -19.650251 Log|dE|: -2.728 SCF: 169 Avg|F|: 0.229 Max|F|: 0.417 Vol.: 85.0 Mag: 3.94 Time: 10.97m
16 Energy: -19.804159 Log|dE|: -0.813 SCF: 126 Avg|F|: 1.788 Max|F|: 3.548 Vol.: 70.5 Mag: 6.24 Time: 6.74m
Then, our handler decreased EDIFF to 1E-07 and re-ran the calculation, which failed with the following convergence log:
1 Energy: -19.834371 Log|dE|: 1.297 SCF: 14 Avg|F|: 2.236 Max|F|: 4.471 Vol.: 70.5 Mag: 7.33 Time: 0.74m
2 Energy: -17.794268 Log|dE|: 0.310 SCF: 140 Avg|F|: 1.238 Max|F|: 2.473 Vol.: 109.2 Mag: 4.00 Time: 7.21m
3 Energy: -18.887733 Log|dE|: 0.039 SCF: 23 Avg|F|: 0.606 Max|F|: 1.207 Vol.: 94.2 Mag: 4.00 Time: 1.17m
4 Energy: -19.077956 Log|dE|: -0.721 SCF: 200 Avg|F|: 0.283 Max|F|: 0.564 Vol.: 89.0 Mag: 4.00 Time: 10.35m
5 Energy: -19.122744 Log|dE|: -1.349 SCF: 200 Avg|F|: 0.068 Max|F|: 0.129 Vol.: 86.0 Mag: 4.00 Time: 10.45m
6 Energy: -19.138645 Log|dE|: -1.799 SCF: 200 Avg|F|: 0.245 Max|F|: 0.425 Vol.: 87.4 Mag: 4.00 Time: 10.47m
7 Energy: -19.171989 Log|dE|: -1.477 SCF: 200 Avg|F|: 0.263 Max|F|: 0.296 Vol.: 86.9 Mag: 4.00 Time: 10.41m
8 Energy: -19.499785 Log|dE|: -0.484 SCF: 200 Avg|F|: 0.639 Max|F|: 0.698 Vol.: 87.2 Mag: 4.00 Time: 10.41m
9 Energy: -19.524102 Log|dE|: -1.614 SCF: 200 Avg|F|: 0.527 Max|F|: 0.643 Vol.: 87.0 Mag: 4.00 Time: 10.41m
10 Energy: -19.540578 Log|dE|: -1.783 SCF: 200 Avg|F|: 0.449 Max|F|: 0.590 Vol.: 86.9 Mag: 4.00 Time: 10.42m
11 Energy: -19.584606 Log|dE|: -1.356 SCF: 200 Avg|F|: 0.217 Max|F|: 0.411 Vol.: 86.9 Mag: 4.00 Time: 10.40m
12 Energy: -19.588197 Log|dE|: -2.445 SCF: 36 Avg|F|: 0.210 Max|F|: 0.405 Vol.: 86.9 Mag: 4.00 Time: 1.90m
13 Energy: -19.588436 Log|dE|: -3.622 SCF: 39 Avg|F|: 0.208 Max|F|: 0.408 Vol.: 86.9 Mag: 4.00 Time: 1.97m
14 Energy: -19.588499 Log|dE|: -4.204 SCF: 11 Avg|F|: 0.208 Max|F|: 0.409 Vol.: 86.9 Mag: 4.00 Time: 0.56m
15 Energy: -19.588569 Log|dE|: -4.152 SCF: 71 Avg|F|: 0.200 Max|F|: 0.398 Vol.: 86.9 Mag: 4.00 Time: 3.66m
16 Energy: -19.588517 Log|dE|: -4.288 SCF: 10 Avg|F|: 0.202 Max|F|: 0.402 Vol.: 86.9 Mag: 4.00 Time: 0.50m
17 Energy: -19.588518 Log|dE|: -6.244 SCF: 3 Avg|F|: 0.202 Max|F|: 0.402 Vol.: 86.9 Mag: 4.00 Time: 0.14m
18 Energy: -19.588518 Log|dE|: -6.523 SCF: 3 Avg|F|: 0.202 Max|F|: 0.402 Vol.: 86.9 Mag: 4.00 Time: 0.13m
19 Energy: -19.588519 Log|dE|: -6.569 SCF: 3 Avg|F|: 0.202 Max|F|: 0.402 Vol.: 86.9 Mag: 4.00 Time: 0.13m
20 Energy: -19.804371 Log|dE|: -0.666 SCF: 36 Avg|F|: 1.800 Max|F|: 3.599 Vol.: 70.5 Mag: 6.31 Time: 1.95m
As explained in #35, it then repeated the same calculation another four times with the same outcome.
I manually tested this case by setting IBRION=1 and keeping EDIFF at 1E-06. It worked; look at the log:
1 Energy: -19.829907 Log|dE|: 1.297 SCF: 58 Avg|F|: 2.215 Max|F|: 4.428 Vol.: 70.5 Mag: 7.30 Time: 2.94m
2 Energy: -19.226156 Log|dE|: -0.219 SCF: 200 Avg|F|: 1.281 Max|F|: 2.485 Vol.: 109.7 Mag: 6.00 Time: 10.18m
3 Energy: -21.747382 Log|dE|: 0.402 SCF: 200 Avg|F|: 0.510 Max|F|: 1.017 Vol.: 93.0 Mag: 8.00 Time: 9.99m
4 Energy: -21.877292 Log|dE|: -0.886 SCF: 25 Avg|F|: 0.120 Max|F|: 0.241 Vol.: 90.1 Mag: 8.00 Time: 1.25m
5 Energy: -21.937081 Log|dE|: -1.223 SCF: 18 Avg|F|: 0.116 Max|F|: 0.230 Vol.: 92.7 Mag: 8.00 Time: 0.88m
6 Energy: -21.955825 Log|dE|: -1.727 SCF: 9 Avg|F|: 0.006 Max|F|: 0.011 Vol.: 92.1 Mag: 8.00 Time: 0.46m
7 Energy: -21.962254 Log|dE|: -2.192 SCF: 10 Avg|F|: 0.038 Max|F|: 0.077 Vol.: 91.9 Mag: 8.00 Time: 0.51m
8 Energy: -21.964829 Log|dE|: -2.589 SCF: 9 Avg|F|: 0.065 Max|F|: 0.129 Vol.: 91.2 Mag: 8.00 Time: 0.46m
9 Energy: -21.967563 Log|dE|: -2.563 SCF: 9 Avg|F|: 0.053 Max|F|: 0.104 Vol.: 90.8 Mag: 8.00 Time: 0.45m
10 Energy: -21.965514 Log|dE|: -2.688 SCF: 8 Avg|F|: 0.051 Max|F|: 0.102 Vol.: 91.2 Mag: 8.00 Time: 0.40m
11 Energy: -21.953211 Log|dE|: -1.910 SCF: 12 Avg|F|: 0.028 Max|F|: 0.056 Vol.: 93.5 Mag: 8.00 Time: 0.60m
12 Energy: -21.985533 Log|dE|: -1.491 SCF: 12 Avg|F|: 0.073 Max|F|: 0.144 Vol.: 87.6 Mag: 8.00 Time: 0.61m
13 Energy: -21.995342 Log|dE|: -2.008 SCF: 9 Avg|F|: 0.074 Max|F|: 0.146 Vol.: 86.0 Mag: 8.00 Time: 0.46m
14 Energy: -22.032695 Log|dE|: -1.428 SCF: 12 Avg|F|: 0.087 Max|F|: 0.173 Vol.: 78.8 Mag: 8.00 Time: 0.61m
15 Energy: -22.010358 Log|dE|: -1.651 SCF: 12 Avg|F|: 0.051 Max|F|: 0.100 Vol.: 83.8 Mag: 8.00 Time: 0.61m
16 Energy: -22.023736 Log|dE|: -1.874 SCF: 10 Avg|F|: 0.048 Max|F|: 0.094 Vol.: 81.5 Mag: 8.00 Time: 0.51m
17 Energy: -22.037754 Log|dE|: -1.853 SCF: 12 Avg|F|: 0.062 Max|F|: 0.122 Vol.: 73.8 Mag: 8.00 Time: 0.61m
18 Energy: -22.034298 Log|dE|: -2.461 SCF: 12 Avg|F|: 0.030 Max|F|: 0.057 Vol.: 79.5 Mag: 8.00 Time: 0.61m
19 Energy: -22.039796 Log|dE|: -2.260 SCF: 12 Avg|F|: 0.025 Max|F|: 0.047 Vol.: 78.0 Mag: 8.00 Time: 0.59m
20 Energy: -22.043399 Log|dE|: -2.443 SCF: 12 Avg|F|: 0.017 Max|F|: 0.033 Vol.: 75.6 Mag: 8.00 Time: 0.60m
21 Energy: -22.043242 Log|dE|: -3.802 SCF: 12 Avg|F|: 0.013 Max|F|: 0.024 Vol.: 76.4 Mag: 8.00 Time: 0.58m
22 Energy: -22.043506 Log|dE|: -3.578 SCF: 11 Avg|F|: 0.011 Max|F|: 0.021 Vol.: 76.1 Mag: 8.00 Time: 0.53m
23 Energy: -22.043661 Log|dE|: -3.807 SCF: 11 Avg|F|: 0.007 Max|F|: 0.012 Vol.: 75.9 Mag: 8.00 Time: 0.53m
24 Energy: -22.043700 Log|dE|: -4.411 SCF: 11 Avg|F|: 0.001 Max|F|: 0.001 Vol.: 75.6 Mag: 8.00 Time: 0.53m
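The three logs above share one line format, so they can be compared programmatically. The sketch below parses that format and reports how many ionic steps hit the SCF cap and what the final maximum force was. The function names are hypothetical, and `scf_cap=200` is an assumption taken from the repeated `SCF: 200` entries in the logs.

```python
import re
from typing import Dict, List

# Matches the leading fields of one convergence-log line as printed above.
LINE_RE = re.compile(
    r"^\s*(?P<step>\d+)\s+Energy:\s*(?P<energy>-?\d+\.\d+)"
    r"\s+Log\|dE\|:\s*(?P<log_de>-?\d+\.\d+)"
    r"\s+SCF:\s*(?P<scf>\d+)"
    r"\s+Avg\|F\|:\s*(?P<avg_f>\d+\.\d+)"
    r"\s+Max\|F\|:\s*(?P<max_f>\d+\.\d+)"
)


def parse_log(text: str) -> List[Dict[str, float]]:
    """Return one dict per ionic step; lines that do not match are skipped."""
    steps = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        row = {key: float(value) for key, value in match.groupdict().items()}
        row["step"] = int(row["step"])
        row["scf"] = int(row["scf"])
        steps.append(row)
    return steps


def summarize(steps: List[Dict[str, float]], scf_cap: int = 200) -> Dict[str, object]:
    """Count steps that reached the SCF cap and report the last Max|F| value."""
    return {
        "n_steps": len(steps),
        "capped_steps": [s["step"] for s in steps if s["scf"] >= scf_cap],
        "final_max_force": steps[-1]["max_f"] if steps else None,
    }
```

Feeding each log to `summarize(parse_log(text))` gives a quick side-by-side view of the IBRION=2 and IBRION=1 runs.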
I'll resubmit this particular case with the workchain from the beginning with IBRION=1 and will update this comment, but so far it seems that in the case of ZBRENT ERROR this approach can be our first choice.
[UPDATE]
It worked nicely. The only trick is that we should avoid setting IBRION in the user-provided INCAR settings. Otherwise, it will overwrite whole stages. It needs to be in a separate protocol (consider this when writing the documentation, #14).
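One reading of the "overwrite whole stages" problem is sketched below. It assumes that the user-provided INCAR dictionary is merged on top of every stage's own settings. The stage values and function names are hypothetical and are not taken from the workchain.

```python
from typing import Dict, List, Set

Incar = Dict[str, object]

# Hypothetical per-stage settings: two relaxations followed by a static run.
STAGES: List[Incar] = [
    {"IBRION": 2, "ISIF": 2, "NSW": 100},
    {"IBRION": 2, "ISIF": 3, "NSW": 200},
    {"IBRION": -1, "NSW": 0},
]


def apply_user_incar(stages: List[Incar], user_incar: Incar) -> List[Incar]:
    """Merge user settings into every stage; user values win on conflicts."""
    return [{**stage, **user_incar} for stage in stages]


def stage_controlled_tags(stages: List[Incar], user_incar: Incar) -> Set[str]:
    """Return user tags that would override a value defined by some stage."""
    return {tag for tag in user_incar if any(tag in stage for stage in stages)}


merged = apply_user_incar(STAGES, {"IBRION": 1, "ENCUT": 600})
# Every stage now carries IBRION=1, including the static one.
conflicts = stage_controlled_tags(STAGES, {"IBRION": 1, "ENCUT": 600})
```

A check like `stage_controlled_tags` could be used to warn the user, while the stage-specific value itself lives in the protocol.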
The speed of the calculation also got me thinking about having IBRION=1 as the normal setting. By this I mean starting with it and, if it does not work, falling back to the more robust and expensive CG (IBRION=2).
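The idea floated here, trying IBRION=1 first and falling back to CG, could be sketched as follows. The `run` callable stands in for submitting a calculation and reporting success; it and the function name are hypothetical.

```python
from typing import Callable, Dict, Optional, Sequence

Incar = Dict[str, object]


def relax_with_fallback(
    run: Callable[[Incar], bool],
    base_incar: Incar,
    ibrion_order: Sequence[int] = (1, 2),
) -> Optional[int]:
    """Try each IBRION value in order and return the first one that succeeds.

    Returns None when every attempt fails, so the caller can stop and let
    the user inspect the calculation.
    """
    for ibrion in ibrion_order:
        incar = {**base_incar, "IBRION": ibrion}
        if run(incar):
            return ibrion
    return None
```

The default order `(1, 2)` encodes "fast first, robust second"; reversing it restores the current behaviour of starting from CG.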
Questions?
vasprun.xml
Although it (IBRION=1) worked nicely for one structure, there can be concerns about the structural changes, and it can also result in another error detected by VASP.
Going back to the VASP wisdom (decreasing EDIFF, as the error message suggests) seems a more reasonable choice, at least for now and for the majority of cases. We can improve the handler once we have more data points on the occasions when this may happen.
However, and related to #35, I have made a slight modification to the VaspBaseWorkChain and enabled it for this particular error. Now we have a dictionary that keeps track of how many times each handler is triggered. In this case, if it is the first time, we decrease EDIFF by two orders of magnitude and increment the count to 1. If this does not solve the issue, the next time we do not apply the fix and VaspBaseWorkChain will terminate the workchain, so the user can investigate it in more detail.
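The counting logic described above can be sketched in plain Python as follows. This is an illustration only, not the actual VaspBaseWorkChain code: the function name, the `"zbrent"` key and the return convention are assumptions.

```python
from typing import Dict, Tuple

Incar = Dict[str, object]


def handle_zbrent(
    incar: Incar,
    handler_counts: Dict[str, int],
    name: str = "zbrent",
) -> Tuple[Incar, bool]:
    """Apply the EDIFF fix once; on the second trigger ask the caller to stop.

    Returns (incar_to_use, keep_going). `handler_counts` is updated in place
    and records how many times each handler has been triggered.
    """
    if handler_counts.get(name, 0) >= 1:
        # The fix was already tried and did not help: do not apply it again.
        return incar, False

    handler_counts[name] = handler_counts.get(name, 0) + 1
    fixed = dict(incar)
    # Decrease EDIFF by two orders of magnitude.
    fixed["EDIFF"] = float(incar["EDIFF"]) * 1e-2
    return fixed, True
```

Keeping the counter outside the handler means the same dictionary can serve every handler, with one key per error type.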
I noticed a case, when performing calculations for the Li2FeP2S6 project, where the calculation failed but VaspMultiStageWorkChain did not capture the issue properly and did not report a relevant log message to make debugging the error easier. The workchain failed in the inspect_relax step with the following message:

I traced it back and found that the OUTCAR is incomplete (so the job crashed), and the actual error can be found at the end of _scheduler-stdout.txt as:

According to the VASP mailing list, it can be resolved by changing the IBRION tag to 1 (i.e. RMM-DIIS), having ADDGRID=True, and increasing ENCUT. We already have the latter two, and it seems that playing around with IBRION is the way to go to solve this issue.

So, I'm going to do the following to confirm the solution and later implement the fix and handler:
[x] Resubmit the calculation using IBRION=1
[x] Investigate the results
[x] Implement the error capture and a fixer in workchain
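The error-capture item in the checklist above could start from something like the sketch below. It scans the tail of `_scheduler-stdout.txt` for a line mentioning ZBRENT. Matching on the bare substring `"ZBRENT"` is an assumption, since the exact message is not reproduced in this issue, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_zbrent_line(stdout_text: str, tail_lines: int = 50) -> Optional[str]:
    """Return the last line in the tail of the output that mentions ZBRENT."""
    tail = stdout_text.splitlines()[-tail_lines:]
    for line in reversed(tail):
        if "ZBRENT" in line:
            return line.strip()
    return None


def check_scheduler_stdout(
    folder: str, filename: str = "_scheduler-stdout.txt"
) -> Optional[str]:
    """Read the scheduler stdout from a retrieved folder, if it exists."""
    path = Path(folder) / filename
    if not path.is_file():
        return None
    return find_zbrent_line(path.read_text(errors="replace"))
```

Only the tail is inspected because the error is reported at the end of the file, right before the job crashes.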