ezpzbz / aiida-catmat

Collection of AiiDA WorkChains Developed in CATMAT project
MIT License

Better handling of ZBRENT: fatal error in bracketing #5

Open ezpzbz opened 4 years ago

ezpzbz commented 4 years ago

While running calculations for the Li2FeP2S6 project, I noticed a case that failed without VaspMultiStageWorkChain capturing the issue properly or reporting a log message that would make debugging easier. The workchain failed in the inspect_relax step with the following message:

Excepted <  File "<string>", line None
             xml.etree.ElementTree.ParseError: no element found: line 15612, column 0
             >
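A truncated vasprun.xml can be spotted before the parser raises inside the workchain. The following is only a minimal sketch using the standard library; the function name `vasprun_is_complete` is hypothetical and not part of aiida-catmat, and it only checks that the XML is well formed, not that the calculation converged.

```python
import xml.etree.ElementTree as ET


def vasprun_is_complete(path):
    """Return True if the file at `path` parses as well-formed XML.

    Hypothetical helper: a crashed VASP job usually leaves vasprun.xml
    without its closing tags, which makes ElementTree raise ParseError.
    """
    try:
        ET.parse(path)
    except ET.ParseError:
        # Typically "no element found", as in the traceback above.
        return False
    return True
```

A check like this would let the workchain report "incomplete vasprun.xml" instead of an unhandled exception.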

I traced it back and found that the OUTCAR is incomplete (so the job crashed). The actual error appears at the end of _scheduler-stdout.txt:

curvature:   0.00 expect dE= 0.107-305 dE for cont linesearch  0.104-305
 ZBRENT: fatal error in bracketing
     please rerun with smaller EDIFF, or copy CONTCAR
     to POSCAR and continue
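Since the message sits at the end of the scheduler stdout, a handler could detect it with a simple text scan. This is a sketch under assumptions: `has_zbrent_error` and the `tail_lines` default are illustrative choices, not the actual aiida-catmat implementation.

```python
ZBRENT_MESSAGE = "ZBRENT: fatal error in bracketing"


def has_zbrent_error(stdout_text, tail_lines=50):
    """Return True if the ZBRENT message is in the last `tail_lines` lines.

    Hypothetical helper: only the tail is scanned because the error is
    printed right before VASP stops.
    """
    tail = stdout_text.splitlines()[-tail_lines:]
    return any(ZBRENT_MESSAGE in line for line in tail)
```

It would be called with the content of _scheduler-stdout.txt read as a string.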

According to the VASP mailing list, it can be resolved by changing the IBRION tag to 1 (i.e. RMM-DIIS), setting ADDGRID=True, and increasing ENCUT. We already have the latter two, so changing IBRION seems to be the way to solve this issue.

So, I'm going to do the following to confirm the solution and later implement the fix and handler:

alexsquires commented 4 years ago

Other things that I have noticed help with this specific issue are increasing NELMIN or, indeed, decreasing EDIFF as suggested by the error message.

ezpzbz commented 4 years ago

This is a nice situation, @alexsquires. As far as I remember from the AiiDA Hackathon and the concept of error handlers, we can have several solutions with different priorities for the very same issue. Once we move to using the BaseRestartWorkChain of aiida-core, as mentioned in #6, we can arrange a meeting to brainstorm our strategy for including handlers.
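The idea of several fixes with different priorities for one error can be sketched in plain Python. Everything here is an assumption for illustration: the function names, the priority numbers, their order, and the NELMIN value of 8 are not taken from aiida-catmat or aiida-core; only the three candidate fixes (EDIFF, NELMIN, IBRION) come from this thread.

```python
def fix_ediff(incar):
    # Tighten EDIFF by one order of magnitude (VASP default is 1E-4).
    return {**incar, "EDIFF": incar.get("EDIFF", 1e-4) * 0.1}


def fix_nelmin(incar):
    # Illustrative value: raise NELMIN to at least 8 (VASP default is 2).
    return {**incar, "NELMIN": max(incar.get("NELMIN", 2), 8)}


def fix_ibrion(incar):
    # Switch the ionic relaxation to RMM-DIIS.
    return {**incar, "IBRION": 1}


# (priority, fix) pairs; the ordering is an assumption, not a recommendation.
ZBRENT_FIXES = [(100, fix_ibrion), (300, fix_ediff), (200, fix_nelmin)]


def next_fix(attempt):
    """Return the fix for the given attempt (0-based), highest priority first.

    Returns None when every registered fix has already been tried.
    """
    ordered = sorted(ZBRENT_FIXES, key=lambda pair: pair[0], reverse=True)
    if attempt >= len(ordered):
        return None
    return ordered[attempt][1]
```

With this layout, reordering the strategy only means changing the priority numbers, which is the main appeal of priority-based handlers.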

ezpzbz commented 4 years ago

I have added plenty of handlers from custodian to the BaseRestartWorkChain. This case should be handled now; if not, implementing the handler is no longer an issue.

ezpzbz commented 4 years ago

It falls into the category of errors that result in an incomplete vasprun.xml, as explained in #25. I'm reopening this until it is fixed!

ezpzbz commented 4 years ago

I have now hit this issue with proper logs. The initial settings, i.e. IBRION=2 and EDIFF=1E-06, produced the error. Below is the convergence log of the first calculation:

   1  Energy:   -19.834370  Log|dE|:  1.297  SCF:  13  Avg|F|:  2.237  Max|F|:  4.472  Vol.:  70.5  Mag:   7.33  Time:  0.67m
   2  Energy:   -17.801348  Log|dE|:  0.308  SCF: 200  Avg|F|:  1.300  Max|F|:  2.490  Vol.: 109.2  Mag:   4.00  Time:  9.98m
   3  Energy:   -18.894443  Log|dE|:  0.039  SCF: 200  Avg|F|:  0.667  Max|F|:  1.207  Vol.:  94.1  Mag:   4.00  Time: 11.69m
   4  Energy:   -19.095347  Log|dE|: -0.697  SCF: 200  Avg|F|:  0.410  Max|F|:  0.578  Vol.:  88.9  Mag:   4.00  Time: 10.05m
   5  Energy:   -19.424230  Log|dE|: -0.483  SCF: 200  Avg|F|:  0.376  Max|F|:  0.440  Vol.:  85.0  Mag:   4.00  Time: 10.16m
   6  Energy:   -19.482202  Log|dE|: -1.237  SCF: 179  Avg|F|:  0.517  Max|F|:  0.717  Vol.:  85.9  Mag:   4.00  Time:  9.09m
   7  Energy:   -19.597861  Log|dE|: -0.937  SCF: 200  Avg|F|:  0.379  Max|F|:  0.541  Vol.:  85.3  Mag:   3.92  Time: 10.15m
   8  Energy:   -19.629745  Log|dE|: -1.496  SCF:  85  Avg|F|:  0.292  Max|F|:  0.469  Vol.:  85.1  Mag:   3.91  Time:  4.31m
   9  Energy:   -19.638833  Log|dE|: -2.042  SCF: 200  Avg|F|:  0.225  Max|F|:  0.424  Vol.:  85.0  Mag:   3.86  Time: 10.14m
  10  Energy:   -19.649774  Log|dE|: -1.961  SCF: 200  Avg|F|:  0.238  Max|F|:  0.460  Vol.:  85.0  Mag:   3.90  Time: 12.31m
  11  Energy:   -19.647889  Log|dE|: -2.725  SCF: 200  Avg|F|:  0.214  Max|F|:  0.383  Vol.:  85.0  Mag:   3.99  Time: 10.18m
  12  Energy:   -19.642445  Log|dE|: -2.264  SCF: 200  Avg|F|:  0.201  Max|F|:  0.342  Vol.:  85.0  Mag:   3.95  Time: 10.09m
  13  Energy:   -19.649912  Log|dE|: -2.127  SCF: 200  Avg|F|:  0.221  Max|F|:  0.394  Vol.:  85.0  Mag:   3.97  Time: 10.04m
  14  Energy:   -19.648383  Log|dE|: -2.815  SCF: 126  Avg|F|:  0.225  Max|F|:  0.405  Vol.:  85.0  Mag:   3.94  Time:  6.34m
  15  Energy:   -19.650251  Log|dE|: -2.728  SCF: 169  Avg|F|:  0.229  Max|F|:  0.417  Vol.:  85.0  Mag:   3.94  Time: 10.97m
  16  Energy:   -19.804159  Log|dE|: -0.813  SCF: 126  Avg|F|:  1.788  Max|F|:  3.548  Vol.:  70.5  Mag:   6.24  Time:  6.74m

Then our handler decreased EDIFF to 1E-07 and reran the calculation, which failed with the following convergence log:

   1  Energy:   -19.834371  Log|dE|:  1.297  SCF:  14  Avg|F|:  2.236  Max|F|:  4.471  Vol.:  70.5  Mag:   7.33  Time:  0.74m
   2  Energy:   -17.794268  Log|dE|:  0.310  SCF: 140  Avg|F|:  1.238  Max|F|:  2.473  Vol.: 109.2  Mag:   4.00  Time:  7.21m
   3  Energy:   -18.887733  Log|dE|:  0.039  SCF:  23  Avg|F|:  0.606  Max|F|:  1.207  Vol.:  94.2  Mag:   4.00  Time:  1.17m
   4  Energy:   -19.077956  Log|dE|: -0.721  SCF: 200  Avg|F|:  0.283  Max|F|:  0.564  Vol.:  89.0  Mag:   4.00  Time: 10.35m
   5  Energy:   -19.122744  Log|dE|: -1.349  SCF: 200  Avg|F|:  0.068  Max|F|:  0.129  Vol.:  86.0  Mag:   4.00  Time: 10.45m
   6  Energy:   -19.138645  Log|dE|: -1.799  SCF: 200  Avg|F|:  0.245  Max|F|:  0.425  Vol.:  87.4  Mag:   4.00  Time: 10.47m
   7  Energy:   -19.171989  Log|dE|: -1.477  SCF: 200  Avg|F|:  0.263  Max|F|:  0.296  Vol.:  86.9  Mag:   4.00  Time: 10.41m
   8  Energy:   -19.499785  Log|dE|: -0.484  SCF: 200  Avg|F|:  0.639  Max|F|:  0.698  Vol.:  87.2  Mag:   4.00  Time: 10.41m
   9  Energy:   -19.524102  Log|dE|: -1.614  SCF: 200  Avg|F|:  0.527  Max|F|:  0.643  Vol.:  87.0  Mag:   4.00  Time: 10.41m
  10  Energy:   -19.540578  Log|dE|: -1.783  SCF: 200  Avg|F|:  0.449  Max|F|:  0.590  Vol.:  86.9  Mag:   4.00  Time: 10.42m
  11  Energy:   -19.584606  Log|dE|: -1.356  SCF: 200  Avg|F|:  0.217  Max|F|:  0.411  Vol.:  86.9  Mag:   4.00  Time: 10.40m
  12  Energy:   -19.588197  Log|dE|: -2.445  SCF:  36  Avg|F|:  0.210  Max|F|:  0.405  Vol.:  86.9  Mag:   4.00  Time:  1.90m
  13  Energy:   -19.588436  Log|dE|: -3.622  SCF:  39  Avg|F|:  0.208  Max|F|:  0.408  Vol.:  86.9  Mag:   4.00  Time:  1.97m
  14  Energy:   -19.588499  Log|dE|: -4.204  SCF:  11  Avg|F|:  0.208  Max|F|:  0.409  Vol.:  86.9  Mag:   4.00  Time:  0.56m
  15  Energy:   -19.588569  Log|dE|: -4.152  SCF:  71  Avg|F|:  0.200  Max|F|:  0.398  Vol.:  86.9  Mag:   4.00  Time:  3.66m
  16  Energy:   -19.588517  Log|dE|: -4.288  SCF:  10  Avg|F|:  0.202  Max|F|:  0.402  Vol.:  86.9  Mag:   4.00  Time:  0.50m
  17  Energy:   -19.588518  Log|dE|: -6.244  SCF:   3  Avg|F|:  0.202  Max|F|:  0.402  Vol.:  86.9  Mag:   4.00  Time:  0.14m
  18  Energy:   -19.588518  Log|dE|: -6.523  SCF:   3  Avg|F|:  0.202  Max|F|:  0.402  Vol.:  86.9  Mag:   4.00  Time:  0.13m
  19  Energy:   -19.588519  Log|dE|: -6.569  SCF:   3  Avg|F|:  0.202  Max|F|:  0.402  Vol.:  86.9  Mag:   4.00  Time:  0.13m
  20  Energy:   -19.804371  Log|dE|: -0.666  SCF:  36  Avg|F|:  1.800  Max|F|:  3.599  Vol.:  70.5  Mag:   6.31  Time:  1.95m

As explained in #35, it then repeated the same calculation another four times with the same outcome. I manually tested this case by setting IBRION=1 and keeping EDIFF at 1E-06. It worked; look at the log:

   1  Energy:   -19.829907  Log|dE|:  1.297  SCF:  58  Avg|F|:  2.215  Max|F|:  4.428  Vol.:  70.5  Mag:   7.30  Time:  2.94m
   2  Energy:   -19.226156  Log|dE|: -0.219  SCF: 200  Avg|F|:  1.281  Max|F|:  2.485  Vol.: 109.7  Mag:   6.00  Time: 10.18m
   3  Energy:   -21.747382  Log|dE|:  0.402  SCF: 200  Avg|F|:  0.510  Max|F|:  1.017  Vol.:  93.0  Mag:   8.00  Time:  9.99m
   4  Energy:   -21.877292  Log|dE|: -0.886  SCF:  25  Avg|F|:  0.120  Max|F|:  0.241  Vol.:  90.1  Mag:   8.00  Time:  1.25m
   5  Energy:   -21.937081  Log|dE|: -1.223  SCF:  18  Avg|F|:  0.116  Max|F|:  0.230  Vol.:  92.7  Mag:   8.00  Time:  0.88m
   6  Energy:   -21.955825  Log|dE|: -1.727  SCF:   9  Avg|F|:  0.006  Max|F|:  0.011  Vol.:  92.1  Mag:   8.00  Time:  0.46m
   7  Energy:   -21.962254  Log|dE|: -2.192  SCF:  10  Avg|F|:  0.038  Max|F|:  0.077  Vol.:  91.9  Mag:   8.00  Time:  0.51m
   8  Energy:   -21.964829  Log|dE|: -2.589  SCF:   9  Avg|F|:  0.065  Max|F|:  0.129  Vol.:  91.2  Mag:   8.00  Time:  0.46m
   9  Energy:   -21.967563  Log|dE|: -2.563  SCF:   9  Avg|F|:  0.053  Max|F|:  0.104  Vol.:  90.8  Mag:   8.00  Time:  0.45m
  10  Energy:   -21.965514  Log|dE|: -2.688  SCF:   8  Avg|F|:  0.051  Max|F|:  0.102  Vol.:  91.2  Mag:   8.00  Time:  0.40m
  11  Energy:   -21.953211  Log|dE|: -1.910  SCF:  12  Avg|F|:  0.028  Max|F|:  0.056  Vol.:  93.5  Mag:   8.00  Time:  0.60m
  12  Energy:   -21.985533  Log|dE|: -1.491  SCF:  12  Avg|F|:  0.073  Max|F|:  0.144  Vol.:  87.6  Mag:   8.00  Time:  0.61m
  13  Energy:   -21.995342  Log|dE|: -2.008  SCF:   9  Avg|F|:  0.074  Max|F|:  0.146  Vol.:  86.0  Mag:   8.00  Time:  0.46m
  14  Energy:   -22.032695  Log|dE|: -1.428  SCF:  12  Avg|F|:  0.087  Max|F|:  0.173  Vol.:  78.8  Mag:   8.00  Time:  0.61m
  15  Energy:   -22.010358  Log|dE|: -1.651  SCF:  12  Avg|F|:  0.051  Max|F|:  0.100  Vol.:  83.8  Mag:   8.00  Time:  0.61m
  16  Energy:   -22.023736  Log|dE|: -1.874  SCF:  10  Avg|F|:  0.048  Max|F|:  0.094  Vol.:  81.5  Mag:   8.00  Time:  0.51m
  17  Energy:   -22.037754  Log|dE|: -1.853  SCF:  12  Avg|F|:  0.062  Max|F|:  0.122  Vol.:  73.8  Mag:   8.00  Time:  0.61m
  18  Energy:   -22.034298  Log|dE|: -2.461  SCF:  12  Avg|F|:  0.030  Max|F|:  0.057  Vol.:  79.5  Mag:   8.00  Time:  0.61m
  19  Energy:   -22.039796  Log|dE|: -2.260  SCF:  12  Avg|F|:  0.025  Max|F|:  0.047  Vol.:  78.0  Mag:   8.00  Time:  0.59m
  20  Energy:   -22.043399  Log|dE|: -2.443  SCF:  12  Avg|F|:  0.017  Max|F|:  0.033  Vol.:  75.6  Mag:   8.00  Time:  0.60m
  21  Energy:   -22.043242  Log|dE|: -3.802  SCF:  12  Avg|F|:  0.013  Max|F|:  0.024  Vol.:  76.4  Mag:   8.00  Time:  0.58m
  22  Energy:   -22.043506  Log|dE|: -3.578  SCF:  11  Avg|F|:  0.011  Max|F|:  0.021  Vol.:  76.1  Mag:   8.00  Time:  0.53m
  23  Energy:   -22.043661  Log|dE|: -3.807  SCF:  11  Avg|F|:  0.007  Max|F|:  0.012  Vol.:  75.9  Mag:   8.00  Time:  0.53m
  24  Energy:   -22.043700  Log|dE|: -4.411  SCF:  11  Avg|F|:  0.001  Max|F|:  0.001  Vol.:  75.6  Mag:   8.00  Time:  0.53m

I'll resubmit this particular case with the workchain from the beginning with IBRION=1 and will update this comment, but so far it seems that in the case of the ZBRENT error, this approach can be our first choice.

ezpzbz commented 4 years ago

[UPDATE] It worked nicely. The only trick is that we should avoid setting IBRION in the user-provided INCAR settings; otherwise, it will overwrite IBRION in all stages. It needs to be in a separate protocol (consider this when writing the documentation, #14).

The speed of the calculation also got me thinking about having IBRION=1 as the normal setting. By this I mean starting with it and, if it does not work, falling back to the more robust and expensive CG. Questions?

ezpzbz commented 4 years ago

Although IBRION=1 worked nicely for one structure, there can be concerns about the structural changes, and it can also result in another error detected by VASP. Going back to the VASP wisdom seems a more reasonable choice (at least for now and for the majority of cases). We can improve the handler once we have more data points on the occasions when this may happen.

However, and related to #35, I have made a slight modification to the VaspBaseWorkChain and enabled it for this particular error. Now we have a dictionary that keeps track of how many times each handler is triggered. In this case, if it is the first time, we decrease EDIFF by two orders of magnitude and increment the count to 1. If this does not solve the issue, the next time we do not apply the fix and VaspBaseWorkChain terminates the workchain, so the user can investigate it in more detail.
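The counting logic described here can be sketched as follows. This is not the actual VaspBaseWorkChain code: `handle_zbrent`, the `"zbrent"` key, and the `(incar, should_restart)` return convention are hypothetical names chosen for the illustration.

```python
def handle_zbrent(incar, trigger_counts, name="zbrent"):
    """Apply the ZBRENT fix at most once.

    `trigger_counts` is a dict mapping handler names to how many times
    they have fired. Returns (incar, should_restart); when should_restart
    is False the caller is expected to stop the workchain.
    """
    if trigger_counts.get(name, 0) >= 1:
        # The fix was already tried once: do not apply it again.
        return incar, False
    trigger_counts[name] = trigger_counts.get(name, 0) + 1
    new_incar = dict(incar)
    # Decrease EDIFF by two orders of magnitude, e.g. 1E-06 -> 1E-08.
    new_incar["EDIFF"] = incar["EDIFF"] * 1e-2
    return new_incar, True
```

Keeping the counter in a plain dictionary keyed by handler name means the same mechanism can cap retries for other handlers too, which is what avoids the repeated identical restarts seen in #35.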