crest-lab / crest

CREST - A program for the automated exploration of low-energy molecular chemical space.
https://crest-lab.github.io/crest-docs/
GNU Lesser General Public License v3.0
198 stars 42 forks source link

QCG Esolv non-reproduciliblity #181

Closed alan-arnold closed 8 months ago

alan-arnold commented 1 year ago

Hello again Christoph. You might recall that I've been using crest/xtb to investigate at the solvation energies and host-guest binding properties of a range of aquated cucurbiturils.

In the particular series of calculations described below, I've been trying to calculate the solvation energy of cucurbit[6]uril (Q6), with limited success.

All runs were performed with the crest 2.12 and xtb 6.5.1 codes downloaded on 24 Dec 2022 and compiled with v2021.6.0 ifort/icc from the 2022.2.0 Intel oneAPI toolkits, and the precompiled version of xtbiff 1.1, under Ubuntu 22.04. This is the same setup that you helped me get working in https://github.com/crest-lab/crest/issues/163

I performed four sets of three 5-ps QCG gsolv calculations. In each set, the 1st run grows a 100-water cluster around Q6 then the 2nd and 3rd runs were restarts, using the same grow cluster as the 1st run. In all runs, the --nofix option was used to permit relaxation of the Q6 solute in the solvation process.

For each of these 4 sets of 3 calculations, 4 final clusters are produced by default, from which Esolv is calculated, followed by frequency calulations to evaluate Gsolv and Hsolv of the Q6 solute

The data below are the means and standard deviations of the thermochemical properties reported by these 12 frequency calculations for the gas-phase solute, and 4 solute- and 4 solvent-clusters, of the 3 replicate runs for each of the 4 grow sets

In the xTB-IFF-grown cluster, 4 waters are placed inside the Q6 cavity. In each of the 3 aISS-grown clusters, 2 additional waters are in the cavity. In all clusters, most of the remaining waters are tightly around the two 6-oxygen portals, with the remainder spread over the outside of the more hydrophobic solute walls. Although the 6-O portals are completely solvated, the exterior walls are not quite completely covered with waters.

Despite these reservations, the intra-set and inter-set reproducibilty of these 12 frequency calculations of 100-water solvations is remarkable.

Because the Q6 solute is quite rigid, a single low-energy conformer results in all calculations (a 2nd slightly-higher conformer is observed in one of the 3 runs of the 1st set), and the thermochemical properties of the solute- and solvent-clusters for the 4 sets seem to be essentially statistically identical.

What is surprising is that the 4th set, which does NOT explicitly call for alpb solvation, is identical to the other three, which do explicitly call for alpb in the command line.

1st set of 3 calculations (xTB-IFF used to grow a 1st cluster, with alpb implicit solvation):

crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv --nsolv 100 --alpb h2o --mdlen 5 --xtbiff --nofix -keepdir followed by 2 runs of crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv --alpb h2o --mdlen 5 --xtbiff --nofix -keepdir

Solute  Gas properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
mean    530.00  201.49  35.35   46.55   445.51
sd  1.58    0.15    0.02    0.00    1.61

Solute  cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   2202.60 1206.31 45.40   49.63   1814.61
2   2202.82 1205.87 45.37   49.63   1814.97
3   2203.43 1210.31 45.38   49.63   1814.25
4   2203.15 1204.30 45.38   49.63   1815.77
mean    2203.00 1206.70 45.38   49.63   1814.90
sd  0.37    2.56    0.01    0.00    0.65

Solvent cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   1667.47 925.65  43.65   48.31   1364.07
2   1666.82 917.37  43.62   48.31   1365.90
3   1667.00 927.13  43.61   48.31   1363.17
4   1667.00 923.03  43.64   48.31   1364.38
mean    1667.07 923.30  43.63   48.31   1364.38
sd  0.28    4.30    0.02    0.00    1.14

2nd set of 3 calculations (aISS used to grow a 2nd cluster, with alpb implicit solvation):

crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv --nsolv 100 --alpb h2o --mdlen 5 --nofix -keepdir followed by 2 runs of crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv --alpb h2o --mdlen 5 --nofix -keepdir

Solute  Gas properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
mean    528.18  201.60  35.32   46.55   443.67
sd  0.00    0.00    0.00    0.00    0.00

Solute  cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   2198.60 1210.78 45.36   49.63   1809.29
2   2198.40 1209.58 45.38   49.63   1809.44
3   2198.56 1211.06 45.40   49.63   1809.15
4   2198.26 1206.93 45.38   49.63   1810.09
mean    2198.46 1209.59 45.38   49.63   1809.49
sd  0.16    1.88    0.02    0.00    0.41

Solvent cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   1666.81 925.24  43.68   48.31   1363.52
2   1666.94 927.70  43.67   48.31   1362.92
3   1666.78 926.75  43.63   48.31   1363.05
4   1666.80 928.88  43.68   48.31   1362.43
mean    1666.83 927.14  43.67   48.31   1362.98
sd  0.07    1.54    0.03    0.00    0.45

3rd set of 3 calculations (aISS again used to grow a 3rd cluster, again with alpb implicit solvation):

crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv --nsolv 100 --alpb h2o --mdlen 5 --nofix -keepdir followed by 2 runs of crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv --alpb h2o --mdlen 5 --nofix -keepdir

Solute  Gas properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
mean    528.18  201.61  35.32   46.55   443.66
sd  0.00    0.02    0.00    0.00    0.01

Solute  cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   2198.16 1210.95 45.34   49.63   1808.80
2   2197.49 1209.10 45.41   49.63   1808.67
3   2197.26 1205.28 45.36   49.63   1809.59
4   2197.81 1207.35 45.36   49.63   1809.52
mean    2197.68 1208.17 45.37   49.63   1809.14
sd  0.39    2.42    0.03    0.00    0.48

Solvent cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   1667.08 918.25  43.64   48.31   1365.89
2   1666.86 925.15  43.61   48.31   1363.62
3   1666.65 929.35  43.62   48.31   1362.15
4   1666.88 924.31  43.61   48.31   1363.89
mean    1666.87 924.27  43.62   48.31   1363.89
sd  0.17    4.58    0.01    0.00    1.54

4th set of 3 calculations (aISS used to grow a 4th cluster, but NOT with alpb implicit solvation):

crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv h2o --nsolv 100 --mdlen 5 --nofix -keepdir followed by 2 runs of crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv h2o --mdlen 5 --nofix -keepdir

Solute  Gas properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
mean    528.18  201.61  35.32   46.55   443.66
sd  0.00    0.02    0.00    0.00    0.01

Solute  cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   2198.62 1204.25 45.32   49.63   1811.27
2   2198.22 1207.02 45.37   49.63   1810.03
3   2198.09 1212.19 45.37   49.63   1808.35
4   2198.33 1206.21 45.35   49.63   1810.38
mean    2198.32 1207.42 45.35   49.63   1810.01
sd  0.23    3.39    0.02    0.00    1.22

Solvent cluster properties          
#   H(T)    SVIB    SROT    STRA    G(T)
    [kcal/mol]  [ cal/mol/K ]   [kcal/mol]
--------------------------------------------------------                    
1   1667.30 928.48  43.62   48.31   1363.07
2   1667.02 924.78  43.71   48.31   1363.86
3   1666.82 928.72  43.58   48.31   1362.52
4   1667.00 926.24  43.63   48.31   1363.43
mean    1667.04 927.06  43.63   48.31   1363.22
sd  0.20    1.88    0.06    0.00    0.57

That's the good news.

The not-so-good news is that Esolv (and consequently, Gsolv and Hsolv) is not reproducible at all. Indeed, Esolv has a spread of 88 kcal/mol across the 12 runs!!

1st set of 3 calculations (xTB-IFF used to grow a 1st cluster, with alpb implicit solvation): Esolv = -144.72(25), -160.39(25), -216.44(50) kcal/mol

2nd set of 3 calculations (aISS used to grow a 2nd cluster, with alpb implicit solvation): Esolv = -131.28(50), -143.34(25), -138.01(25) kcal/mol

3rd set of 3 calculations (aISS again used to grow a 3rd cluster, again with alpb implicit solvation): Esolv = -136.87(76), -128.45(75), -131.62(25) kcal/mol

4th set of 3 calculations (aISS used to grow a 4th cluster, but NOT with alpb implicit solvation): Esolv = -169.76(26), -169.64(50), -187.22(50) kcal/mol

The numbers in parentheses are the number of structures within a 3.00 kcal/mol window in the final crest_rotamers_5.xyz

So, I have a few questions that I hope you can help me with:

  1. Would you expect the Esolv results to be more reproducible, given the highly consistent values of the thermochemical properties found in the frequency calculations and the rigidity of the Q6 solute? Don't the frequency calculations suggest that the 12x4 100-water+Q6 clusters have very similar structures, and therefore should have similar Esolv's??

  2. Why is the number of structures within a 3.00 kcal/mol window in the final crest_rotamers_5.xyz ensembles always close to a multiple of 25? Given the enormous conformational space of these 100-water clusters I would have expected to see much more diversity in the number of structures within the 3kcal/mol window - perhaps always more than 25, by why not 37 or 42 or 131 ...?

  3. Why are the thermochemical properties from the frequency calculations for the 4th set, which does not call for alpb solvation in the command line, so similar to the other 3 sets which do use alpb? Is alpb used by default, even if you don't call for it?

  4. When xtbiff was used to grow a 100-water cluster (Set 1) the grow time was 5h 42m. When aISS was used (Sets 2, 3 and 4) the grow times were 25h 22m, 24h 44m and 26h 53m, respectively. Would you expect the new aISS algorithm to be so much slower than xtbiff to grow these clusters??

As usual, I've used -keepdir so a lot of intermediate files are available if you need me to check something.

alan-arnold commented 1 year ago

Well this is weird, and more than a little embarrassing. I really should look more closely at the beginnng of the output files, instead of jumping to the business end.

It turns out that the 1st two runs of the 1st set of three ie cluster growth using --xtbiff, and the 1st restart, which were both done in a particular working directory, correctly have charge =0 for Q6.

After these 2 runs, I changed to a new working directory, on a NVMe SSD drive, to see if run times might improve. (They didn't :)

Although they used exactly the same Q6.xyz and h2o.xyz structure files as these 1st two runs, these later 10 runs incorrectly use charge=1, which apparently was read from a .CHRG file in this new working directory. I cannot recall having created this file, but presumably I must have at some stage for another previous project. In my defence, a file starting with a period is hidden in MacOS (under which Ubuntu is running in my case), making this a bit tricky to find!!

As an aside, in all 12 runs, uhf is reported as zero, which is not consistent with charge=0 for two runs and charge=1 for the others, so it looks like neither crest nor xtb checks for this inconsistency. Perhaps something for a future feature request.

And I'm not sure if this charge difference accounts for the inconsistent Esolv values. But, just in case, I've deleted the offending hidden .CHRG file and will do another set of 3 runs, using --xtbiff to create a cluster (much quicker than aISS), followed by 2 restarts, and see if that gives Esolv consistent with Set 1.

I'll report back ...

alan-arnold commented 1 year ago

Well that was one experiment too many!! Run number 13 broke the sequence of reproducible frequency calculations (but after deletion of the hidden .CHRG file, charge=0 as expected). Like the 1st --xtbiff run (Set 1), the Grow time was again about 5hr.

crest Q6.xyz --T 24 --qcg h2o.xyz --gsolv --nsolv 100 --alpb h2o --mdlen 5 --xtbiff --nofix -keepdir

_________________     Solvent Cluster Generation   _____________________

Method for CFF: GFN2-xTB

  =========================================
  |   quantum cluster growth: CFF         |
  =========================================

 CUT-FREEZE-FILL Algorithm to generate reference solvent cluster
 now adding solvents to fill cluster...
 Size  Cluster   E /Eh        De/kcal   Detot/kcal  Opt
 ------------------------------------------------------------------------
 adding solvent is repulsive for cluster: 1   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
 previous cluster taken...
101       2   -514.110821    -26.04      -26.04    tight
101       3   -514.099342    -25.94      -25.94    tight
101       4   -514.108089    -26.08      -26.08    tight
 ------------------------------------------------------------------------
 Starting optimizations + SP  of structures
4 jobs to do.

done.

Cluster   E /Eh        Density  Efix       R   av/act. Surface   Opt
  1      -509.378575   1.309    0.000     0.0   0.0     5136.4   tight
  2      -509.020614   1.151    0.000     0.0   0.0     8254.8   tight
  3      -509.009250   1.148    0.000     0.0   0.0     8242.6   tight
  4      -509.017909   1.153    0.000     0.0   0.0     8178.4   tight

 ------------------------------------------------------------------------
  ------------------------------------------------------------------------
  Boltz. averaged energy of final cluster:
       G /Eh     : -509.37857471
       T*S /kcal :   0.000

  Solvent cluster generation finished.
  Results can be found in solvent_cluster directory
  Structures in file <crest_ensemble.xyz>
  Energies in file <cluster_energy.dat>
  Population in file <population.dat>

    =========================================
    |   quantum cluster growth: ESOLV       |
    |                                       |
    |           -86.31 kcal/mol             |
    =========================================

    =========================================
    |          Frequency evaluation         |
    =========================================

   SOLUTE MOLECULE
  Starting reoptimizations + Frequency computation of ensemble
  1 jobs to do.

  done.
   SOLUTE CLUSTER
  Starting reoptimizations + Frequency computation of ensemble
  4 jobs to do.

  done.
   SOLVENT CLUSTER
  Starting reoptimizations + Frequency computation of ensemble
  4 jobs to do.

  done.

   Solute Gas properties
  #       H(T)       SVIB      SROT       STRA      G(T)
       [kcal/mol]    [      cal/mol/K        ]    [kcal/mol]
  --------------------------------------------------------
      530.91    201.34     35.36     46.55    446.46

   Solute cluster properties
  #       H(T)       SVIB      SROT       STRA      G(T)
       [kcal/mol]    [      cal/mol/K        ]    [kcal/mol]
  --------------------------------------------------------
  1     2205.99   1202.27     45.37     49.63   1819.21
  2     2205.03   1212.99     45.34     49.63   1815.07
  3     2205.64   1210.72     45.34     49.63   1816.35
  4     2205.24   1208.10     45.36     49.63   1816.72

   Solvent cluster properties
  #       H(T)       SVIB      SROT       STRA      G(T)
       [kcal/mol]    [      cal/mol/K        ]    [kcal/mol]
  --------------------------------------------------------
  1        0.00      0.00      0.00      0.00      0.00 <<<<<<<<<<<<<<<<<<<<<
  2     1667.31    920.13     43.56     48.31   1365.58
  3     1666.95    932.54     43.56     48.31   1361.53
  4     1666.80    927.42     43.67     48.31   1362.87

 _____________________     Evaluation    ____________________________

  -----------------------------------------------------
  Gsolv and Hsolv ref. state: [1 M gas/solution] 
  G_solv (incl.RRHO)      = 1283.96 kcal/mol
  H_solv (incl.RRHO)      = 1586.90 kcal/mol
  -----------------------------------------------------

Unusually, in this 13th run the CFF algorithm failed for one of the 4 final clusters, causing a very different (low) Esolv, and one of the solvent clusters is missing, thereby causing Gsolv & Hsolv to be incorrectly +ve. In this case, 26 structures remain within the 3.00 kcal/mol window for GFN2-xTB optimization, from which the final 4 clusters are selected.

By comparison, in each of the 3 other --xtbiff runs (Set 1) the CFF algorithm creates 29 waters to generate the reference solvent cluster. For example, the 1st (solvent growth) of the runs of Set 1:

_________________     Solvent Cluster Generation   _____________________

 Method for CFF: GFN2-xTB

  =========================================
  |   quantum cluster growth: CFF         |
  =========================================

  CUT-FREEZE-FILL Algorithm to generate reference solvent cluster
  now adding solvents to fill cluster...
  Size  Cluster   E /Eh        De/kcal   Detot/kcal  Opt
  ------------------------------------------------------------------------
 101       1   -514.080641    -13.38      -13.38    tight
 101       2   -514.083863    -27.70      -27.70    tight
 101       3   -514.115928    -27.86      -27.86    tight
 101       4   -514.091095    -26.94      -26.94    tight
  ------------------------------------------------------------------------
 102       1   -519.207778    -35.51      -48.90    tight
 102       2   -519.201740    -29.70      -57.40    tight
 102       3   -519.227590    -25.80      -53.66    tight
 102       4   -519.198655    -23.23      -50.17    tight
...
 ------------------------------------------------------------------------
 128       1   -651.906055    -12.66     -591.14    tight
 128       2   -651.869253    -12.13     -580.35    tight
 128       3   -651.882041    -27.68     -568.41    tight
 128       4   -651.881164    -19.36     -582.52    tight
  ------------------------------------------------------------------------
 129       1   -657.005007    -17.83     -608.97    tight
 129       2   -656.964818    -15.70     -596.05    tight
 129       3   -656.969850    -10.83     -579.24    tight
 129       4   -656.968291    -10.41     -592.93    tight
  ------------------------------------------------------------------------
  volume filled
  Starting optimizations + SP  of structures
  4 jobs to do.

  done.

  Cluster   E /Eh        Density  Efix       R   av/act. Surface   Opt
    1      -509.306207   1.230    0.000     9.6  18.1     6257.3   tight
    2      -509.275052   1.217    0.000     9.8   5.8     6460.3   tight
    3      -509.278954   1.224    0.000     9.7  16.4     6412.5   tight
    4      -509.277745   1.225    0.000     9.4  15.4     6190.4   tight

  ------------------------------------------------------------------------
  ------------------------------------------------------------------------
  Boltz. averaged energy of final cluster:
       G /Eh     : -509.30620736
       T*S /kcal :  -0.000

  Solvent cluster generation finished.
  Results can be found in solvent_cluster directory
  Structures in file <crest_ensemble.xyz>
  Energies in file <cluster_energy.dat>
  Population in file <population.dat>

  =========================================
  |   quantum cluster growth: ESOLV       |
  |                                       |
  |          -144.72 kcal/mol             |
  =========================================

  =========================================
  |          Frequency evaluation         |
  =========================================

   SOLUTE MOLECULE
  Starting reoptimizations + Frequency computation of ensemble
  1 jobs to do.

  done.
   SOLUTE CLUSTER
  Starting reoptimizations + Frequency computation of ensemble
  4 jobs to do.

  done.
   SOLVENT CLUSTER
  Starting reoptimizations + Frequency computation of ensemble
 4 jobs to do.

 done.

  Solute Gas properties
 #       H(T)       SVIB      SROT       STRA      G(T)
      [kcal/mol]    [      cal/mol/K        ]    [kcal/mol]
 --------------------------------------------------------
     530.91    201.34     35.36     46.55    446.46

   Solute cluster properties
  #       H(T)       SVIB      SROT       STRA      G(T)
       [kcal/mol]    [      cal/mol/K        ]    [kcal/mol]
  --------------------------------------------------------
  1     2206.70   1194.45     45.34     49.63   1822.26
  2     2206.56   1212.39     45.35     49.63   1816.77
  3     2204.63   1205.47     45.35     49.63   1816.90
  4     2206.52   1211.20     45.33     49.63   1817.09

   Solvent cluster properties
  #       H(T)       SVIB      SROT       STRA      G(T)
       [kcal/mol]    [      cal/mol/K        ]    [kcal/mol]
  --------------------------------------------------------
  1     1666.62    906.04     43.54     48.31   1369.09
  2     1667.85    926.68     43.69     48.31   1364.13
  3     1667.17    920.76     43.62     48.31   1365.24
  4     1667.19    925.76     43.61     48.31   1363.77

_____________________     Evaluation    ____________________________

 -----------------------------------------------------
 Gsolv and Hsolv ref. state: [1 M gas/solution] 
 G_solv (incl.RRHO)      = -139.91 kcal/mol
 H_solv (incl.RRHO)      = -137.44 kcal/mol

For comparison with the previous 4 sets, I'll complete this 5th set of 3 runs by running two restarts with the cluster grown in the 13th run ...

cplett commented 1 year ago

Dear Alan, Thank you very much for the detailed description and the great investigation. I'll try to answer everything you might be concerned with: 1.) If the frequencies are similar, this does not directly mean that the structures are similar. If you have many water molecules, the frequency contribution will be quite similar even if the intermolecular geometries of the water differ. 2.) An exception is if the structure, and thus the number of H-bonds, differ a lot or if the number of molecules is different. This is exactly what happened in the case where the CFF didn't fill the hole. However, it seems that the CFF should fill some water molecules into the cavity. I'm a little bit worried that some calculation broke down and the CFF took it as converged, even though no water was filled inside the cavity. This leads to a wrong and unphysical reference ensemble and thus to wrong values. I will try to check this and fix this. 3.) That the ALPB didn't change the values much could be due to the fact that something like a first solvation shell is present. Then, the difference in the ALPB-water interactions will be similar for the reference (pure water) ensemble and the solute-solvent ensemble. However, I'm not 100% sure and will also check on this. 4.) Usually, the aISS should be a factor of 2 faster than the xtbiff. Maybe something with the parallelization went wrong. I will also check on this. 5.) That the multiple of 25 structures results after the ensemble step seems to be a result of the following: such a large cluster will have a huge conformational space and it is really difficult to find the lowest energy structures as the minima on the potential energy surface are separated by shallow barriers (rotating a water molecule does not cost as much energy as rotating around an intramolecular bond). It is probably the case that the ensemble run was not extensive enough to find the energy minimum. Therefore, as the ensemble search reaches the final MD iteration (3 MDs on the lowest structures found so far), a new energy minimum and thus 25 different conformers are found per MD. Depending on how similar the courses of the MDs are, 25, 50, or 75 structures resulted within the respective energy window. 6.) This is most probably also the reason why the enthalpies of solvation differ. Depending on which local conformer was found, the final energies will differ and also the CFF might add a different number of water molecules into the different cavities. I noticed this problem already and I'm currently working on better defaults. The current M(T)D temperatures are probably too high and also the time step of taking snapshots must be increased as the PES has many minima that are separated by low barriers. 7.) You are right that neither crest nor xtb does a consistency check for the number of unpaired electrons. I will suggest your idea to implement at least a warning. I will report back when I find something new.

github-actions[bot] commented 8 months ago

This issue had no activity for 6 months. It will be closed in 1 week unless there is some new activity.