choderalab / espaloma

Extensible Surrogate Potential of Ab initio Learned and Optimized by Message-passing Algorithm 🍹https://arxiv.org/abs/2010.01196
https://docs.espaloma.org/en/latest/
MIT License
203 stars 23 forks source link

scripts to prepare denali dataset for training #118

Open jeherr opened 2 years ago

jeherr commented 2 years ago

After downloading the orbnet denali dataset, first I run make_chembl_dict.py which grabs all the necessary information from the files provided by the orbnet people.

Then match_smiles.py is a script which iterates over the dictionary built by the first script and retrieves the canonical SMILES from ChEMBL that corresponds to the ChEMBL ID provided with the dataset. It then attempts to build a graph from the coordinates of the lowest energy snapshot for that molecule. If the SMILES from ChEMBL does not match the SMILES from OEChem determined by creating the graph, then we ignore this molecule. After that all snapshots are checked such that they return the same SMILES string as the minimum energy snapshot. Only snapshots which do not match are thrown away. It also filters out snapshots that are more than 0.1 Hartree higher in energy than the minimum energy snapshot for each molecule.

Finally transform.py is a small script which subtracts off nonbonded forces from the reference energies. This again filters any snapshots which are greater than 0.1 Hartree in energy higher than the minimum energy snapshot. The result of this script is vastly different energies for most snapshots, and thus the vast majority of the dataset is filtered by this last script.

lgtm-com[bot] commented 2 years ago

This pull request introduces 4 alerts when merging 1be88601f7d1307df283b6cd71714c3311e59f74 into 4c769be037ed1a7ac4c0b2d5e2e7b0c521decc96 - view on LGTM.com

new alerts:

jchodera commented 2 years ago

Then match_smiles.py is a script which iterates over the dictionary built by the first script and retrieves the canonical SMILES from ChEMBL that corresponds to the ChEMBL ID provided with the dataset. It then attempts to build a graph from the coordinates of the lowest energy snapshot for that molecule. If the SMILES from ChEMBL does not match the SMILES from OEChem determined by creating the graph, then we ignore this molecule.

@jeherr : Do you know how many molecules are filtered out by this? I wouldn't expect the SMILES from ChEMBL to match the SMILES from OEChem for multiple reasons:

  1. SMILES strings are not unique: Multiple strings map to the same molecule. It looks like you do canonicalize the SMILES with OpenEye OEMolToSmiles() to avoid this.
  2. The same real chemical compound in solution may have multiple different canonical forms that involve different protonation or tautomeric states. You'd also have to pick the same canonical protonation/tautomeric form of the molecules to ensure they're the same. It's not clear to me that you're doing this.
  3. Aromaticity models may differ.

I think the correct approach here is to ignore the ChEMBL SMILES (though you can probably keep that as metadata) and instead just check that all the snapshots for what should be a single protonation/tautomeric species have the same OpenEye perceived SMILES. That is sufficient to guarantee consistency. We can then label the molecule using the Open Force Field canonical isomeric tagged smiles with

# Convert to OpenFF Molecule
from openff.toolkit.topology import Molecule
offmol = Molecule.from_openeye(oemol)
# Generate tagged canonical isomeric SMILES with explicit hydrogens so we can reconstruct topology
tagged_smiles = offmol.to_smiles(isomeric=True, explicit_hydrogens=True, mapped=True)
jeherr commented 2 years ago

Matching the ChEMBL SMILES string only filters about 2000 unique molecules from the dataset. There is something like 98,000 left so it's a pretty insignificant chunk. Even then a fair number of them are due to some minor bug when parsing the data which I don't think is related to matching the SMILES string that I haven't bothered to fix yet.

To be clear, at this point I am still only loading the ChEMBL conformers, so there should be no protonated or tautomeric states, correct?

I don't perceive aromaticity explicitly, but I believe that is the default behavior of OEFindRingAtomsAndBonds? I could be wrong. I'm not sure if it matters here or not, unless this will change the graph used by espaloma? At a minimum it doesn't matter for matching the ChEMBL SMILES to the coordinates since the vast majority of molecules aren't filtered out by this.

codecov-commenter commented 2 years ago

Codecov Report

Merging #118 (4c769be) into master (4c769be) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 4c769be differs from pull request most recent head 1be8860. Consider uploading reports for the commit 1be8860 to get more accurate results

jeherr commented 2 years ago

Just another update to this. If I train on the denali dataset without running subtract_nonbonded_cutoff function, after a short while I am already getting an RMSE on the test set of about 17 kcal/mol. Compared to the same after running the subtract_nonbonded_cutoff I get an RMSE approaching 700,000 kcal/mol. This is similar to the ANI-1 dataset where I got an RMSE around 9 kcal/mol without the nonbonded forces subtracted.

It seems like something is off with the nonbonded cutoff and this data. Maybe I should examine some cases and see what is contributing to large discrepancies in the energies for different snapshots of the same molecule?

jeherr commented 2 years ago

I've started looking at single cases to see what is happening. I've put two snapshots of the same molecule which have drastically different energies after subtracting the non-bonded forces below.

Snapshot 1
N     -8.137341  -10.078748    0.727212
S     -9.272774  -10.206999    1.942847
O    -10.192329  -11.231987    1.530874
O     -8.544293  -10.341575    3.171461
C    -10.196381   -8.680454    2.002665
C    -11.572171   -8.767362    1.865210
C    -12.347156   -7.613406    1.927217
C    -13.833607   -7.608441    1.766267
O    -14.441751   -6.603974    1.447415
N    -14.453722   -8.793941    1.979009
C    -15.906223   -8.832888    1.954126
C    -16.430710   -8.911547    0.515027
S    -15.849838  -10.383352   -0.365994
C    -16.338855  -10.024023    2.786574
O    -15.581322  -10.756592    3.361766
O    -17.661348  -10.167334    2.817408
C     -3.663894   -1.413686    0.015590
S     -3.867026    1.282770   -1.983705
C     -4.212282    1.058678   -0.216476
O     -2.969383   -2.351723    0.619511
O     -4.538483   -1.626022   -0.778272
C    -11.717601   -6.380254    2.111839
C    -10.344483   -6.300705    2.254807
C     -9.568659   -7.456325    2.202649
Cl    -7.866182   -7.331511    2.359206
H     -8.603466   -9.957365   -0.159179
H     -7.490783   -9.336768    0.957578
H    -12.018846   -9.739729    1.695387
H    -14.000905   -9.569994    2.443220
H    -16.296905   -7.913486    2.415461
H    -16.136555   -8.000773   -0.009781
H    -17.518777   -8.974446    0.534714
H    -14.553864  -10.144856   -0.194792
H    -17.888951  -10.946867    3.351314
H    -12.325914   -5.482924    2.140825
H     -9.859410   -5.343641    2.403020

Snapshot 2
N      5.281235   -0.390297    1.545866
S      4.188246   -1.312862    0.688565
O      3.637047   -2.258482    1.616260
O      4.868465   -1.765956   -0.490135
C      2.851631   -0.243136    0.177258
C      1.573471   -0.564881    0.593512
C      0.487565    0.207089    0.178140
C     -0.859838   -0.189154    0.671272
O     -0.998203   -0.941722    1.633517
N     -1.914766    0.324512    0.015887
C     -3.281233    0.032451    0.427677
C     -3.663894   -1.413686    0.015590
S     -3.867026    1.282770   -1.983705
C     -4.212282    1.058678   -0.216476
O     -2.969383   -2.351723    0.619511
O     -4.538483   -1.626022   -0.778272
C      0.713964    1.312106   -0.643747
C      1.998994    1.636482   -1.044474
C      3.081762    0.859694   -0.639229
Cl     4.658862    1.303600   -1.135424
H      4.842639   -0.034739    2.381792
H      5.661249    0.323810    0.940430
H      1.408775   -1.421712    1.237476
H     -1.846612    0.753742   -0.903202
H     -4.062904    2.035406    0.244864
H     -4.289576    0.082539   -2.353096
H     -5.247340    0.749656   -0.066097
H     -2.238753   -1.984071    1.191253
H     -3.327854    0.094017    1.523736
H     -0.109725    1.942046   -0.960098
H      2.173969    2.499843   -1.674868

and so we can inspect which atoms are experiencing large non-bonded forces I'll past them below

Snapshot 1
N        1.08713393e-01, -1.80365020e+01, -1.06772583e+01
S       -1.55115469e+00, -1.75309192e+01, -1.81559129e+00
O        2.71087708e+01, -2.92941476e+01, -3.55273107e+00
O       -1.25924874e+00, -1.49534786e+01,  4.35116742e+00
C        1.72694523e+01, -2.54408385e+01, -1.16534965e+00
C       -7.31483839e-01, -1.17438189e+01, -2.97380732e+00
C       -2.50730084e+01, -6.53081686e-01, -2.52240816e+00
C        7.39724133e+00,  2.42169036e+00,  1.33375255e+00
O        4.66518456e+00,  3.16932680e+01, -1.27932012e+01
N       -5.26335684e+00,  3.69518830e+01, -5.93858951e+00
C       -1.67287037e+01, -2.43316501e+01,  5.54309663e+00
C       -3.10616713e+00, -4.18957633e-01, -1.75819347e+01
S       -3.53078258e+00, -8.42044498e+00, -2.37673806e+01
C       -1.74536384e+00, -7.57452027e-01,  1.17303258e+01
O       -1.67907545e+01, -2.98590940e+01,  2.30145397e+01
O       -1.06414537e+01, -1.29624697e+01,  2.26671074e+01
C       -6.05332432e-01,  2.74454141e+01,  5.10051942e+00
C        1.24801373e+01,  2.47739206e+01,  3.94291617e+00
C        2.35665815e+01,  6.49913025e+00,  2.96059208e+00
Cl        1.59499040e+01,  4.15579228e+01,  9.18904995e+00
H        3.10060769e-02, -6.27078584e-03, -2.09195364e-02
H        2.59355243e-01, -1.19402902e+00, -8.54459856e-01
H       -1.39510127e+01,  1.13163442e+01, -4.88075598e-02
H        1.07663862e+00,  7.52303962e-01, -5.35402960e-01
H       -1.26884913e+01, -3.94495217e+00,  6.50557985e+00
H       -6.31769561e+00, -2.00113592e+00, -6.61315273e+00
H       -9.55284990e-01,  4.45933319e+00, -9.60420233e+00
H        1.30681891e-02, -2.53528393e-01, -4.00126939e-01
H        3.25564995e-02, -4.66902995e-04, -1.18590989e-03
H        1.30568649e+01,  1.04302218e+01,  4.31284728e+00
H       -2.07618002e+00,  3.50181041e+00,  2.15014264e-01

Snapshot 2
N        5.46038266e+00, -9.63508576e+00,  1.55716531e+01
S        2.81964001e+00, -1.49633587e+01,  9.52096592e+00
O        2.85787345e+01, -1.53234853e+01,  8.20003093e+00
O        3.44476519e+00, -1.80781107e+01,  3.17356761e+00
C        2.31749002e+01, -1.64301635e+01,  8.55370902e+00
C        1.91586883e+00, -1.22695309e+01,  5.64867443e+00
C       -2.32916278e+01, -5.68863209e+00,  7.47058317e+00
C        1.10379330e+01,  1.28215818e+01,  2.48472211e+00
O        7.21082671e+01,  5.07010108e+01,  5.68877615e+01
N       -2.93001730e-01, -1.28763215e+01,  2.76115499e+01
C       -2.07217226e+01,  8.77622021e+00, -1.04717525e+01
C        8.73345216e+05,  5.22319822e+05,  8.00164449e+05
S        3.01915827e+03,  1.98327480e+03, -1.56077861e+04
C        3.63654868e+05,  1.07514702e+05, -3.77175989e+04
O        1.64081739e+05, -2.78452058e+05,  1.45369037e+05
O       -1.06809378e+06, -2.59358883e+05, -9.69476410e+05
C       -5.58807722e+00,  2.32571184e+01, -1.32056476e+01
C        4.50758663e+00,  2.28002519e+01, -1.69483804e+01
C        2.02299488e+01,  9.93055139e+00, -9.58394346e+00
Cl        2.93157973e+00,  3.71750775e+01, -2.75151677e+01
H        3.09415183e-02,  1.02943739e-02,  1.46294433e-02
H        5.41352282e-01, -4.81166800e-01,  1.06094007e+00
H       -6.09970801e+00,  1.61026559e+00, -3.13332858e+00
H        2.73499004e+00, -9.78060380e-01,  1.39562532e+00
H        2.13423148e+01,  3.96127332e+01,  4.42657342e+01
H        2.44755828e+00,  1.43240134e+01, -3.63964394e+01
H       -3.66718869e+05, -1.09496850e+05,  5.32708185e+04
H        3.05593533e+04,  1.53638623e+04,  2.39299946e+04
H        2.97125216e-02, -2.55892519e-02, -1.31012520e-02
H        7.96437073e+00,  9.40390755e+00, -5.71203800e+00
H       -2.99323150e+00,  2.45890254e+00, -1.38797877e+00

and just to make it easier to discuss here I'll attach a screenshot of each molecule

Snapshot 1 Screen Shot 2022-05-02 at 12 11 02 PM (2) copy

Snapshot 2 Screen Shot 2022-05-02 at 12 14 45 PM (2)

The non-bonded energy for the two snapshots are 102.526 kcal/mol and 228050.667 kcal/mol respectively. Obviously something is wrong here! The large forces are located on atoms towards the end of the tail coming off of the ring, primarily the two terminating carbons, the sulfur, the oxygens, and a few of the hydrogen. It appears rotation around the carbonyl is causing the large change in structure which leads to the drastic change in non-bonded energies.

jeherr commented 2 years ago

Just to add, there is nothing obvious to me why the energy of the second snapshot should be drastically higher than the first snapshot at the QM level, so my intuition is that something is not well described by the force field.