janweinreich opened this issue 1 year ago
This is an excellent idea, similar to the docking/activity measurements suggestion. I'm not sure if we want to limit ourselves to classification tasks, based on the original paper, but if there were a dataset where we could make a balanced "active"/"inactive" split, maybe that would be a place to start? I would be really disgusted if this approach worked better than the SOTA for atomistic regression problems, but that would be really exciting too!
Great! I started working on the SN2-20 dataset (https://iopscience.iop.org/article/10.1088/2632-2153/aba822/meta). As far as I know it was not published as SMILES, so I first have to transform it to SMILES, then to reaction SMILES, and finally compare the performance with DRFP. Let's see how good we are versus https://iopscience.iop.org/article/10.1088/2632-2153/ac8f1a/meta :)
Ok to summarize: the performance is not great! Below I evaluate the performance on a subset of SN2-20 reaction energies.
REACT_SMILES are reaction SMILES, e.g. for an SN2 reaction:
[H]C([H])([H])C([H])([H])F.H>>[H]C([H])([H])C([H])([H])[H].F
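For readers less familiar with the notation, a reaction SMILES like the one above can be split into its reactant and product sides with plain string operations (a minimal sketch; the variable names are mine, not from the repo):

```python
# Reaction SMILES use ">>" between the reactant and product sides
# (a general "A>B>C" form has an agents field in the middle) and "."
# between individual molecules on each side.
rxn = "[H]C([H])([H])C([H])([H])F.H>>[H]C([H])([H])C([H])([H])[H].F"

reactant_side, product_side = rxn.split(">>")
reactants = reactant_side.split(".")  # ['[H]C([H])([H])C([H])([H])F', 'H']
products = product_side.split(".")    # ['[H]C([H])([H])C([H])([H])[H]', 'F']

print(reactants)
print(products)
```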
However, we definitely see better performance with kernel ridge regression (KRR) than with KNN. I was surprised to see that a string-based representation of the reaction fingerprints (FPS) by @daenuprobst, used with gzip, didn't show learning (FPS-KRR). The FPS simply converts the binary vector [0, 1, 0, ...] to a string "010...". But it is quite possible I made a mistake in the code!
I have pushed that to
https://github.com/janweinreich/molzip/tree/main/drafts/molzip_react
as well as a processed form of the reaction SMILES, extracted from the original xyz data, in https://github.com/janweinreich/molzip/blob/main/drafts/molzip_react/reaction_SN2-20.csv
For the FPS dataset, is it the SMILES that are compressed, or what do you mean by a string-based representation of the reaction?
Also, is the x-axis epochs?
1) For the FPS ("fingerprints"), I first compute the bit representation using
fps, mapping = DrfpEncoder.encode(REACT_SMILES, mapping=True, n_folded_length=512)
resulting in an array like this:
array([['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0'],
       ...,
       ['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0']], dtype='<U3')
with shape
(2670, 512)
Note that I have already converted all entries to str, so it can be used with gzip just as we do with SMILES.
Next, since gzip can only compare "global" strings, i.e. one string per molecule, I join all the zeros and ones for each molecule with
FPS_single_smiles = np.array([''.join(row) for row in fps])
resulting in something like
'0001000.....00000000000000000000'
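To make the gzip comparison of such bit strings concrete, here is a minimal sketch of a normalized compression distance (NCD) on two equal-length '0'/'1' strings. The `ncd` helper and the random sparse fingerprints are my own illustration (not the repo's code), but they show why fixed-length, statistically similar bit strings can yield nearly indistinguishable distances:

```python
import gzip
import random

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings via gzip."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

random.seed(0)
# Two sparse 512-bit fingerprints as strings, mimicking folded DRFP output
# (roughly 10% ones); both have exactly the same length.
fp1 = "".join(random.choice("0000000001") for _ in range(512))
fp2 = "".join(random.choice("0000000001") for _ in range(512))

print(ncd(fp1, fp1))  # identical strings: small distance
print(ncd(fp1, fp2))  # different but statistically similar strings
```

Since both fingerprints have the same length and similar bit statistics, their compressed sizes are close, so the NCD values cluster in a narrow band across the whole dataset.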
With REACT_SMILES, on the other hand, I mean the compressed reaction SMILES, such as:
[H]C([H])([H])[C@](F)([N+](=O)[O-])C([H])(C#N)C#N.H>>[H]C([H])([H])[C@@]([H])([N+](=O)[O-])C([H])(C#N)C#N.F
2) The x-axis is the number of training examples used for the regression method, in this case kernel ridge regression: https://github.com/daenuprobst/molzip/blob/a51cc884b5f555c1dcd9ed8f16d0e0758171fc20/gzip_regressor.py#L72
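For context, KRR on compression distances can be sketched in a few lines. This is my own minimal illustration, not the repo's `gzip_regressor.py`: the Laplacian kernel on NCD distances is a common choice, and the names `krr_fit_predict`, `sigma`, and `lam` are assumptions for this sketch.

```python
import gzip
import numpy as np

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings via gzip."""
    ca = len(gzip.compress(a.encode()))
    cb = len(gzip.compress(b.encode()))
    cab = len(gzip.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

def krr_fit_predict(train_s, y_train, test_s, sigma=1.0, lam=1e-6):
    """Kernel ridge regression with a Laplacian kernel on NCD distances."""
    # Kernel matrix over the training strings.
    D = np.array([[ncd(a, b) for b in train_s] for a in train_s])
    K = np.exp(-D / sigma)
    # Solve (K + lam*I) alpha = y for the regression weights.
    alpha = np.linalg.solve(K + lam * np.eye(len(train_s)), y_train)
    # Kernel between test and training strings, then predict.
    D_test = np.array([[ncd(a, b) for b in train_s] for a in test_s])
    return np.exp(-D_test / sigma) @ alpha

# Toy usage with made-up strings and targets:
train = ["CCO", "CCCO", "CCCCO", "CCCCCO"]
y = np.array([1.0, 2.0, 3.0, 4.0])
pred = krr_fit_predict(train, y, ["CCO", "CCCCCCO"])
print(pred)
```

Note that the NCD "kernel" is not guaranteed to be positive semi-definite, so the ridge term `lam` also helps keep the linear solve stable.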
Ah thanks for the explanation! I was confused at first, but this cleared it up. Hmmm interesting results.
I imagine that the FPS encodings did worse because those strings were all similar in length, so there were no significant differences when calculating the NCD? The loss got bigger as you included more samples, which leads me to believe it had an even harder time telling the difference between samples since they were so alike. Did you look at the predictions for the task? If they are all a similar number, that might confirm what I was saying.
As for the REACT_SMILES, my initial thought is that the compressed reactions retain the important bits and still vary enough in string length to help with the regression task. I don't use SMILES on a regular basis (more of a .pdb type of guy). How exactly do you take the SMILES string and take the mean of it?
Sorry for the delay! Yes, it could be that padding the vectors to the same length makes the compression ineffective!
I used the xyz2mol script (https://github.com/jensengroup/xyz2mol) to transform the xyz files into RDKit objects, which could then easily be converted to SMILES. I did not augment the SMILES but simply used canonical SMILES with explicit hydrogens (since here hydrogen also takes part in the reaction).
If useful, I can add benchmarks on regression of reaction barriers (TS) for several datasets
(such as those in https://zenodo.org/record/6937747).