daenuprobst / molzip

The gzip classification method implemented for molecule classification.
MIT License

TODO (for me :) ) #10

Open janweinreich opened 11 months ago

janweinreich commented 11 months ago

If useful, I can add regression benchmarks on reaction barriers (TS) for several datasets (such as those in https://zenodo.org/record/6937747).

EricBoittier commented 11 months ago

This is an excellent idea, similar to the docking/activity measurements suggestion. I'm not sure if we want to limit ourselves to classification tasks, based on the original paper, but if there were a dataset where we could make a balanced "active"/"inactive" split, maybe that would be a place to start? I would be really disgusted if this approach worked better than the SOTA for atomistic regression problems, but that would be really exciting too!

janweinreich commented 11 months ago

Great, I started working on the SN2-20 dataset (https://iopscience.iop.org/article/10.1088/2632-2153/aba822/meta). It was not published as SMILES as far as I know, so I first have to transform it to SMILES, then to reaction SMILES, and finally compare the performance with drfp. Let's see how good we are versus https://iopscience.iop.org/article/10.1088/2632-2153/ac8f1a/meta :)

janweinreich commented 11 months ago

Ok to summarize: the performance is not great! Below I evaluate the performance on a subset of SN2-20 reaction energies.

[Figure: learning_curve]

REACT_SMILES are reaction SMILES, e.g. for SN2: [H]C([H])([H])C([H])([H])F.H>>[H]C([H])([H])C([H])([H])[H].F

However, we definitely see better performance with kernel ridge regression (KRR) than with KNN. I was surprised to see that a string-based representation of the reaction fingerprints (FPS) by @daenuprobst, used with gzip, showed no learning (FPS-KRR). The FPS simply converts the binary vector [0,1,0,...] to a string "010...". But it is quite possible I made a mistake in the code!
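For readers following along, here is a minimal sketch of the gzip distance and a KNN baseline of the kind compared above. It is not the code from the molzip repository: the helper names (`clen`, `ncd`, `knn_predict`), the space used as separator, and the choice of averaging the k nearest targets are all assumptions made for illustration.

```python
import gzip


def clen(s: str) -> int:
    """Length in bytes of the gzip-compressed string."""
    return len(gzip.compress(s.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)  # separator is an assumption of this sketch
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query, train_strings, train_targets, k=3):
    """Average the targets of the k training strings closest to the query (assumed KNN variant)."""
    ranked = sorted(zip((ncd(query, s) for s in train_strings), train_targets))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# The two reaction SMILES quoted in this thread.
rxn_a = "[H]C([H])([H])C([H])([H])F.H>>[H]C([H])([H])C([H])([H])[H].F"
rxn_b = (
    "[H]C([H])([H])[C@](F)([N+](=O)[O-])C([H])(C#N)C#N.H>>"
    "[H]C([H])([H])[C@@]([H])([N+](=O)[O-])C([H])(C#N)C#N.F"
)
print("NCD between the two reactions:", ncd(rxn_a, rxn_b))
```

The same `ncd` function can be applied unchanged to the FPS bit strings, since it only sees text.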

I have pushed that to

https://github.com/janweinreich/molzip/tree/main/drafts/molzip_react

as well as a processed form of the reaction smiles, extracted from the original xyz data in https://github.com/janweinreich/molzip/blob/main/drafts/molzip_react/reaction_SN2-20.csv

PowersPope commented 11 months ago

For the FPS dataset, is it the compressed SMILES, or what do you mean by a string-based representation of the reaction?

Also, is the x-axis epochs?

janweinreich commented 11 months ago

1) For the FPS ("fingerprints"), I first compute the bit representation using

fps, mapping = DrfpEncoder.encode(REACT_SMILES, mapping=True, n_folded_length=512)

resulting in an array like this:

array([['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0'],
       ...,
       ['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0'],
       ['0', '0', '0', ..., '0', '0', '0']], dtype='<U3')

with shape (2670, 512)

Note that I have already converted all entries to str so they can be used with gzip just as we do with SMILES. Next, since gzip can only compare "global" strings, i.e. one string per molecule, I join all the zeros and ones for each molecule with

FPS_single_smiles = np.array([''.join(row) for row in fps])

resulting in something like '0001000.....00000000000000000000'

With REACT_SMILES, on the other hand, I mean the compressed reaction SMILES, such as: [H]C([H])([H])[C@](F)([N+](=O)[O-])C([H])(C#N)C#N.H>>[H]C([H])([H])[C@@]([H])([N+](=O)[O-])C([H])(C#N)C#N.F

2) the x-axis is the number of training examples used for the regression method, in this case kernel ridge regression https://github.com/daenuprobst/molzip/blob/a51cc884b5f555c1dcd9ed8f16d0e0758171fc20/gzip_regressor.py#L72
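The two steps described in this comment (bit rows joined into one string each, then kernel ridge regression on gzip distances) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of gzip_regressor.py: the kernel `exp(-NCD / sigma)`, the symmetrization of the NCD, the hand-written linear solver, and all function names are choices made here.

```python
import gzip
import math


def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    ca, cb = clen(a), clen(b)
    return (clen(a + " " + b) - min(ca, cb)) / max(ca, cb)


def bits_to_string(row) -> str:
    """Join one fingerprint row, e.g. [0, 1, 0], into the string '010'."""
    return "".join(str(bit) for bit in row)


def kernel(a: str, b: str, sigma: float = 1.0) -> float:
    # Assumed kernel: Laplacian-style on the symmetrized NCD.
    # gzip NCD is not exactly symmetric, so both orders are averaged.
    return math.exp(-0.5 * (ncd(a, b) + ncd(b, a)) / sigma)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def krr_fit(train_strings, targets, lam=1e-3, sigma=1.0):
    """Return the weights alpha solving (K + lam * I) alpha = y."""
    K = [[kernel(a, b, sigma) for b in train_strings] for a in train_strings]
    for i in range(len(K)):
        K[i][i] += lam
    return solve(K, list(targets))


def krr_predict(query, train_strings, alpha, sigma=1.0):
    """Predict as the alpha-weighted sum of kernels to the training strings."""
    return sum(w * kernel(query, s, sigma) for w, s in zip(alpha, train_strings))
```

A learning curve like the one discussed here would then come from calling `krr_fit` on growing subsets of the training strings and scoring `krr_predict` on a fixed held-out set.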

PowersPope commented 11 months ago

Ah thanks for the explanation! I was confused at first, but this cleared it up. Hmmm interesting results.

I imagine the FPS encodings did worse because those strings all have similar lengths, so the NCD values showed no significant differences? The loss got bigger as you included more samples, which leads me to believe it had an even harder time telling samples apart since they were so alike. Did you look at the predictions for the task? If they are all a similar number, that would support what I was saying.
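One way to check this hypothesis is to look at how much the pairwise NCD values actually vary, and whether the predictions collapse to nearly one number. The sketch below is only a diagnostic idea: the function names and the ratio of standard deviations are assumptions for illustration, not something taken from the repository.

```python
import gzip
import itertools
import statistics


def clen(s: str) -> int:
    return len(gzip.compress(s.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    ca, cb = clen(a), clen(b)
    return (clen(a + " " + b) - min(ca, cb)) / max(ca, cb)


def ncd_spread(strings):
    """Return (min, mean, max) of the NCD over all unordered pairs."""
    distances = [ncd(a, b) for a, b in itertools.combinations(strings, 2)]
    return min(distances), statistics.mean(distances), max(distances)


def compressed_length_range(strings):
    """Return (shortest, longest) gzip-compressed length in bytes."""
    lengths = [clen(s) for s in strings]
    return min(lengths), max(lengths)


def prediction_collapse_ratio(predictions, targets):
    """Std of predictions divided by std of targets.

    A value near 0 means the model predicts almost the same number
    for every sample; a value near 1 means a comparable spread.
    """
    return statistics.pstdev(predictions) / statistics.pstdev(targets)
```

Running `ncd_spread` once on the FPS strings and once on the reaction SMILES would show directly whether the fingerprint strings sit in a much narrower band of distances.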

For REACT_SMILES, my initial thought is that the mean of the compressed reactions retains the important bits, and the strings still vary enough in length to help with the regression task. I don't use SMILES on a regular basis (more of a .pdb type of guy). How exactly do you take a SMILES string and take the mean of it?

janweinreich commented 11 months ago

Sorry for the delay! Yes, it could be that padding the vectors to the same length makes the compression ineffective!

I used the xyz2mol script (https://github.com/jensengroup/xyz2mol) to transform the xyz files to rdkit objects here, which could then easily be transformed to SMILES. I did not augment the SMILES but simply used canonical SMILES with hydrogen (since here hydrogen also takes part in the reaction).