MolecularAI / Chemformer

Apache License 2.0

Interpreting Chemformer Predictions #11

Closed salman-moh closed 1 year ago

salman-moh commented 1 year ago

This is the output on the USPTO-15K dataset. I have a few questions about it (I'm new to the pharma domain):

  1. It looks like this is a retrosynthesis prediction task?
  2. But what's with the question marks in the predictions?
  3. If it is indeed retrosynthesis prediction, why are there exactly 10 predicted simpler molecules for every input? I would expect some molecules in the USPTO-15K dataset to break down into fewer than 10 simpler molecules, rather than always exactly 10.

The columns of the output file:

| original_smiles | prediction_0 | log_likelihood_0 | prediction_1 | log_likelihood_1 | prediction_2 | log_likelihood_2 | prediction_3 | log_likelihood_3 | prediction_4 | log_likelihood_4 | prediction_5 | log_likelihood_5 | prediction_6 | log_likelihood_6 | prediction_7 | log_likelihood_7 | prediction_8 | log_likelihood_8 | prediction_9 | log_likelihood_9
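For anyone inspecting a file with these columns, here is a minimal sketch of how one wide row could be unpacked into ranked (prediction, log-likelihood) pairs and checked for question marks. The column names come from the header above. The tab-separated layout, the sample values, and the helper names `unpack_row` and `has_question_mark` are assumptions for illustration, not part of Chemformer.

```python
import csv
import io

def unpack_row(row, n_predictions=10):
    """Turn one wide output row into a list of (smiles, log_likelihood) pairs."""
    pairs = []
    for i in range(n_predictions):
        smiles = row.get(f"prediction_{i}")
        score = row.get(f"log_likelihood_{i}")
        if not smiles or score in (None, ""):
            continue  # column missing or empty in this file
        pairs.append((smiles, float(score)))
    # Highest log-likelihood first
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def has_question_mark(smiles):
    """Flag predictions containing '?', like the ones discussed in this issue."""
    return "?" in smiles

# Made-up sample with two predictions (illustrative values, not real model output)
sample = (
    "original_smiles\tprediction_0\tlog_likelihood_0\tprediction_1\tlog_likelihood_1\n"
    "CCO\tC?O\t-3.2\tCCO\t-0.5\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
ranked = [unpack_row(row) for row in reader]
```

Each input always gets the same number of prediction columns because they are ranked alternatives for the whole output string, not one column per fragment.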

image

EBjerrum commented 1 year ago

You probably need to strip the atom mapping from your input SMILES; it seems the code doesn't understand the input.
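A minimal sketch of stripping atom-map numbers with a regular expression, assuming the mapping appears in the usual `[CH3:1]` bracket form. `strip_atom_mapping` is a hypothetical helper, not something shipped with Chemformer.

```python
import re

# Matches the ":<digits>" atom-map suffix right before a closing bracket, e.g. "[CH3:1]"
_ATOM_MAP = re.compile(r":\d+(?=\])")

def strip_atom_mapping(smiles):
    """Remove atom-map numbers from bracket atoms, leaving everything else as-is."""
    return _ATOM_MAP.sub("", smiles)

mapped = "[CH3:1][CH2:2][OH:3]"
unmapped = strip_atom_mapping(mapped)

# Reaction SMILES are handled the same way, since only bracket contents change
mapped_reaction = "[CH3:1][OH:2]>>[CH3:1][O-:2]"
unmapped_reaction = strip_atom_mapping(mapped_reaction)
```

The result still carries the brackets, so it is valid but not canonical. With RDKit installed, an alternative is to parse the molecule, call `SetAtomMapNum(0)` on each atom, and write the SMILES back out.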

salman-moh commented 1 year ago

OK, I've switched the dataset, but the problem remains. Am I missing something?

image

EBjerrum commented 1 year ago

Yes, the model is indeed outputting gibberish. Did you train it at all, or did you use the pretrained weights? By default the top 10 predictions are output, I guess; they may or may not correspond to the same molecule (e.g. alternative non-canonical SMILES).
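To check whether several of the top predictions are the same molecule written differently, one option is to group them by canonical SMILES. The sketch below is an assumption-laden illustration: `group_by_molecule` and `rdkit_canonical` are hypothetical helpers, keeping the best log-likelihood per molecule is one possible choice, and the default canonicalizer needs RDKit installed.

```python
def rdkit_canonical(smiles):
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # imported lazily so the rest works without RDKit
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def group_by_molecule(predictions, canonicalize=None):
    """Collapse predictions that denote the same molecule.

    predictions: list of (smiles, log_likelihood) pairs.
    canonicalize: callable returning a canonical SMILES, or None if unparsable.
    """
    if canonicalize is None:
        canonicalize = rdkit_canonical
    best = {}
    for smiles, score in predictions:
        key = canonicalize(smiles)
        if key is None:
            continue  # skip strings that are not valid SMILES
        if key not in best or score > best[key]:
            best[key] = score
    # Highest log-likelihood first
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

If all ten predictions collapse to one entry, the model is returning alternative spellings of a single molecule; if nothing survives, none of the strings were parsable.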

salman-moh commented 1 year ago

No, I simply used the pretrained weights: the ckpt in the pretrained folder.

EBjerrum commented 1 year ago

Which ckpt?

salman-moh commented 1 year ago

models/pre-trained/combined/step=1000000.ckpt from https://az.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq

EBjerrum commented 1 year ago

If you want to do retrosynthesis, you should use one of the fine-tuned ones. For retrosynthesis it's the uspto_50 one, but be aware that it's only trained on the 10 reaction classes from the USPTO-50K dataset. That being said, it's still strange that you see the ? when using the pretrained model. It should be able to predict the same molecule, but in a different SMILES form. Do you have spaces in your SMILES? Some people pretokenize that way; we do not, as we augment and then tokenize. So no spaces: it should be SMILES that you can run `from rdkit import Chem` and `Chem.MolFromSmiles(smiles)` on.
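A minimal sketch of that check: remove the spaces left by pretokenization, then confirm that RDKit can parse the result. `remove_spaces` and `parses_with_rdkit` are hypothetical helper names, and the second one assumes RDKit is installed.

```python
def remove_spaces(smiles):
    """Drop all whitespace, undoing space-separated pretokenization."""
    return "".join(smiles.split())

def parses_with_rdkit(smiles):
    """True if RDKit can build a molecule from the SMILES string."""
    from rdkit import Chem  # imported lazily; requires RDKit
    return Chem.MolFromSmiles(smiles) is not None

tokenized = "C C ( = O ) O"
clean = remove_spaces(tokenized)
```

Running `parses_with_rdkit(clean)` over the whole input file before prediction would catch malformed entries early.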

salman-moh commented 1 year ago

Yeah, there are spaces in those SMILES. Got it, I will make sure RDKit can parse the SMILES first then.

For now, I'm trying to use Chemformer to predict compound solubility, i.e. the ESOL dataset downstream task. How would I need to change the DataModule/Model/Prediction pipeline?

EBjerrum commented 1 year ago

Maybe a presanitization of input SMILES could be great to have in the long run.

You will need to finetune the encoder of the pretrained Chemformer. Did we put an example there on how to do that? It was Spyridon who worked on that part.

salman-moh commented 1 year ago

Do you not have the fine-tuned model available? I'm talking about this one, "combined":

image

This is the fine-tuned model folder:

image

Which one of these is "Combined" from Table 3 in the paper?

EBjerrum commented 1 year ago

I don't work at AZ anymore, so I don't have access to the files.

salman-moh commented 1 year ago

These are only the fine-tuned weights for the direct synthesis prediction and retrosynthesis prediction tasks. For all other Seq2Seq / discriminative tasks we will need to curate the dataset and fine-tune ourselves. As I want to do molecular property prediction, I guess I would need to curate the data as mentioned here:

image

and then fine-tune (from the regression folder).

Anyway, I will try predicting with proper SMILES, check whether the prediction works for the retrosynthesis prediction task, and update this issue.

Rojina99 commented 6 months ago

@salman-moh how did you tackle this issue? I am having a similar issue with my dataset. It is giving results with ? in between:

C=[N+]=[N-].C?(C)(C)[O-].N#CN?'