MolecularAI / Chemformer

Apache License 2.0
201 stars 35 forks source link

Reproducing Claimed Results for Single-Step Retrosynthesis #19

Closed Arihanth007 closed 1 year ago

Arihanth007 commented 1 year ago

I have trouble reproducing the results of the paper, and I wanted to know if there was anything that I was doing wrong. I evaluated all the fine-tuned models provided (https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq/folder/145316934583) on the USPTO-50k dataset, and the highest accuracy I can report was using the USPTO-SEP model.

Could you please elaborate on how the results were obtained and possibly share the list of reactants it was tested on? I could have somehow chosen a problematic test split that led to poor results.

EBjerrum commented 1 year ago

It sound like you are not using the dataset in https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq/folder/144882141119, when you talk about choosing a split? It is already marked with what samples are test, train and validation in the "set" column. As far as I remember the original USPTO-50K was already split up in the subsets and we kept these splits. If you are preparing your own data, it's important to keep the same formatting as the expected input, i.e. RDKit smiles.

EBjerrum commented 1 year ago

@Arihanth007 we take the conversation here, not over email, and please do provide much more detail on what you have tried. It's hard to help you, when we have to guess what it is you have been doing.

Arihanth007 commented 1 year ago

I am using the uspto_50.pickle as mentioned. Here is my code to extract the test set from the given dataset.

import pandas as pd
from rdkit import Chem

df = pd.read_pickle(r'uspto_50.pickle')

# input smiles to the model
input = df.apply(lambda x: Chem.MolToSmiles(x['products_mol']) if x['set'] == 'test' else None, axis=1)

# ground truth for the predictions
target = df.apply(lambda x: Chem.MolToSmiles(x['reactants_mol']) if x['set'] == 'test' else None, axis=1)

with open('chemformer_input_test.txt', 'w') as file_handler:
    for item in input.tolist():
        if item is not None:
            file_handler.write("{}\n".format(item))

with open('chemformer_target_test.txt', 'w') as file_handler:
    for item in target.tolist():
        if item is not None:
            file_handler.write("{}\n".format(item))

I have written code to check the accuracy based on the predicted and target smiles string saved by running predict.py. I can report an accuracy of 40.91% when tested on the uspto_sep fine-tuned model (which was also taken from the links provided in the repo).

Could you please let me know what I am doing wrong here?

Arihanth007 commented 1 year ago

Further, I tried running the evaluate.py, and here are the results on the USPTO_50k dataset using the uspto_50k fine-tuned model.

USPTO_50k_Results

From what I understand, the top-1 accuracy is 0.4%.

The command I ran for the above result:

python -m molbart.evaluate \
  --data_path data/uspto_50.pickle \
  --model_path models/uspto_50/last.ckpt \
  --dataset uspto_50 \
  --task forward_prediction \
  --model_type bart \
  --batch_size 512 \
  --num_beams 10 

I haven't changed anything in the code. I have downloaded the fine-tuned models from the link provided by the repo. Please let me know how to reproduce the claimed results of 54% accuracy on the USPTO-50k dataset.

EBjerrum commented 1 year ago

Could you show a sample of 'chemformer_input_test.txt' and 'chemformer_target_test.txt'?

also, consider use DataFrame.query or filter you dataframe, then it get's much easier to save. Then you can simply write to a CSV file using a single column and no header with to_csv command.

Strange with the direct evaluate. Ross, are you around for some input @rssrwn ?

EBjerrum commented 1 year ago

the task is set for forward prediction, but you are testing with 50k, which is retrosynthesis, I wonder if that need to be set for something different.

Arihanth007 commented 1 year ago

Thank you for your suggestion on better use of pandas. I am attaching the test split's input and ground truth files here. chemformer_input_test.txt chemformer_target_test.txt

However, I can't use the target file I generate to compute accuracy as the order of molecules changes in the output file. I suspect this is due to the data loader or something else. But it is not a problem as the output of predict.py is a pandas data frame with the original smile (the target) and the predicted smiles, which I use for computing accuracy. chemformer_pred_test_sep.txt This is the output of predict.py on the given input file. To upload it here, I changed the extension to .txt as it didn't allow for .pickle files.

I also tried passing all the data points in USPTO-50k to the model instead of just the test split. Here are the input files and the results from the model. The accuracy I got was 40.61%. chemformer_input_all.txt chemformer_pred_all_sep.txt

EBjerrum commented 1 year ago

Yes, as you go from one molecule to several in the output, you will have to evaluate each of them on their own to the right one in the target. You will also have to canonicalize the SMILES, it can be the right molecule, but not the same SMILES string., so you need to canonicalize the SMILES, sort the output SMILES and compare them in the right order. Also make sure you are not trying to do forward prediction when you want to do retrosynthesis prediction. As example, you get 40% accuracy when you use the USPTO_SEP model, which is forward prediction, on the dataset that was meant for retrosynthesis, so you may probably have done it in the wrong direction.

Arihanth007 commented 1 year ago

The problem is that I performed the forward prediction when Retrosynthesis is called the backward task in the codebase. I confirmed it by calling the backward task using evaluate.py and got an accuracy of 53.57%. Thank you for clarifying this.

How do I perform this backward task for Retrosynthesis using predict.py? I want to pass it custom input.

Arihanth007 commented 1 year ago

I have figured out how to make this work for Retrosynthesis. I modified the following function in predict.py:

def build_dataset(args):
    input = Path(args.input_path).read_text()
    input = input.split("\n")
    input = [smi for smi in input if smi != "" and smi is not None]

    target = Path(args.target_path).read_text()
    target = target.split("\n")
    target = [smi for smi in target if smi != "" and smi is not None]

    dataset = ReactionDataset(input, target)

    return dataset

The arguments for input and target are as follows (where input is the list of product smiles and target is the list of reactants):

python -m molbart.predict \
  --input_path data/input/chemformer_input_test.txt \
  --target_path data/input/chemformer_target_test.txt \
  --products_path data/uspto_50/chemformer_pred_test_50.pickle \
  --model_path models/uspto_50/last.ckpt \
  --batch_size 64 \
  --num_beams 1

I have canonicalized all the smiles strings and am getting an accuracy of 53.30%, which I believe is within the margin of error. Thank you for your guidance!