
Energy-based modeling of chemical reactions

rxnebm

Improving the performance of models for one-step retrosynthesis through re-ranking

[Summary figure]

Retrosynthesis is at the core of organic chemistry. Recently, the rapid growth of artificial intelligence (AI) has spurred a variety of novel machine learning approaches for data-driven synthesis planning. These methods learn complex patterns from reaction databases in order to predict, for a given product, sets of reactants that can be used to synthesise that product. However, their performance as measured by the top-N accuracy in matching published reaction precedents still leaves room for improvement. This work aims to enhance these models by learning to re-rank their reactant predictions. Specifically, we design and train an energy-based model to re-rank, for each product, the published reaction as the top suggestion and the remaining reactant predictions as lower-ranked. On the standard USPTO-50k benchmark dataset, we show that re-ranking can significantly improve one-step models: RetroSim, a similarity-based method, improves from 35.7% to 51.8% top-1 accuracy, and NeuralSym, a deep-learning method, from 45.7% to 51.3%. We also show that re-ranking the union of two models' suggestions can lead to better performance than either model alone. However, this method does not improve on the state-of-the-art top-1 accuracy.

Citation

If you have used our code or referred to our paper, we would appreciate it if you could cite our work:

@article{lin2022improving,
  title={Improving the performance of models for one-step retrosynthesis through re-ranking},
  author={Lin, Min Htoo and Tu, Zhengkai and Coley, Connor W},
  journal={Journal of cheminformatics},
  volume={14},
  number={1},
  pages={1--13},
  year={2022},
  publisher={Springer}
}
Lin, M.H., Tu, Z. & Coley, C.W. Improving the performance of models for one-step retrosynthesis through re-ranking. J Cheminform 14, 15 (2022).

Environment setup

Using conda

# ensure conda is already initialized
bash setup.sh
conda activate rxnebm

Data preparation / experimental setup

To reproduce the results in our paper, we train each of the 4 one-step models with 3 random seeds.

Thus, we have 3 sets of CSV files (train + valid + test) per one-step model (except RetroSim, which has no seed), which belong in rxnebm/data/cleaned_data/. We then train one EBM re-ranker with a specified random seed (ebm_seed) on one set of CSV files, for a total of 3 repeats per one-step model, e.g. Graph-EBM (seed 0) on NeuralSym seed 0, Graph-EBM (seed 20210423) on NeuralSym seed 20210423, and Graph-EBM (seed 77777777) on NeuralSym seed 77777777. For GLN seed 19260817 and RetroXpert seed 11111111, we use ebm_seed = 0. For RetroSim, we use ebm_seed values of 0, 20210423 and 77777777.

We provide all 39 proposal CSV files on both figshare and Google Drive. For a useful tool to download an entire folder to a Linux server, see prasmussen's gdrive.
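For example, with gdrive 2.x the whole Google Drive folder can be fetched by its folder ID (the folder ID below is a placeholder, and the exact syntax differs between gdrive releases, so check your version's help):

gdrive download --recursive <google_drive_folder_id>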

The training proposal CSV files are quite large (~200 MB), so please ensure you have enough storage space (4.4 GB in total). Note that we have not uploaded the fingerprints or graph features, as those files are much larger: the graph features (train + valid + test) can take up as much as 30 GB, while the fingerprints take ~1 GB. See the proposer sections below for how to generate them yourself. If there is enough demand for us to upload these (very big) files, we may consider doing so.

Training

Before training, ensure you have 1) the 3 CSV files and 2) the 3 precomputed reaction data files (fingerprints, rxn_smi, graphs, etc.). See below for how we generate the reaction data files for each proposer. Note that <ebm_seed> refers to the random seed used to train the EBM re-ranker, and <proposer_seed> refers to the random seed that was used to train the one-step model.
Note: As RetroSim has no random seed, you do not need to provide <proposer_seed>.

If you are reloading a trained checkpoint for whatever reason, you additionally need to provide --old_expt_name <name>, --date_trained <DD_MM_YYYY> and --load_checkpoint.
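For example, a hypothetical invocation (this assumes the training scripts below pass any extra arguments through to the underlying Python entry point; if they do not, add these flags inside the script itself):

bash scripts/<proposer>/GraphEBM.sh <ebm_seed> <proposer_seed> --old_expt_name <name> --date_trained <DD_MM_YYYY> --load_checkpoint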

For FF-EBM

bash scripts/<proposer>/FeedforwardEBM.sh <ebm_seed> <proposer_seed>
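For example, assuming the NeuralSym scripts live under scripts/neuralsym/, training the feedforward EBM with ebm_seed 0 on the NeuralSym seed-0 proposals would be:

bash scripts/neuralsym/FeedforwardEBM.sh 0 0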

For Graph-EBM

bash scripts/<proposer>/GraphEBM.sh <ebm_seed> <proposer_seed>

For Transformer-EBM (note that this yields poor results, and we only report results on RetroSim): to train it, you just need the 3 CSV files, e.g. rxnebm/data/cleaned_data/retrosim_200topk_200maxk_noGT_<phase>.csv

bash scripts/retrosim/TransformerEBM.sh <ebm_seed>
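For example, with ebm_seed 0 on the RetroSim proposals:

bash scripts/retrosim/TransformerEBM.sh 0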

Cleaner USPTO-50K dataset

The data was obtained from the Dropbox folder provided by the authors of GLN. We renamed these 3 CSV files from raw_{phase}.csv to 'schneider50k_train.csv', 'schneider50k_test.csv' and 'schneider50k_valid.csv', and saved them to rxnebm/data/original_data (already included in this repo).

For the re-ranking task, we trained four different retrosynthesis models. We use a single, extra-clean USPTO-50k dataset, split roughly into 80/10/10. It is derived from the three schneider50k_{phase}.csv files using the script rxnebm/data/preprocess/clean_smiles.py, i.e.

    python -m rxnebm.data.preprocess.clean_smiles

This data has been included in this repository under rxnebm/data/cleaned_data/ as 50k_clean_rxnsmi_noreagent_allmapped_cano_{phase}.pickle.
Note that these 3 .pickle files are extremely important, as we will use them as inputs to generate proposals & ground-truth for each one-step model.

Specifically, we perform these steps:

  1. Keep all atom mapping
  2. Remove reaction SMILES strings with product molecules that are too small and clearly incorrect. The criterion used was len(prod_smi) < 3 (see the sketch after this list). Four reaction SMILES strings were caught by this criterion, with the following products:
    • 'CN[C@H]1CC[C@@H](c2ccc(Cl)c(Cl)c2)c2ccc([I:19])cc21>>[IH:19]'
    • 'O=C(CO)N1CCC(C(=O)[OH:28])CC1>>[OH2:28]'
    • 'CC(=O)[Br:4]>>[BrH:4]'
    • 'O=C(Cn1c(-c2ccc(Cl)c(Cl)c2)nc2cccnc21)[OH:10]>>[OH2:10]'
  3. Remove all duplicate reaction SMILES strings
  4. Remove reaction SMILES in the training data that overlap with the validation/test sets, as well as validation data that overlap with the test set.
    • test_appears_in_train: 50
    • test_appears_in_valid: 6
    • valid_appears_in_train: 44
  5. Finally, we obtain an (extra) clean dataset of reaction SMILES:
    • Train: 39713
    • Valid: 4989
    • Test: 5005
  6. Canonicalization: After running clean_smiles.py, we run canonicalize.py in the same folder:
        python -m rxnebm.data.preprocess.canonicalize
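A minimal illustrative sketch of steps 2 and 3 above (this is not the repo's actual clean_smiles.py; rxn_smis is an assumed list of atom-mapped reaction SMILES strings of the form "reactants>>product"):

```python
def clean_rxn_smis(rxn_smis):
    """Sketch of the product-size filter (step 2) and duplicate removal (step 3)."""
    seen, cleaned = set(), []
    for rxn_smi in rxn_smis:
        prod_smi = rxn_smi.split(">>")[-1]
        if len(prod_smi) < 3:    # drop reactions whose product SMILES is suspiciously tiny
            continue
        if rxn_smi in seen:      # drop duplicate reaction SMILES
            continue
        seen.add(rxn_smi)
        cleaned.append(rxn_smi)
    return cleaned
```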

Training and generating proposals for each one-step model

Retrosim, with top-200 predictions (using 200 maximum precedents for product similarity search):

Once either the reaction fingerprints or the graphs have been generated, follow the instructions under Training above to train the EBMs.

GLN, with top-200 predictions (beam_size=200)

RetroXpert, with top-200 predictions (beam_size=200)

NeuralSym, with top-200 predictions

Union of GLN and RetroSim