ShenAoAO / MoleSG

8 stars 0 forks source link

MoleSG

🤖 Introduction

In this study, we propose an effective multi-modality self-supervised learning framework for molecular SMILES and Graph. Specifically, SMILES data and graph data are first tokenized so that they can be processed by a unified Transformer-based backbone network, which is trained by a masked reconstruction strategy. In addition, we introduce a specialized non-overlapping masking strategy to encourage fine-grained interaction between these two modalities.

The code was built based on CoMPT and Chemberta. Thanks a lot for their code sharing!

🔬 Dependencies

Tips: Using code conda install -c conda-forge rdkit can help you install package RDKit quickly.

📚 Dataset

Dataset Tasks Type Molecule Metric
bbbp 1 Graph Classification 2,035 ROC-AUC
tox21 12 Graph Classification 7,821 ROC-AUC
ToxCast 617 Graph Classification 8,575 ROC-AUC
sider 27 Graph Classification 1,379 ROC-AUC
clintox 2 Graph Classification 1,468 ROC-AUC
bace 1 Graph Classification 1,513 ROC-AUC
esol 1 Graph Regression 1,128 RMSE
freesolv 1 Graph Regression 642 RMSE
lipophilicity 1 Graph Regression 4,198 RMSE
QM7 1 Graph Regression 6,830 MAE
QM8 12 Graph Regression 21,786 MAE
QM9 3 Graph Regression 133,885 MAE

📚 Data preprocess

For the original pre-training dataset, you can download the source dataset from Molecule-Net.

For the original downstream dataset, you can download the source dataset from ZINC15.

For your convenience, we provide our processed data and our process code for each dataset in https://drive.google.com/file/d/16MHQk8AkmyqqCI1r0vb4S9o5DT8t7dd_/view?usp=sharing.

🚀 Pre-training

If you want to retrain our pre-train model, you can run:

>> python train_total.py \
    --experiment_name pretrain_test \
    --epochs 30

We provide our pre-trained model in https://drive.google.com/file/d/1BsJyZeBfvl5QMp3gj4EBcwUu3e1Waazm/view?usp=sharing

🚀 Fine-tuning

Note that if you change the downstream benchmark, don't forget to change the corresponding dataset and split! For example:

>> python train_graph.py \
    --experiment_name test \
    --gpu 0 \
    --fold 1 \
    --dataset bbbp \
    --split scaffold \
    --gpu 1 \
    --ckpt_path 'your_pretrained_model_path'

where <seed> is the seed number, <gpu> is the gpu index number, <split> is the split method (except for qm9 is random, all are scaffold), <dataset> is the element name('bbbp', 'tox21', 'toxcast', 'sider', 'clintox', 'bace', 'muv', 'hiv','esol', 'freesolv', 'lipophilicity','qm7','qm8', 'qm9').

All hyperparameters can be tuned in the utils.py

💡 Todo