Pre-training Molecular Graph Representation with 3D Geometry

ICLR 2022

Authors: Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, Jian Tang

[Project Page] [Paper] [ArXiv] [Slides] [Poster]
[NeurIPS SSL Workshop 2021] [ICLR GTRL Workshop 2022 (Spotlight)]

This repository provides the source code for the ICLR'22 paper Pre-training Molecular Graph Representation with 3D Geometry, with the following task:

During pre-training, we consider both the 2D topology and 3D geometry.
During downstream, we consider tasks with 2D topology only.

In the future, we will merge it into the TorchDrug package.

Baselines

For implementation, this repository also provides the following graph SSL baselines:

Environments

Install packages under conda env

conda create -n GraphMVP python=3.7
conda activate GraphMVP

conda install -y -c rdkit rdkit
conda install -y -c pytorch pytorch=1.9.1
conda install -y numpy networkx scikit-learn
pip install ase
pip install git+https://github.com/bp-kelley/descriptastorus
pip install ogb
export TORCH=1.9.0
export CUDA=cu102  # cu102, cu110

wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl
pip install torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl
pip install torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_sparse-0.6.12-cp37-cp37m-linux_x86_64.whl
pip install torch_sparse-0.6.12-cp37-cp37m-linux_x86_64.whl
pip install torch-geometric==1.7.2

Dataset Preprocessing

For dataset download, please follow the instruction here.

For data preprocessing (GEOM), please use the following commands:

cd src_classification
python GEOM_dataset_preparation.py --n_mol 50000 --n_conf 5 --n_upper 1000 --data_folder $SLURM_TMPDIR
cd ..

cd src_regression
python GEOM_dataset_preparation.py --n_mol 50000 --n_conf 5 --n_upper 1000 --data_folder $SLURM_TMPDIR
cd ..

mv $SLURM_TMPDIR/GEOM datasets

Featurization. We employ two sets of featurization methods on atoms.

For classification tasks, in order to follow the main molecular graph SSL research line, we use the same atom featurization methods (consider the atom types and chirality).
For regression tasks, results with the above two atom-level features are too bad. Thus, we consider more comprehensive features from OGB.

Experiments

Terminology specification

In the latest scripts, we use GraphMVP for the trivial GraphMVP (Eq. 7 in the paper), and GraphMVP_hybrid includes two variants adding extra 2D SSL pretext tasks (Eq 8. in the paper). In the previous scripts, we call these two terms as 3D_hybrid_02_masking and 3D_hybrid_03_masking respectively. This could show up in some pre-trained log files here.

GraphMVP	Latest scripts	Previous scripts
Eq. 7	`GraphMVP`	`3D_hybrid_02_masking`
Eq. 8	`GraphMVP_hybrid`	`3D_hybrid_03_masking`

For GraphMVP pre-training

Check the following scripts:

scripts_classification/submit_pre_training_GraphMVP.sh
scripts_classification/submit_pre_training_GraphMVP_hybrid.sh
scripts_regression/submit_pre_training_GraphMVP.sh
scripts_regression/submit_pre_training_GraphMVP_hybrid.sh

The pre-trained model weights, training logs, and prediction files can be found here.

For Other SSL pre-training baselines

Check the following scripts:

scripts_classification/submit_pre_training_baselines.sh
scripts_regression/submit_pre_training_baselines.sh

For Downstream tasks

Check the following scripts:

scripts_classification/submit_fine_tuning.sh
scripts_regression/submit_fine_tuning.sh

Cite Us

Feel free to cite this work if you find it useful to you!

@inproceedings{liu2022pretraining,
    title={Pre-training Molecular Graph Representation with 3D Geometry},
    author={Shengchao Liu and Hanchen Wang and Weiyang Liu and Joan Lasenby and Hongyu Guo and Jian Tang},
    booktitle={International Conference on Learning Representations},
    year={2022},
    url={https://openreview.net/forum?id=xQUe1pOKPam}
}

chao1224 / GraphMVP

readme

Pre-training Molecular Graph Representation with 3D Geometry

Baselines

Environments

Dataset Preprocessing

Experiments

Terminology specification

For GraphMVP pre-training

For Other SSL pre-training baselines

For Downstream tasks

Cite Us