ICLR 2022
Authors: Shengchao Liu, Hanchen Wang, Weiyang Liu, Joan Lasenby, Hongyu Guo, Jian Tang
[Project Page]
[Paper]
[ArXiv]
[Slides]
[Poster]
[NeurIPS SSL Workshop 2021]
[ICLR GTRL Workshop 2022 (Spotlight)]
This repository provides the source code for the ICLR'22 paper Pre-training Molecular Graph Representation with 3D Geometry, with the following task:
In the future, we will merge it into the TorchDrug package.
For implementation, this repository also provides the following graph SSL baselines:
Install the packages in a conda environment:
conda create -n GraphMVP python=3.7
conda activate GraphMVP
conda install -y -c rdkit rdkit
conda install -y -c pytorch pytorch=1.9.1
conda install -y numpy networkx scikit-learn
pip install ase
pip install git+https://github.com/bp-kelley/descriptastorus
pip install ogb
export TORCH=1.9.0
export CUDA=cu102 # cu102, cu110
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl
pip install torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl
pip install torch_scatter-2.0.9-cp37-cp37m-linux_x86_64.whl
wget https://data.pyg.org/whl/torch-${TORCH}%2B${CUDA}/torch_sparse-0.6.12-cp37-cp37m-linux_x86_64.whl
pip install torch_sparse-0.6.12-cp37-cp37m-linux_x86_64.whl
pip install torch-geometric==1.7.2
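The three wget/pip pairs above share one URL pattern, so the wheel location can be derived from the TORCH and CUDA variables. The following is a minimal sketch, not part of the repository: the helper name wheel_url is hypothetical, and the loop only prints the URLs so they can be checked before anything is downloaded. Note that the wheel index is keyed by TORCH=1.9.0 even though the conda command above installs pytorch=1.9.1; this sketch keeps the values exactly as given.

```shell
# Values copied from the install commands above.
TORCH=1.9.0
CUDA=cu102   # or cu110

# Hypothetical helper: build the wheel URL for one package-version string.
# "%%2B" prints a literal "%2B", the URL-encoded "+" used in the index path.
wheel_url() {
  printf 'https://data.pyg.org/whl/torch-%s%%2B%s/%s-cp37-cp37m-linux_x86_64.whl\n' \
    "$TORCH" "$CUDA" "$1"
}

# Dry run: print each URL instead of downloading it.
for pkg in torch_cluster-1.5.9 torch_scatter-2.0.9 torch_sparse-0.6.12; do
  wheel_url "$pkg"
done
```

Once the printed URLs look right, each one can be passed to wget and the downloaded file to pip install, as in the commands above.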
For dataset download, please follow the instructions here.
For data preprocessing (GEOM), please use the following commands:
cd src_classification
python GEOM_dataset_preparation.py --n_mol 50000 --n_conf 5 --n_upper 1000 --data_folder $SLURM_TMPDIR
cd ..
cd src_regression
python GEOM_dataset_preparation.py --n_mol 50000 --n_conf 5 --n_upper 1000 --data_folder $SLURM_TMPDIR
cd ..
mv $SLURM_TMPDIR/GEOM datasets
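The commands above rely on $SLURM_TMPDIR, which is normally set only inside a SLURM job. Outside a cluster, one option is to fall back to a temporary directory. This is a sketch under the assumption that any writable directory is acceptable for --data_folder; the variable name DATA_FOLDER is introduced here for illustration and does not come from the repository.

```shell
# Use the SLURM scratch directory when it is set; otherwise create a
# temporary directory (assumption: any writable directory works).
DATA_FOLDER="${SLURM_TMPDIR:-$(mktemp -d)}"
echo "GEOM preprocessing output folder: $DATA_FOLDER"

# The preprocessing and move steps would then use it in place of
# $SLURM_TMPDIR (shown as comments, not executed here):
#   python GEOM_dataset_preparation.py --n_mol 50000 --n_conf 5 --n_upper 1000 --data_folder "$DATA_FOLDER"
#   mv "$DATA_FOLDER/GEOM" datasets
```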
Featurization. We employ two sets of featurization methods on atoms.
In the latest scripts, we use GraphMVP for the basic GraphMVP (Eq. 7 in the paper) and GraphMVP_hybrid for the two variants that add extra 2D SSL pretext tasks (Eq. 8 in the paper). In the previous scripts, these were called 3D_hybrid_02_masking and 3D_hybrid_03_masking, respectively. The older names may still appear in some of the pre-training log files here.
GraphMVP | Latest scripts | Previous scripts |
---|---|---|
Eq. 7 | GraphMVP | 3D_hybrid_02_masking |
Eq. 8 | GraphMVP_hybrid | 3D_hybrid_03_masking |
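When reading older log files, the renaming in the table can be applied mechanically. The function below is a small illustrative sketch; latest_name is a hypothetical helper, not something the repository provides, and it encodes only the two mappings listed in the table.

```shell
# Hypothetical helper: map a previous script name to its latest name.
# Names not listed in the table are returned unchanged.
latest_name() {
  case "$1" in
    3D_hybrid_02_masking) echo "GraphMVP" ;;
    3D_hybrid_03_masking) echo "GraphMVP_hybrid" ;;
    *) echo "$1" ;;
  esac
}

latest_name 3D_hybrid_02_masking
latest_name 3D_hybrid_03_masking
```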
For GraphMVP pre-training, check the following scripts:
scripts_classification/submit_pre_training_GraphMVP.sh
scripts_classification/submit_pre_training_GraphMVP_hybrid.sh
scripts_regression/submit_pre_training_GraphMVP.sh
scripts_regression/submit_pre_training_GraphMVP_hybrid.sh
The pre-trained model weights, training logs, and prediction files can be found here.
For pre-training the baselines, check the following scripts:
scripts_classification/submit_pre_training_baselines.sh
scripts_regression/submit_pre_training_baselines.sh
For fine-tuning, check the following scripts:
scripts_classification/submit_fine_tuning.sh
scripts_regression/submit_fine_tuning.sh
Feel free to cite this work if you find it useful!
@inproceedings{liu2022pretraining,
title={Pre-training Molecular Graph Representation with 3D Geometry},
author={Shengchao Liu and Hanchen Wang and Weiyang Liu and Joan Lasenby and Hongyu Guo and Jian Tang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=xQUe1pOKPam}
}