LCY02 / ABT-MPNN

An atom-bond transformer-based message passing neural network for molecular property prediction.
MIT License

ABT-MPNN: An atom-bond transformer-based message passing neural network for molecular property prediction

Introduction

This repository provides the code and materials associated with the manuscript ABT-MPNN: An atom-bond transformer-based message passing neural network for molecular property prediction.

Illustration of ABT-MPNN

We acknowledge the paper Yang et al. (2019). Analyzing Learned Molecular Representations for Property Prediction. JCIM, 59(8), 3370–3388 and the Chemprop repository (version 1.2.0), on which this code builds.

Dependencies

cuda >= 8.0 + cuDNN
python>=3.6
flask>=1.1.2
gunicorn>=20.0.4
hyperopt>=0.2.3
matplotlib>=3.1.3
numpy>=1.18.1
pandas>=1.0.3
pandas-flavor>=0.2.0
pip>=20.0.2
pytorch>=1.4.0
rdkit>=2020.03.1.0
scipy>=1.4.1
tensorboardX>=2.0
torchvision>=0.5.0
tqdm>=4.45.0
einops>=0.3.2
seaborn>=0.11.1

Install the dependencies via conda:

conda env create -f environment.yml
conda activate abtmpnn

Data

The data file must be a CSV file with a header row. For example:

smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
...

Data sets used in our study are available in the data directory of this repository.
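
As a quick sanity check before training, the sketch below parses the two example rows shown above with Python's standard csv module. It assumes that the first column holds the SMILES string, that every remaining column is one task, and that an empty cell means the label is missing for that task; read_dataset is a hypothetical helper written for this illustration, not part of the repository.

```python
import csv
import io

# Header and two rows copied from the example above.
# A real run would open the CSV file instead of using an in-memory string.
SAMPLE = """smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
"""


def read_dataset(handle):
    """Return (task_names, rows); each row is (smiles, [label or None, ...]).

    Hypothetical helper: assumes column 0 is SMILES and blank cells are
    missing labels.
    """
    reader = csv.reader(handle)
    header = next(reader)          # the required header row
    task_names = header[1:]
    rows = []
    for line_number, record in enumerate(reader, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"line {line_number}: expected {len(header)} columns, "
                f"got {len(record)}"
            )
        smiles, raw_labels = record[0], record[1:]
        labels = [float(v) if v.strip() else None for v in raw_labels]
        rows.append((smiles, labels))
    return task_names, rows


task_names, rows = read_dataset(io.StringIO(SAMPLE))
missing_per_row = [sum(label is None for label in labels) for _, labels in rows]
print(len(task_names), "tasks;", "missing labels per row:", missing_per_row)
```

To check a real file, pass an open file handle, for example open("data/freesolv.csv", newline=""), in place of the in-memory string.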

Featurization

To save adjacency / distance / Coulomb matrices for a dataset, run:

python save_atom_features.py --data_path <path> --save_dir <dir> --adjacency --coulomb --distance

where <path> is the path to a CSV file containing a dataset, and <dir> is the directory where the inter-atomic matrices will be saved. Pass --adjacency, --distance, and/or --coulomb to choose which matrices are generated.

For example:

python save_atom_features.py --data_path data/freesolv.csv --save_dir features/freesolv/ --adjacency --coulomb --distance
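
For readers unfamiliar with the three inter-atomic matrices, here is a minimal pure-Python sketch of their textbook definitions. It is not the code in save_atom_features.py: the script's conventions (units, hydrogen handling, conformer generation, normalization) are not described here and may differ. The function names and the three-atom geometry are assumptions made up for this illustration. The Coulomb matrix uses the common form with 0.5 * Z_i^2.4 on the diagonal and Z_i * Z_j / d_ij off the diagonal.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D atom positions."""
    n = len(coords)
    return [
        [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix; bonds is a list of (i, j) atom-index pairs."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


def coulomb_matrix(atomic_numbers, coords):
    """Common Coulomb-matrix form: 0.5*Z^2.4 on the diagonal, Z_i*Z_j/d_ij elsewhere."""
    dist = distance_matrix(coords)
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / dist[i][j]
    return matrix


# Illustrative water-like geometry (O, H, H); the coordinates are made up.
atomic_numbers = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
bonds = [(0, 1), (0, 2)]

adj = adjacency_matrix(len(atomic_numbers), bonds)
dist = distance_matrix(coords)
clb = coulomb_matrix(atomic_numbers, coords)
```

All three matrices are square, with one row and one column per atom, and symmetric.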

To save molecule-level RDKit 2D features (CDF-normalized version) for a dataset, run:

python save_features.py --data_path <path1> --save_path <path2> --features_generator rdkit_2d_normalized

where <path1> is the path to a CSV file containing a dataset, and <path2> is the path where the molecule-level features will be saved. Passing rdkit_2d_normalized to --features_generator produces the CDF-normalized version of the 200 RDKit descriptors.

For example:

python save_features.py --data_path data/freesolv.csv --save_path features/freesolv/rdkit_norm.npz --features_generator rdkit_2d_normalized

Training

To train an ABT-MPNN model, run:

python train.py --data_path <path> --dataset_type <type> --save_dir <dir> --bond_fast_attention --atom_attention --adjacency --adjacency_path <adj_path> --distance --distance_path <dist_path> --coulomb --coulomb_path <clb_path> --normalize_matrices --features_path <molf_path> --no_features_scaling

Note: a full list of available command-line arguments can be found in chemprop/args.py.

Cross validation

k-fold cross-validation can be run by specifying the --num_folds argument (which is 1 by default).

For example:

python train.py --data_path data/freesolv.csv --dataset_type regression --save_dir data_test/freesolv --bond_fast_attention --atom_attention --adjacency --adjacency_path features/freesolv/adj.npz --distance --distance_path features/freesolv/dist.npz --coulomb --coulomb_path features/freesolv/clb.npz --normalize_matrices --features_path features/freesolv/rdkit_norm.npz --split_type random --no_features_scaling --num_folds 5 --gpu 0
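
Because the training command is long, it can be convenient to assemble it in a script. The sketch below rebuilds the example command above from a few parameters. build_train_command is a hypothetical helper, not part of the repository. The flags and the file names adj.npz, dist.npz, clb.npz, and rdkit_norm.npz are copied from the example and are assumed to match what the featurization steps produced. The optional --gpu argument is left out.

```python
import shlex


def build_train_command(data_path, dataset_type, save_dir, feature_dir,
                        num_folds=1, split_type="random"):
    """Assemble the train.py argument list used in the example above.

    Hypothetical helper; file names inside feature_dir follow the example.
    """
    feature_dir = feature_dir.rstrip("/")
    return [
        "python", "train.py",
        "--data_path", data_path,
        "--dataset_type", dataset_type,
        "--save_dir", save_dir,
        "--bond_fast_attention", "--atom_attention",
        "--adjacency", "--adjacency_path", f"{feature_dir}/adj.npz",
        "--distance", "--distance_path", f"{feature_dir}/dist.npz",
        "--coulomb", "--coulomb_path", f"{feature_dir}/clb.npz",
        "--normalize_matrices",
        "--features_path", f"{feature_dir}/rdkit_norm.npz",
        "--split_type", split_type,
        "--no_features_scaling",
        "--num_folds", str(num_folds),
    ]


command = build_train_command(
    data_path="data/freesolv.csv",
    dataset_type="regression",
    save_dir="data_test/freesolv",
    feature_dir="features/freesolv/",
    num_folds=5,
)

# Quote each part so the printed line can be pasted into a shell.
command_line = " ".join(shlex.quote(part) for part in command)
print(command_line)
```

The list can be passed to subprocess.run(command, check=True) to launch training from Python. Keeping the arguments as a list avoids shell-quoting problems with unusual paths.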

Predicting

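To load a trained model and make predictions, run predict.py and specify the data to predict on (--test_path), the directory containing the trained checkpoint (--checkpoint_dir), and the path where the predictions CSV will be saved (--preds_path).

If features were used during training, they must be supplied again at prediction time, using the same type of features as before (--adjacency_path, --distance_path, --coulomb_path, and --features_path, together with --normalize_matrices and --no_features_scaling if these were set during training).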

For example:

python predict.py --test_path data/freesolv.csv --checkpoint_dir data_test/freesolv --preds_path data_test/freesolv/pred.csv --adjacency_path features/freesolv/adj.npz --distance_path features/freesolv/dist.npz --coulomb_path features/freesolv/clb.npz --features_path features/freesolv/rdkit_norm.npz --normalize_matrices --no_features_scaling

Visualization of attention weights

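To visualize atomic attention and save similarity maps, run see_attention.py and specify the input data (--test_path), the directory containing the trained checkpoint (--checkpoint_dir), the path for the predictions CSV (--preds_path), and the directory where the similarity maps will be saved (--viz_dir).

If features were used during training, they must be supplied again here, using the same type of features as before.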

For example:

python see_attention.py --test_path data/freesolv.csv --checkpoint_dir data_test/freesolv --preds_path data_test/freesolv/pred.csv --viz_dir data_test/freesolv/similarity_maps --adjacency_path features/freesolv/adj.npz --distance_path features/freesolv/dist.npz --coulomb_path features/freesolv/clb.npz --features_path features/freesolv/rdkit_norm.npz --normalize_matrices --no_features_scaling