atzkenneth / dragonfly_gen

De novo drug design with deep interactome learning
https://doi.org/10.1038/s41467-024-47613-w
GNU Affero General Public License v3.0

Prospective de novo drug design with deep interactome learning

This repository contains a reference implementation to preprocess the data, as well as to train and apply the de novo design models introduced in Kenneth Atz, Leandro Cotos, Clemens Isert, Maria Håkansson, Dorota Focht, Mattis Hilleke, David F. Nippa, Michael Iff, Jann Ledergerber, Carl C. G. Schiebroek, Valentina Romeo, Jan A. Hiss, Daniel Merk, Petra Schneider, Bernd Kuhn, Uwe Grether, & Gisbert Schneider, Nat. Commun., 15, 3408 (2024).

1. Environment

Create and activate the dragonfly environment.

cd envs/
conda env create -f environment.yml
conda activate dragonfly_gen
poetry install --no-root

Add the path to your dragonfly_gen checkout to PYTHONPATH in your ~/.bashrc file.

export PYTHONPATH="${PYTHONPATH}:<YOUR_PATH>/dragonfly_gen/"

Source your ~/.bashrc.

source ~/.bashrc
conda activate dragonfly_gen

Test your installation by running test_pyg.py.

python test_pyg.py 
>>> torch_geometric.__version__: 2.3.0
>>> torch.__version__: 1.13.1
>>> rdkit.__version__: 2022.09.5
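
If the test script is unavailable or fails, a similar check can be written by hand. The following is a minimal sketch, not the contents of test_pyg.py (which are not shown here); the helper name `report_versions` is hypothetical. It imports each package and reads its `__version__` attribute, reporting `None` for anything that cannot be imported.

```python
import importlib


def report_versions(module_names):
    """Return {module name: __version__ string, or None if unavailable}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None  # module is not installed in this environment
            continue
        # Not every module exposes __version__, so fall back to None.
        versions[name] = getattr(module, "__version__", None)
    return versions


if __name__ == "__main__":
    # Package names taken from the expected output above.
    found = report_versions(["torch_geometric", "torch", "rdkit"])
    for name, version in found.items():
        shown = version if version is not None else "not available"
        print(f"{name}.__version__: {shown}")
```

A `None` entry usually means the dragonfly_gen environment is not active in the current shell.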

2. Sampling from a binding site

To sample from a binding site, the protein must be stored as a PDB file and its ligand as an SDF file in the input/ directory of genfromstructure/:

cd genfromstructure/
ls input/
>>> 3g8i_ligand.sdf 3g8i_protein.pdb

Next, preprocess the files using preprocesspdb.py:

python preprocesspdb.py -pdb_file 3g8i_protein -mol_file 3g8i_ligand -pdb_key 3g8i
>>> Number of embedded atoms: 158 / 158
>>> Writing input/3g8i.h5
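
Before running the preprocessing step on your own target, it can help to confirm that both input files are in place. The sketch below assumes, based on the listing and command above, that `-pdb_file` and `-mol_file` take file names without extension and that the script looks for `<name>.pdb` and `<name>.sdf` in input/. The helper `missing_inputs` is hypothetical and not part of the repository.

```python
from pathlib import Path


def missing_inputs(pdb_name, mol_name, input_dir="input"):
    """Return the expected input files that do not exist yet."""
    expected = [
        Path(input_dir) / f"{pdb_name}.pdb",  # protein structure
        Path(input_dir) / f"{mol_name}.sdf",  # bound ligand
    ]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Names taken from the example command above; run from genfromstructure/.
    missing = missing_inputs("3g8i_protein", "3g8i_ligand")
    if missing:
        print("Missing input files:", ", ".join(str(path) for path in missing))
    else:
        print("Both input files found.")
```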

After preprocessing, apply Dragonfly using sampling.py.

-config 701 will sample molecules biased by the properties of the ligand in the SDF. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.

-config 991 will sample molecules unbiased by the properties.

python sampling.py -config 701 -epoch 151 -T 0.5 -pdb 3g8i -num_mols 100
Sampling 100 molecules:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|  100/100 [00:06<00:00, 14.31it/s]
Number of valid, unique and novel molecules: 88
python sampling.py -config 991 -epoch 163 -T 0.5 -pdb 3g8i -num_mols 100
Sampling 100 molecules:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|  100/100 [00:06<00:00, 13.54it/s]
Number of valid, unique and novel molecules: 99
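
The final log line reports how many sampled molecules are valid, unique and novel. How sampling.py computes this count is not described here. The sketch below only illustrates the usual meaning of the three filters, under the assumptions that validity is a pluggable check (for example, whether a cheminformatics toolkit can parse the string), that uniqueness means deduplication, and that novelty means absence from a reference set such as the training data. The function name and arguments are hypothetical.

```python
def count_valid_unique_novel(sampled, known, is_valid=bool):
    """Count sampled strings that pass is_valid, are not repeated, and are not in known."""
    known = set(known)
    kept = set()
    for smiles in sampled:
        if not is_valid(smiles):
            continue  # discard strings that fail the validity check
        if smiles in known:
            continue  # discard molecules already present in the reference set
        kept.add(smiles)  # the set collapses duplicates
    return len(kept)


if __name__ == "__main__":
    samples = ["CCO", "CCO", "", "CCN", "c1ccccc1"]
    print(count_valid_unique_novel(samples, known={"CCN"}))
```

Plain string comparison only works if all strings are written in the same canonical form; a real implementation would canonicalize the SMILES before comparing.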

The generated molecules are saved in the output/ directory:

ls output/
3g8i.csv
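
To take a quick look at the generated file without assuming its column names (which are not listed here), the following sketch prints the first line and a few rows using only the standard library. Treating the first line as a header is an assumption, and `preview_csv` is a hypothetical helper.

```python
import csv


def preview_csv(path, n_rows=5):
    """Return (first line, next n_rows rows) of a CSV file as lists of strings."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no lines
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    import os

    path = "output/3g8i.csv"  # file name from the listing above
    if os.path.exists(path):
        header, rows = preview_csv(path)
        print(header)
        for row in rows:
            print(row)
    else:
        print(f"{path} not found; run sampling.py first.")
```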

To generate SELFIES, run the following command.

-config 901 will sample molecules biased by the properties of the ligand in the SDF. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.

python sampling.py -config 901 -epoch 194 -T 0.5 -pdb 3g8i -num_mols 100
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|  100/100 [00:08<00:00, 12.13it/s]
Number of valid, unique and novel molecules: 100

The generated molecules are saved in the output/ directory:

ls output/
3g8i.csv
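
The three sampling calls in this section differ only in their `-config` and `-epoch` values, so they can be assembled programmatically. The sketch below builds the argument lists using only the flags shown above; the helper `build_sampling_command` and the `RUNS` list are illustrative and not part of the repository.

```python
def build_sampling_command(config, epoch, pdb_key, t=0.5, num_mols=100):
    """Return the sampling.py call as an argument list (t is the value passed to -T)."""
    return [
        "python", "sampling.py",
        "-config", str(config),
        "-epoch", str(epoch),
        "-T", str(t),
        "-pdb", pdb_key,
        "-num_mols", str(num_mols),
    ]


# (config, epoch) pairs copied from the commands shown in this section.
RUNS = [(701, 151), (991, 163), (901, 194)]

if __name__ == "__main__":
    for config, epoch in RUNS:
        print(" ".join(build_sampling_command(config, epoch, "3g8i")))
```

Each list can be passed to `subprocess.run(command, check=True)` from inside genfromstructure/. Since every run shown above reports output/3g8i.csv, earlier results may be overwritten; copying the file between runs is a safe precaution.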

3. Sampling from a template ligand

In the genfromligand/ directory, Dragonfly can be applied to generate molecules based on a template SMILES string, using the following command.

-config 603 will sample molecules biased by the properties of the ligand given as a SMILES string. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.

-config 680 will sample molecules unbiased by the properties.

cd genfromligand/
python sampling.py -config 603 -epoch 305 -T 0.5 -smi_id rosiglitazone -smi "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3" -num_mols 100
Here is your template SMILES: CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3
Sampling 100 molecules:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|  100/100 [00:05<00:00, 19.28it/s]
Number of valid, unique and novel molecules: 80
python sampling.py -config 680 -epoch 314 -T 0.5 -smi_id rosiglitazone -smi "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3" -num_mols 100
Here is your template SMILES: CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3
Sampling 100 molecules:
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 18.81it/s]
Number of valid, unique and novel molecules: 95

The generated molecules are saved in the output/ directory:

ls output/
rosiglitazone.csv
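
To see which values the property-biased configuration is conditioned on, the six properties named above can be computed for the template with RDKit. This is a sketch under the assumption that the standard RDKit descriptor functions correspond to the properties listed; whether Dragonfly calls exactly these functions is not stated here. The function name `template_properties` and the dictionary keys are hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def template_properties(smiles):
    """Compute the six properties listed above for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles}")
    return {
        "molecular_weight": Descriptors.MolWt(mol),
        "rotatable_bonds": Descriptors.NumRotatableBonds(mol),
        "h_bond_acceptors": Descriptors.NumHAcceptors(mol),
        "h_bond_donors": Descriptors.NumHDonors(mol),
        "polar_surface_area": Descriptors.TPSA(mol),  # topological PSA
        "mol_logp": Descriptors.MolLogP(mol),  # Crippen logP estimate
    }


if __name__ == "__main__":
    rosiglitazone = "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3"
    for name, value in template_properties(rosiglitazone).items():
        print(f"{name}: {value}")
```

Running the same function on rows of the generated CSV gives a rough check of how closely the sampled molecules follow the template's property profile.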

To generate SELFIES, run the following command.

-config 803 will sample molecules biased by the properties of the ligand given as a SMILES string. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.

python sampling.py -config 803 -epoch 341 -T 0.5 -smi_id rosiglitazone -smi "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3" -num_mols 100
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|  100/100 [00:08<00:00, 12.13it/s]
Number of valid, unique and novel molecules: 100

The generated molecules are saved in the output/ directory:

ls output/
rosiglitazone.csv

4. Rank generated molecules based on pharmacophore similarity to the template

To rank the generated molecules based on pharmacophore similarity to the template, navigate to ranking/qsar/ and use the following command:

cd ranking/qsar/
python cats_similarity_ranking.py -smi_file ../../genfromligand/output/rosiglitazone.csv -query "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3"
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 631.86it/s]
The ranked molecules are stored here: ../../genfromligand/output/rosiglitazone_cats.csv
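
The scoring used by cats_similarity_ranking.py is not described here. As a general illustration of similarity ranking, the sketch below sorts candidates by Euclidean distance between descriptor vectors, closest first. The choice of metric, the made-up vectors, and the helper `rank_by_distance` are assumptions for illustration only and do not reproduce the CATS descriptor calculation.

```python
import math


def rank_by_distance(query_vector, candidates):
    """Sort (name, vector) pairs by Euclidean distance to query_vector, closest first."""
    scored = [(math.dist(query_vector, vector), name) for name, vector in candidates]
    return [(name, distance) for distance, name in sorted(scored)]


if __name__ == "__main__":
    # Made-up two-dimensional descriptors; real pharmacophore descriptors are much longer.
    query = [0.0, 0.0]
    candidates = [("mol_a", [3.0, 4.0]), ("mol_b", [1.0, 0.0])]
    for name, distance in rank_by_distance(query, candidates):
        print(name, distance)
```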

5. License

The software was developed at ETH Zurich and is licensed under the AGPL-3.0 license, as described in LICENSE.

6. Citation

@article{atz2023deep,
  title={Prospective de novo drug design with deep interactome learning},
  author={Atz, Kenneth and Mu{\~n}oz, Leandro Cotos and Isert, Clemens and H{\aa}kansson, Maria and Focht, Dorota and Hilleke, Mattis and Nippa, David F and Iff, Michael and Ledergerber, Jann and Schiebroek, Carl CG and Grether, Uwe and Schneider, Gisbert and others},
  year={2024},
  journal={Nat. Commun.},
  publisher={Nature Publishing Group},
  volume={15},
  number={3408}
}