This repository contains a reference implementation to preprocess the data, as well as to train and apply the de novo design models introduced in Kenneth Atz, Leandro Cotos, Clemens Isert, Maria Håkansson, Dorota Focht, Mattis Hilleke, David F. Nippa, Michael Iff, Jann Ledergerber, Carl C. G. Schiebroek, Valentina Romeo, Jan A. Hiss, Daniel Merk, Petra Schneider, Bernd Kuhn, Uwe Grether, & Gisbert Schneider, Nat. Commun., 15, 3408 (2024).
Create and activate the dragonfly environment.
cd envs/
conda env create -f environment.yml
conda activate dragonfly_gen
poetry install --no-root
Add the dragonfly_gen path to PYTHONPATH in your ~/.bashrc file.
export PYTHONPATH="${PYTHONPATH}:<YOUR_PATH>/dragonfly_gen/"
Source your ~/.bashrc.
source ~/.bashrc
conda activate dragonfly_gen
Test your installation by running test_pyg.py.
python test_pyg.py
>>> torch_geometric.__version__: 2.3.0
>>> torch.__version__: 1.13.1
>>> rdkit.__version__: 2022.09.5
To preprocess the binding site for a given protein stored as a PDB file and its ligand as an SDF file in the input/ directory, use the following commands:
cd genfromstructure/
ls input/
>>> 3g8i_ligand.sdf 3g8i_protein.pdb
Next, preprocess the files using preprocesspdb.py:
python preprocesspdb.py -pdb_file 3g8i_protein -mol_file 3g8i_ligand -pdb_key 3g8i
>>> Number of embedded atoms: 158 / 158
>>> Writing input/3g8i.h5
After preprocessing, apply Dragonfly using sampling.py.
-config 701 will sample molecules biased by the properties of the ligand in the SDF. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.
-config 991 will sample molecules unbiased by the properties.
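To get a feel for the six properties listed above, they can be computed for any molecule with RDKit, which the dragonfly_gen environment installs. This is only a sketch: the function name `ligand_properties` and the dictionary keys are hypothetical, and the choice of RDKit functions is an assumption (the text names only MolLogP), so the values may not match what Dragonfly computes internally.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors


def ligand_properties(smiles):
    """Compute the six listed properties for one molecule given as SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Which RDKit functions Dragonfly itself uses is an assumption.
    return {
        "molecular_weight": Descriptors.MolWt(mol),
        "rotatable_bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "h_bond_acceptors": rdMolDescriptors.CalcNumHBA(mol),
        "h_bond_donors": rdMolDescriptors.CalcNumHBD(mol),
        "polar_surface_area": rdMolDescriptors.CalcTPSA(mol),
        "mol_logp": Crippen.MolLogP(mol),
    }


if __name__ == "__main__":
    # Rosiglitazone, used here purely as an example molecule.
    example = "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3"
    for key, value in ligand_properties(example).items():
        print(f"{key}: {value}")
```

For a ligand stored as an SDF file, `Chem.MolFromMolFile` can load the molecule instead of `Chem.MolFromSmiles`.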
python sampling.py -config 701 -epoch 151 -T 0.5 -pdb 3g8i -num_mols 100
Sampling 100 molecules:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:06<00:00, 14.31it/s]
Number of valid, unique and novel molecules: 88
python sampling.py -config 991 -epoch 163 -T 0.5 -pdb 3g8i -num_mols 100
Sampling 100 molecules:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:06<00:00, 13.54it/s]
Number of valid, unique and novel molecules: 99
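How sampling.py decides that a molecule is "valid, unique and novel" is not described here. One common reading is sketched below: a sample counts if it passes a validity check, has not been sampled before, and does not appear in a reference set of known molecules. The function name, the `is_valid` callback, and the `known` set are all hypothetical.

```python
def count_valid_unique_novel(sampled, known, is_valid):
    """Count samples that pass is_valid, are not repeats, and are absent from known."""
    seen = set()
    count = 0
    for smiles in sampled:
        if not is_valid(smiles):
            continue  # rejected: fails the validity check
        if smiles in seen:
            continue  # rejected: duplicate of an earlier sample
        seen.add(smiles)
        if smiles in known:
            continue  # rejected: already known, so not novel
        count += 1
    return count


if __name__ == "__main__":
    # Toy data; bool() stands in for a real validity check such as an RDKit parse.
    samples = ["CCO", "CCO", "", "CCN", "C"]
    print(count_valid_unique_novel(samples, known={"C"}, is_valid=bool))
```

The comparison is on raw strings, so it only behaves as intended if all SMILES are canonicalized the same way beforehand.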
The generated molecules are saved in the output/ directory:
ls output/
3g8i.csv
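The column layout of the output CSV is not documented here, so a cautious first step is to read the header and count the rows rather than assume column names. The sketch below does this with the standard library; `summarize_csv` is a hypothetical helper that accepts any iterable of lines, such as an open file handle.

```python
import csv


def summarize_csv(lines):
    """Return the header row and the number of data rows from CSV lines."""
    reader = csv.reader(lines)
    header = next(reader, [])  # empty list if the input has no rows at all
    n_rows = sum(1 for _ in reader)
    return header, n_rows
```

For example, opening output/3g8i.csv with `open("output/3g8i.csv", newline="")` and passing the handle to `summarize_csv` returns the column names and the number of stored molecules.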
To generate SELFIES, run the following command.
-config 901 will sample molecules biased by the properties of the ligand in the SDF. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.
python sampling.py -config 901 -epoch 194 -T 0.5 -pdb 3g8i -num_mols 100
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:08<00:00, 12.13it/s]
Number of valid, unique and novel molecules: 100
The generated molecules are saved in the output/ directory:
ls output/
3g8i.csv
In the genfromligand/ directory, Dragonfly can be applied to generate molecules based on a template SMILES string, using the following commands.
-config 603 will sample molecules biased by the properties of the ligand in the SMILES string. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.
-config 680 will sample molecules unbiased by the properties.
cd genfromligand/
python sampling.py -config 603 -epoch 305 -T 0.5 -smi_id rosiglitazone -smi "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3" -num_mols 100
Here is your template SMILES: CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3
Sampling 100 molecules:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 19.28it/s]
Number of valid, unique and novel molecules: 80
python sampling.py -config 680 -epoch 314 -T 0.5 -smi_id rosiglitazone -smi "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3" -num_mols 100
Here is your template SMILES: CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3
Sampling 100 molecules:
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:05<00:00, 18.81it/s]
Number of valid, unique and novel molecules: 95
The generated molecules are saved in the output/ directory:
ls output/
rosiglitazone.csv
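SMILES strings contain characters such as parentheses and `=` that the shell would otherwise interpret, which is why the commands above quote the template. When scripting runs for several templates, building the argument list in Python sidesteps shell quoting. The sketch below uses only the flags shown in this README; the helper name `sampling_command` and the `templates` dictionary are hypothetical.

```python
import shlex


def sampling_command(config, epoch, t, smi_id, smiles, num_mols):
    """Build the sampling.py argument list; t is the value passed to -T."""
    return [
        "python", "sampling.py",
        "-config", str(config),
        "-epoch", str(epoch),
        "-T", str(t),
        "-smi_id", smi_id,
        "-smi", smiles,
        "-num_mols", str(num_mols),
    ]


if __name__ == "__main__":
    # Hypothetical mapping of template names to SMILES strings.
    templates = {
        "rosiglitazone": "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3",
    }
    for smi_id, smiles in templates.items():
        cmd = sampling_command(603, 305, 0.5, smi_id, smiles, 100)
        # Print a shell-safe version of the command for inspection.
        print(shlex.join(cmd))
        # To run it instead: subprocess.run(cmd, check=True, cwd="genfromligand")
```

Passing the list directly to `subprocess.run` means the SMILES string reaches sampling.py unchanged, with no shell involved.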
To generate SELFIES, run the following command.
-config 803 will sample molecules biased by the properties of the ligand in the SMILES string. Properties include molecular weight, rotatable bonds, hydrogen bond acceptors, hydrogen bond donors, polar surface area, and lipophilicity expressed as MolLogP.
python sampling.py -config 803 -epoch 341 -T 0.5 -smi_id rosiglitazone -smi "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3" -num_mols 100
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:08<00:00, 12.13it/s]
Number of valid, unique and novel molecules: 100
The generated molecules are saved in the output/ directory:
ls output/
rosiglitazone.csv
To rank the generated molecules based on pharmacophore similarity to the template, navigate to ranking/qsar/ and use the following command:
cd ranking/qsar/
python cats_similarity_ranking.py -smi_file ../../genfromligand/output/rosiglitazone.csv -query "CN(CCOC1=CC=C(C=C1)CC2C(=O)NC(=O)S2)C3=CC=CC=N3"
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 631.86it/s]
The ranked molecules are stored here: ../../genfromligand/output/rosiglitazone_cats.csv
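The internals of cats_similarity_ranking.py are not shown here. As a generic illustration of descriptor-based ranking, the sketch below sorts candidates by Euclidean distance between precomputed descriptor vectors and a query vector, nearest first. The function name, the toy vectors, and the choice of Euclidean distance are all assumptions; the actual script may compute and compare pharmacophore descriptors differently.

```python
import math


def rank_by_distance(query_vec, candidates):
    """Sort (name, vector) pairs by Euclidean distance to query_vec, nearest first."""
    scored = [(name, math.dist(query_vec, vec)) for name, vec in candidates]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy two-dimensional vectors standing in for real descriptor vectors.
    query = [0.0, 0.0]
    molecules = [("mol_a", [3.0, 4.0]), ("mol_b", [1.0, 0.0])]
    for name, distance in rank_by_distance(query, molecules):
        print(name, distance)
```

A smaller distance means the candidate's descriptor vector is closer to the template's, so the most similar molecules appear first.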
The software was developed at ETH Zurich and is licensed under the AGPL-3.0 license, as described in LICENSE.
@article{atz2023deep,
title={Prospective de novo drug design with deep interactome learning},
author={Atz, Kenneth and Mu{\~n}oz, Leandro Cotos and Isert, Clemens and H{\aa}kansson, Maria and Focht, Dorota and Hilleke, Mattis and Nippa, David F and Iff, Michael and Ledergerber, Jann and Schiebroek, Carl CG and Grether, Uwe and Schneider, Gisbert and others},
year={2024},
journal = {Nat. Commun.},
publisher = {Nature Publishing Group},
volume = 15,
number = 3408
}