This repository contains the POLYGON framework, a de novo molecular generator for polypharmacology. Akin to de novo portrait generation, POLYGON attempts to optimize the chemical space for multiple protein target domains.
The codebase is primarily adapted from two excellent de novo molecular design frameworks:
GuacaMol for reward-based reinforcement learning: https://github.com/BenevolentAI/guacamol
MOSES for the VAE implementation: https://github.com/molecularsets/moses
A key resource for the POLYGON framework is experimental binding data for small-molecule ligands. We use BindingDB as the source of this information; it can be downloaded here: https://www.bindingdb.org/rwd/bind/chemsearch/marvin/Download.jsp
Input molecule training datasets are available from the GuacaMol package: https://github.com/BenevolentAI/guacamol
POLYGON has been tested on Python version 3.9.16.
Installation of POLYGON with pip will automatically install the necessary dependencies. These can also be installed manually with conda:
conda install -c conda-forge rdkit
conda install pytorch::pytorch -c pytorch
conda install numpy pandas scikit-learn
git clone https://github.com/bpmunson/polygon.git
cd polygon
pip install .
Optionally, install cudatoolkit for GPU acceleration in PyTorch, for example:
conda install cudatoolkit=11.1 -c conda-forge
Alternatively, see https://pytorch.org/ for specific installation instructions.
Installation time is on the order of minutes.
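To confirm that the environment is ready before training, a short check along the following lines may help. This is a minimal sketch, not part of POLYGON: the helper `missing_modules` is a hypothetical name, and the module list is simply the import names of the dependencies listed above.

```python
import importlib.util
import sys

# Import names of the dependencies listed in the install commands above.
# The import name can differ from the package name
# (pytorch -> torch, scikit-learn -> sklearn).
REQUIRED_MODULES = ["rdkit", "torch", "numpy", "pandas", "sklearn"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found in this environment.

    Hypothetical helper for illustration; it only locates each module
    and does not import it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

If PyTorch is installed, `torch.cuda.is_available()` reports whether a GPU is visible to it.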
Example Usage:
Pretrain the VAE to encode a chemical embedding:
polygon train \
--train_data ../data/guacamol_v1_train.smiles \
--log_file log.txt \
--save_frequency 25 \
--model_save model.pt \
--n_epoch 200 \
--n_batch 1024 \
--debug \
--d_dropout 0.2 \
--device cpu
Train ligand binding models for two protein targets:
polygon train_ligand_binding_model \
--uniprot_id Q02750 \
--binding_db_path BindingDB_All.csv \
--output_path Q02750_ligand_binding.pkl
polygon train_ligand_binding_model \
--uniprot_id P42345 \
--binding_db_path BindingDB_All.csv \
--output_path P42345_ligand_binding.pkl
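When more than a couple of targets are involved, the same command can be generated per UniProt ID from a small script. The sketch below is an illustration only: `build_train_command` and `train_all` are hypothetical helpers, the flags are the ones shown in the commands above, and the output file naming simply mirrors those examples. It defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess


def build_train_command(uniprot_id, binding_db_path="BindingDB_All.csv"):
    """Build the argument list for one `polygon train_ligand_binding_model` call.

    Hypothetical helper; it uses only the flags shown in the README commands.
    """
    return [
        "polygon", "train_ligand_binding_model",
        "--uniprot_id", uniprot_id,
        "--binding_db_path", binding_db_path,
        "--output_path", f"{uniprot_id}_ligand_binding.pkl",
    ]


def train_all(uniprot_ids, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for uniprot_id in uniprot_ids:
        command = build_train_command(uniprot_id)
        print(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError if polygon exits non-zero.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    train_all(["Q02750", "P42345"])
```

Passing the arguments as a list avoids shell quoting issues and the missing line-continuation mistakes that are easy to make when typing multi-line commands by hand.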
Use the chemical embedding to design polypharmacology compounds:
polygon generate \
--model_path ../data/pretrained_vae_model.pt \
--scoring_definition scoring_definition.csv \
--max_len 100 \
--n_epochs 200 \
--mols_to_sample 8192 \
--optimize_batch_size 512 \
--optimize_n_epochs 2 \
--keep_top 4096 \
--opti gauss \
--outF molecular_generation \
--device cpu \
--save_payloads \
--n_jobs 4 \
--debug
The expected runtime for POLYGON is on the order of hours.
POLYGON will output designs as SMILES strings in a text file. For example:
$ head GDM_final_molecules.txt
Fc1cc(F)cc(CC(Nc2ccc3ncccc3c2)c2cccnc2)c1
N[SH](=O)(O)c1cccc(S(=O)(=O)O)c1
N#Cc1cc(C(N)=NO)ccc1Nc1nccc2ccnn12
CN(CN=C(O)c1ccco1)Nc1nccs1
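The output file can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the file holds one SMILES string per line as in the listing above; `load_unique_smiles` and `is_parsable` are hypothetical helpers and not part of POLYGON. The RDKit check is skipped if RDKit is not installed.

```python
from pathlib import Path


def load_unique_smiles(path):
    """Read one SMILES string per line, dropping blank lines and duplicates.

    Order of first appearance is preserved. Duplicates are detected by
    exact string match only.
    """
    seen = set()
    unique = []
    for line in Path(path).read_text().splitlines():
        smiles = line.strip()
        if smiles and smiles not in seen:
            seen.add(smiles)
            unique.append(smiles)
    return unique


def is_parsable(smiles):
    """Return True/False if RDKit can parse the string, or None if RDKit is absent."""
    try:
        from rdkit import Chem
    except ImportError:
        return None
    # MolFromSmiles returns None when the string cannot be parsed.
    return Chem.MolFromSmiles(smiles) is not None


if __name__ == "__main__":
    output_file = Path("GDM_final_molecules.txt")
    if output_file.exists():
        molecules = load_unique_smiles(output_file)
        print(len(molecules), "unique SMILES strings")
```

Because the comparison is on raw strings, two different SMILES spellings of the same molecule are counted separately; canonicalizing with RDKit first would be needed to merge them.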