Prediction of enrichment from molecular structure, using DNA-encoded libraries and machine learning with a probabilistic loss function
triazine_lib_sEH_SIRT2_QSAR.csv.gz in the experiments/datasets folder
pubchem_smiles.npy.gz in the experiments/visualizations folder
To generate and store molecular fingerprints in an HDF5 file, navigate to the experiments folder and run:
python fps_preprocessing.py --csv </path/to/data.csv from experiments/datasets folder> --fps_h5 <filename for HDF5 file to create> --fp_size <fingerprint size (number of bits)>
To train/evaluate models, navigate to the experiments folder.
The following is a general command for running one of the scripts:
python <script_name.py> --csv </path/to/data.csv from experiments/datasets folder> --out <experiment label (name of results subfolder to create)> --device <device (set to cuda:0 if using GPU)>
Additional command-line arguments include:
--exp <column header(s) for barcode counts in experiment with protein(s) of interest>
--beads <column header(s) for barcode counts in beads-only control without protein(s) of interest>
--exp
The script depends on the dataset/model type/task. Further details are provided below, including more specific command-line arguments.
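As a rough illustration of the general command above, the following sketch assembles the argument list in Python. The helper name build_command, the example paths, and the column headers are hypothetical, as is the "cpu" default for --device; passing several column headers as separate space-separated values is also an assumption, so check each script's argument parser before relying on it.

```python
import shlex


def build_command(script, csv_path, out_label, device="cpu", exp=None, beads=None):
    """Assemble the argument list for one of the training/evaluation scripts.

    Only flags named in this README are used. How multiple column headers
    are passed (separate values, as here) is an assumption.
    """
    cmd = ["python", script, "--csv", csv_path, "--out", out_label, "--device", device]
    if exp:
        cmd += ["--exp", *exp]
    if beads:
        cmd += ["--beads", *beads]
    return cmd


cmd = build_command(
    "single_model_run.py",
    "datasets/my_data.csv",   # hypothetical path, relative to the experiments folder
    "my_experiment",          # hypothetical label for the results subfolder
    device="cuda:0",          # set to cuda:0 if using GPU
    exp=["exp_count"],        # hypothetical column header
    beads=["beads_count"],    # hypothetical column header
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell from the experiments folder, or the list can be handed to subprocess.run to launch the script directly.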
Scripts:
single_model_run.py (training and evaluating a model with user-provided hyperparameter values)
hyperparameter_optimization_DD1S_CAIX_FFN.py (optimizing feed-forward neural networks)
hyperparameter_optimization_DD1S_CAIX_MPN.py (optimizing directed message passing neural networks)
hyperparameter_optimization_triazine_FFN.py (optimizing feed-forward neural networks)
triazine_MPN_LR_tuning.py (tuning learning rate for directed message passing neural networks)
triazine_MPN.py (training and evaluating directed message passing neural networks)
Additional command-line arguments include:
--featurizer <type of molecule featurization>
    'fingerprint' or 'onehot'
    'graph'
--splitter <type of data split for train/validation/test>
    'random'
    'cycle1', 'cycle2', 'cycle3' (for example: 'cycle1','cycle3')
--seed <random seed for data splitting and weight initialization>
--task_type <task type>
    'regression'
    'classification'
--loss_fn_train <loss function to use during training>
    'BCE'
    'nlogprob' (probabilistic loss function)
    'MSE' (mean squared error)
--max_epochs <maximum number of epochs>
--patience <patience>
--max_norm <max norm>
For fingerprint featurization (if reading from an HDF5 file with stored fingerprints):
--fps_h5 </path/to/file_with_stored_fingerprints.h5 from experiments folder>
For hyperparameter optimization (including triazine_MPN_LR_tuning.py):
--n_trials <number of trials to run/hyperparameter sets to try>
For single_model_run.py:
--model_type <model type>
    'MLP' (feed-forward neural network)
    'MoleculeModel' (directed message passing neural network)
--lr <initial learning rate>
--dropout <dropout rate>
--eval_metric <loss function used to evaluate the model>
    'NLL' (negative log-likelihood)
    'MSE' (mean-squared error)
--layer_sizes <hidden layer sizes>
--depth <number of message-passing steps>
--hidden_size <size of hidden layers>
--ffn_num_layers <number of feed-forward network layers>
For directed message passing neural networks:
--num_workers <number of workers for loading data>
For directed message passing neural networks specifically on the triazine sEH and SIRT2 datasets:
--depth <number of message-passing steps>
--hidden_size <size of hidden layers>
--ffn_num_layers <number of feed-forward network layers>
For triazine_MPN.py:
--lr <initial learning rate>
--dropout <dropout rate>
For binary classifiers:
--threshold_type <type of threshold for determining ground truth labels>
    'percentile'
    'fixed'
    'percentile'
--threshold_val <threshold value; percentile or exact value>
    99.5 to define the top 0.5% of training set compounds as enriched
Scripts:
bin_eval.py (fixed threshold)
bin_eval_multiple_thresholds.py (multiple thresholds)
mse_loss_eval.py
rank_corr_coeff_eval.py
These scripts require
(1) an experiments/models folder with saved regression models (.torch files) organized by dataset/model type and named by data split/seed, as follows (for brevity, only filenames for the random splits are shown; cycle-split models should also be included, replacing random with cycle1, ..., cycle12, ..., cycle123 in the filename):
└── models
    ├── DD1S_CAIX
    │   ├── D-MPNN
    │   │   ├── random_seed_0.torch
    │   │   ├── random_seed_1.torch
    │   │   ├── random_seed_2.torch
    │   │   ├── random_seed_3.torch
    │   │   └── random_seed_4.torch
    │   │
    │   ├── D-MPNN_pt
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   ├── FP-FFNN
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   ├── FP-FFNN_pt
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   ├── OH-FFNN
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   └── OH-FFNN_pt
    │       └── (same as for models/DD1S_CAIX/D-MPNN)
    │
    ├── triazine_sEH
    │   ├── D-MPNN
    │   │   ├── random_seed_0.torch
    │   │   ├── random_seed_1.torch
    │   │   └── random_seed_2.torch
    │   │
    │   ├── D-MPNN_pt
    │   │   └── (same as for models/triazine_sEH/D-MPNN)
    │   │
    │   ├── FP-FFNN
    │   │   ├── random_seed_0.torch
    │   │   ├── random_seed_1.torch
    │   │   ├── random_seed_2.torch
    │   │   ├── random_seed_3.torch
    │   │   └── random_seed_4.torch
    │   │
    │   ├── FP-FFNN_pt
    │   │   └── (same as for models/triazine_sEH/FP-FFNN)
    │   │
    │   ├── OH-FFNN
    │   │   └── (same as for models/triazine_sEH/FP-FFNN)
    │   │
    │   └── OH-FFNN_pt
    │       └── (same as for models/triazine_sEH/FP-FFNN)
    │
    ├── triazine_SIRT2
    │   └── (same as for models/triazine_sEH)
    │
    └── triazine_sEH_SIRT2_multi-task
        ├── D-MPNN
        │   ├── random_seed_0.torch
        │   ├── random_seed_1.torch
        │   └── random_seed_2.torch
        │
        ├── FP-FFNN
        │   ├── random_seed_0.torch
        │   ├── random_seed_1.torch
        │   ├── random_seed_2.torch
        │   ├── random_seed_3.torch
        │   └── random_seed_4.torch
        │
        └── OH-FFNN
            └── (same as for models/triazine_sEH_SIRT2_multi-task/FP-FFNN)
(2) a csv with the hyperparameter values of the saved models, formatted like the following (example values are shown):
dataset | model type | seed | split | layer sizes | dropout | depth | hidden size | FFN num layers |
---|---|---|---|---|---|---|---|---|
DD1S_CAIX | D-MPNN | 0 | random | | 0.1 | 6 | 1300 | 3 |
triazine_sEH | FP-FFNN_pt | 1 | random | 128, 128 | 0.35 | | | |
triazine_SIRT2 | OH-FFNN | 1 | random | 512, 256, 128 | 0.45 | | | |
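To make the link between the two requirements concrete, here is a minimal sketch that reads a hyperparameter csv with the column headers shown above and derives the saved-model path each row would correspond to. It assumes a comma-delimited file with quoted "layer sizes" cells; the helper names parse_layer_sizes and model_path are hypothetical, and the evaluation scripts may resolve paths differently.

```python
import csv
import io
from pathlib import PurePosixPath

# Example rows mirroring the table above (comma-delimited layout is an assumption).
EXAMPLE_CSV = """dataset,model type,seed,split,layer sizes,dropout,depth,hidden size,FFN num layers
DD1S_CAIX,D-MPNN,0,random,,0.1,6,1300,3
triazine_sEH,FP-FFNN_pt,1,random,"128, 128",0.35,,,
triazine_SIRT2,OH-FFNN,1,random,"512, 256, 128",0.45,,,
"""


def parse_layer_sizes(cell):
    """Turn a cell such as '128, 128' into [128, 128]; an empty cell gives []."""
    return [int(part) for part in cell.split(",") if part.strip()]


def model_path(row, root="models"):
    """Path of the saved model for one csv row, following the folder layout above."""
    filename = f"{row['split']}_seed_{row['seed']}.torch"
    return PurePosixPath(root) / row["dataset"] / row["model type"] / filename


rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
for row in rows:
    print(model_path(row), parse_layer_sizes(row["layer sizes"]), row["dropout"])
```

Cells that do not apply to a model type (for example, layer sizes for a D-MPNN) are left empty, which is why the parser returns an empty list for them.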
For each of the above scripts, each run evaluates all regression models (only on random splits for evaluating models as binary classifiers) for a given dataset and model type. bin_eval.py and bin_eval_multiple_thresholds.py evaluate the models as binary classifiers; mse_loss_eval.py evaluates models using MSE loss; rank_corr_coeff_eval.py evaluates models using Spearman's rank correlation coefficient.
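Evaluating a regression model as a binary classifier needs ground-truth labels, which come from a threshold on the training-set values. The sketch below shows one way a 'percentile' or 'fixed' threshold could be turned into labels. It is not the repository's implementation: the function names are hypothetical, the percentile uses linear interpolation, and treating values strictly above the cutoff as enriched is an assumption.

```python
def percentile(values, q):
    """q-th percentile (0 to 100), linearly interpolated between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    pos = (q / 100) * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def binary_labels(train_targets, eval_targets, threshold_type="percentile", threshold_val=99.5):
    """Label compounds as enriched (1) or not (0) relative to a training-set cutoff."""
    if threshold_type == "percentile":
        cutoff = percentile(train_targets, threshold_val)
    elif threshold_type == "fixed":
        cutoff = threshold_val
    else:
        raise ValueError(f"unknown threshold_type: {threshold_type!r}")
    # Assumption: strictly greater than the cutoff counts as enriched.
    return cutoff, [int(t > cutoff) for t in eval_targets]


# Toy training set of 1000 distinct values; 99.5 marks the top 0.5% as enriched.
train = [float(i) for i in range(1, 1001)]
cutoff, labels = binary_labels(train, train)
print(cutoff, sum(labels))  # 5 of the 1000 compounds are labeled enriched
```

With a 'fixed' threshold, threshold_val is used directly as the cutoff instead of being looked up in the training distribution.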
Additional command-line arguments include:
--model_type <model type>
    'D-MPNN'
    'D-MPNN_pt'
    'FP-FFNN'
    'FP-FFNN_pt'
    'OH-FFNN'
    'OH-FFNN_pt'
--hyperparams </path/to/hyperparameter_values_of_saved_models.csv from experiments folder>
For D-MPNN and D-MPNN_pt models:
--num_workers <number of workers for loading data>
Specifically for bin_eval.py and bin_eval_multiple_thresholds.py:
--random_split_only <True or False>
    True
--random_guess <True or False>
    set --model_type to 'FP-FFNN' when generating random-guess baseline
--threshold_type <type of threshold for determining ground truth labels>
    'percentile'
    'fixed'
    'percentile'
--threshold_val <threshold value; percentile or exact value>
    99.5 to define the top 0.5% of training set compounds as enriched
Specifically for bin_eval_multiple_thresholds.py:
--num_thresholds <number of (logarithmically spaced) thresholds to try>
    20
--start_idx <threshold index, from 0 to 19, to start at>
    1 or higher if resuming a job
--stop_idx <threshold index, from 1 to 20, to stop at>
    19 or lower if stopping early
(1) To train and evaluate baseline k-nearest-neighbors (KNN) regression models, run the following scripts for each dataset:
knn_train_DD1S_CAIX.py, specifying the type of molecule featurization (--featurizer <'onehot' or 'fingerprint'>) and the number of neighbors (--n_neighbors <number of neighbors; values of 1, 3, 5, 7, 9 were tested>). The script iterates through all data splits and random seeds, and saves the trained models in the results folder.
Create experiments/models/DD1S_CAIX/OH-KNN (for the onehot KNNs) and experiments/models/DD1S_CAIX/FP-KNN (for the fingerprint-based KNNs), with subfolders k_1, k_3, k_5, k_7, k_9 (for the different tested values of n_neighbors); move the corresponding saved models (.joblib files) to these folders. The filename for each saved model should be of the form <data split>_seed_<random seed>.joblib, where the possible data split names are random, cycle1, ..., cycle12, ..., cycle123.
DD1S_CAIX_knn_eval.py, specifying the metric used for evaluation (--eval_metric <'NLL', 'MSE', or 'rank_corr_coeff'>), the type of featurization (--featurizer <'onehot' or 'fingerprint'>), the data split (--splitter), and random seed (--seed). The script iterates through all tested n_neighbors values of 1, 3, 5, 7, 9.
knn_train_triazine.py, specifying the type of molecule featurization (--featurizer <'onehot' or 'fingerprint'>), data split (--splitter), and random seed (--seed). To reproduce results, keep the default value of 9 for --n_neighbors. The script saves the trained model in the results folder.
Create experiments/models/<dataset>/OH-KNN/k_9 and experiments/models/<dataset>/FP-KNN/k_9 (where <dataset> is triazine_sEH or triazine_SIRT2, and the subfolder k_9 refers to the fixed value of 9 for n_neighbors); move the corresponding saved models (.joblib files) to these folders. KNNs were trained only for random seed 0 on the triazine datasets; the filename for each saved model should be of the form <data split>_seed_0.joblib.
triazine_knn_generate_test_preds.py to generate and save test-set predictions, specifying the metric used for evaluation (--eval_metric <'NLL', 'MSE', or 'rank_corr_coeff'>), the type of featurization (--featurizer <'onehot' or 'fingerprint'>), the data split (--splitter), and random seed (--seed). The script uses 9 as the value for n_neighbors. The generated test-set predictions are saved in the results folder.
triazine_knn_eval_on_test_preds.py, specifying the path to the saved test-set predictions (from the /experiments/results folder), the metric used for evaluation (--eval_metric <'NLL', 'MSE', or 'rank_corr_coeff'>), and the data split (--splitter).
(2) To run random baselines, run the script random_baseline.py, specifying via --hyperparams the filename of a csv (in the experiments folder) with the hyperparameter values of the saved models (for the specific format of the hyperparameters file, see (2) under "Evaluating trained regression models as binary classifiers, or using MSE loss/rank correlation coefficient"). Also specify the data split (--splitter), the type of random baseline (--random_type <'shuffle_preds' or 'predict_all_ones'>), and the metric used for model evaluation (--eval_metric <'NLL', 'MSE', or 'rank_corr_coeff'>).
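For intuition about the two --random_type options, the sketch below shows what such baselines could look like when scored with MSE or Spearman's rank correlation coefficient. It is an illustration under assumptions, not the code in random_baseline.py: the function names are hypothetical, 'shuffle_preds' is taken to mean randomly permuting a model's predictions, and 'predict_all_ones' is taken to mean predicting the constant 1 for every compound.

```python
import math
import random


def random_baseline(preds, random_type, seed=0):
    """Return baseline predictions of the same length as preds."""
    if random_type == "shuffle_preds":
        shuffled = list(preds)
        random.Random(seed).shuffle(shuffled)  # break the pairing with the targets
        return shuffled
    if random_type == "predict_all_ones":
        return [1.0] * len(preds)
    raise ValueError(f"unknown random_type: {random_type!r}")


def mse(preds, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(preds, targets):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rp, rt = average_ranks(preds), average_ranks(targets)
    mp, mt = sum(rp) / len(rp), sum(rt) / len(rt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    if var_p == 0 or var_t == 0:
        return math.nan  # undefined when one side is constant
    return cov / math.sqrt(var_p * var_t)


preds = [0.5, 2.0, 1.0, 3.0]     # toy model predictions
targets = [1.0, 2.0, 1.5, 4.0]   # toy ground-truth values
for random_type in ("shuffle_preds", "predict_all_ones"):
    baseline = random_baseline(preds, random_type)
    print(random_type, mse(baseline, targets), spearman(baseline, targets))
```

Note that a constant prediction has no rank ordering, so the rank correlation is undefined for 'predict_all_ones' in this sketch and is reported as NaN.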
Scripts and notebooks for visualizations can be found in the experiments/visualizations folder.
To visualize atomic contributions to the predictions of a trained fingerprint-based model, run:
python visualize_smis.py --model_path </path/to/saved_model.torch from experiments folder> --cpd_ids <compound ID(s) of the compound(s) to visualize> --csv </path/to/data.csv from experiments/datasets folder> --fps_h5 </path/to/file_with_stored_fingerprints.h5 from experiments folder> --out <label for results subfolder> --layer_sizes <hidden layer sizes> --dropout <dropout rate>
where --layer_sizes and --dropout are hyperparameters of the saved model.
This visualization requires:
compiled_results.xlsx in the experiments/paper_results folder
To run the visualization, run all cells in Single substructure analysis.ipynb and Substructure pair analysis.ipynb (note: the substructure pair analysis depends on results from running the single substructure analysis).
Fingerprint and model filenames/paths, etc. may be modified as necessary in the fourth cell from the top in each of the Jupyter notebooks. Information about the bits and substructures is printed out in the notebook; visualizations of the substructures are saved to png files, along with histograms and bar graphs.
For each dataset in the single substructure analysis, a command-line alternative to running the first two cells under "Get and visualize substructures" is to run:
python single_substructure_analysis_get_substructures.py --csv </path/to/data.csv from experiments/datasets folder> --fps_h5 </path/to/file_with_stored_fingerprints.h5 from experiments folder> --dataset_label <'DD1S_CAIX', 'triazine_sEH', or 'triazine_SIRT2'> --seed <random seed for data splitting and weight initialization (only used for file naming in this context)> --bits_of_interest <bits of interest (top 5 and bottom 3 bits)>
After running this script, the rest of the single substructure analysis for the given dataset can be resumed in the Jupyter notebook, starting with the third cell (currently commented-out) under "Get and visualize substructures."
In the experiments/visualizations folder:
Generate 4096-bit fingerprints for PubChem (using pubchem_smiles.npy for the compounds' SMILES strings) by running python generate_pubchem_fps.py. Also generate 4096-bit fingerprints for DOS-DEL-1 and the triazine library, using the script fps_preprocessing.py in the experiments folder (for instructions, see the "Initial set-up" section above). The resulting files with stored fingerprints should be moved to the experiments/visualizations folder.
python UMAP.py --num_threads <number of threads> --pubchem_fps_h5 <name of HDF5 file with stored 4096-bit fingerprints for PubChem> --DD1S_fps_h5 <name of HDF5 file with stored 4096-bit fingerprints for DOS-DEL-1> --triazine_fps_h5 <name of HDF5 file with stored 4096-bit fingerprints for the triazine library>
UMAP plots.ipynb
Loss function plots.ipynb
DD1S_CAIX_NLL_test_losses.csv, DD1S_CAIX_MSE_test_losses.csv, DD1S_CAIX_rank_corr_coeffs.csv, triazine_sEH_NLL_test_losses.csv, triazine_sEH_MSE_test_losses.csv, triazine_sEH_rank_corr_coeffs.csv, triazine_SIRT2_NLL_test_losses.csv, triazine_SIRT2_MSE_test_losses.csv, triazine_SIRT2_rank_corr_coeffs.csv, triazine_multitask_sEH_test_losses.csv, and triazine_multitask_SIRT2_test_losses.csv in the experiments folder
Column headers: model type, (seed), split (data split names recorded as random, cycle1, ..., cycle12, ..., cycle123), and test performance (loss or rank correlation coefficient on the test set). The model type names should be recorded as OH-FFNN, OH-FFNN_pt, FP-FFNN, FP-FFNN_pt, D-MPNN, and D-MPNN_pt
Column headers: model type, (seed), split (data split names recorded as random, cycle123), and test performance (loss on the test set). The model type names should be recorded as OH-FFNN_single-task, FP-FFNN_single-task, D-MPNN_single-task, OH-FFNN_multi-task, FP-FFNN_multi-task, and D-MPNN_multi-task
Test performance bar graphs and scatter plots.ipynb
DD1S CAIX hyperparameter optimization result histograms
DD1S_CAIX_hyperparameter_optimization_results.csv (for the D-MPNN / D-MPNN_pt regression models) and bin_DD1S_CAIX_hyperparameter_optimization_results.csv (for the random split D-MPNN binary classifiers) in the experiments folder, each formatted like the following (example values are shown):
model type | seed | split | depth | FFN num layers | hidden size | dropout |
---|---|---|---|---|---|---|
D-MPNN | 0 | random | 6 | 3 | 1300 | 0.10 |
DD1S CAIX hyperparameter optimization result histograms.ipynb in the experiments folder
DD1S_CAIX_KNN_k_optimization_results.csv in the experiments folder, with column headers model type (OH-KNN or FP-KNN), seed (0, 1, 2, 3, or 4), metric (NLL, MSE, or rank corr coeff), split (random, cycle1, ..., cycle12, ..., cycle123), and k (optimized number of neighbors)
DD1S CAIX KNN k optimization result histograms.ipynb
compiled_results.xlsx in the experiments/paper_results folder
DD1S CAIX histograms and parity plots.ipynb (can modify fingerprint and model filenames/paths as necessary, in the fourth cell from the top)
compiled_results.xlsx in the experiments/paper_results folder
DD1S CAIX disynthon parity plots and 1D histograms.ipynb (can modify fingerprint and model filenames/paths as necessary, in the fourth cell from the top)
compiled_results.xlsx in the experiments/paper_results folder
Triazine parity plots.ipynb (can modify fingerprint and model filenames/paths as necessary, in the fourth cell from the top)
compiled_results.xlsx in the experiments/paper_results folder
Triazine disynthon parity plots.ipynb (can modify fingerprint and model filenames/paths as necessary, in the fourth cell from the top)
DD1S CAIX cycle 2 distributional shift.ipynb
bin_AUCs.csv in the experiments folder. Column headers should be dataset, model type, seed, PR AUC, and ROC AUC; datasets should be recorded as DD1S CAIX, triazine sEH, triazine SIRT2; model types should be recorded as OH-FFNN, OH-FFNN pt, OH-FFNN bin, FP-FFNN, FP-FFNN pt, FP-FFNN bin, D-MPNN, D-MPNN pt, D-MPNN bin, Random guess
Fixed threshold bin plots.ipynb
AUCs_multiple_thresholds.csv in the experiments folder. Column headers should be dataset, model type, top percent, seed, PR AUC, and ROC AUC; datasets should be recorded as DD1S CAIX, triazine sEH, triazine SIRT2; model types should be recorded as OH-FFNN, OH-FFNN pt, FP-FFNN, FP-FFNN pt, D-MPNN, D-MPNN pt, Random guess; a top percent value of x (for example) means that the top x% of training set compounds are defined as enriched.
Multiple thresholds bin plots.ipynb
compiled_results.xlsx in the experiments/paper_results folder
Triazine 2D histograms.ipynb (can modify fingerprint and model filenames/paths as necessary, in the fourth cell from the top)
compiled_results.xlsx in the experiments/paper_results folder
Triazine generalization parity plots.ipynb (can modify fingerprint and model filenames/paths as necessary, in the fourth cell from the top)
1) To train an FP-KNN on the entire dataset, run FP-KNN_train_on_all_DD1S_CAIX.py (specify a filename for the results folder with --out; otherwise, leave the command-line arguments as the default values). The script saves the trained model in a subfolder of the results folder.
2) To walk through the process used to identify example outliers in the DD1S CAIX dataset, see Identifying DD1S CAIX outliers.ipynb in the experiments folder.
3) Run DD1S_CAIX_outliers_get_nearest_neighbor.py (use --model_path to specify the path, from the experiments folder, to the saved FP-KNN trained on the entire DD1S CAIX dataset) to obtain the index and SMILES string of the nearest neighbor in the DD1S CAIX dataset for each example outlier.