coleygroup / del_qsar

MIT License

DEL QSAR

Prediction of enrichment from molecular structure, using DNA-encoded libraries and machine learning with a probabilistic loss function

Dependencies

Instructions

Initial set-up

Training and evaluating models

To train/evaluate models, navigate to the experiments folder. The following is a general command for running one of the scripts:

python <script_name.py> --csv </path/to/data.csv from experiments/datasets folder> --out <experiment label (name of results subfolder to create)> --device <device (set to cuda:0 if using GPU)>
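When launching many runs, the general command above can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the script name, dataset path, and experiment label in the usage example are placeholders. Only the `--csv`, `--out`, and `--device` flags come from the command above.

```python
import shlex
import sys


def build_command(script, csv_path, out_label, device):
    """Return the argument list for one run of an experiments script.

    Hypothetical helper: it only mirrors the general command shown above.
    """
    return [
        sys.executable,      # the current Python interpreter
        script,              # a script in the experiments folder
        "--csv", csv_path,   # path relative to the experiments/datasets folder
        "--out", out_label,  # name of the results subfolder to create
        "--device", device,  # "cuda:0" if using a GPU
    ]


# Placeholder names, for illustration only.
cmd = build_command("some_script.py", "some_data.csv", "my_experiment", "cuda:0")
print(shlex.join(cmd[1:]))

# To launch the run from the experiments folder:
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one string) avoids quoting problems when paths contain spaces.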

Additional command-line arguments include:

The script depends on the dataset/model type/task. Further details are provided below, including more specific command-line arguments.

Regression models and binary classifiers:

Scripts:

Additional command-line arguments include:

For fingerprint featurization (if reading from an HDF5 file with stored fingerprints):

For hyperparameter optimization (including triazine_MPN_LR_tuning.py):

For single_model_run.py:

For directed message passing neural networks:

For directed message passing neural networks specifically on the triazine sEH and SIRT2 datasets:

For triazine_MPN.py:

For binary classifiers:

Evaluating trained regression models as binary classifiers, or using MSE loss/rank correlation coefficient

Scripts:

These scripts require (1) an experiments/models folder with saved regression models (.torch files) organized by dataset/model type and named by data split/seed, as follows (for brevity, only filenames for the random splits are shown; cycle-split models should also be included, replacing random with cycle1, ..., cycle12, ..., cycle123 in the filename):

└── models
    ├── DD1S_CAIX
    │   ├── D-MPNN
    │   │   ├── random_seed_0.torch
    │   │   ├── random_seed_1.torch
    │   │   ├── random_seed_2.torch
    │   │   ├── random_seed_3.torch
    │   │   └── random_seed_4.torch
    │   │
    │   ├── D-MPNN_pt
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   ├── FP-FFNN
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   ├── FP-FFNN_pt
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   ├── OH-FFNN
    │   │   └── (same as for models/DD1S_CAIX/D-MPNN)
    │   │
    │   └── OH-FFNN_pt
    │       └── (same as for models/DD1S_CAIX/D-MPNN)
    │
    ├── triazine_sEH
    │   ├── D-MPNN
    │   │   ├── random_seed_0.torch
    │   │   ├── random_seed_1.torch
    │   │   └── random_seed_2.torch
    │   │
    │   ├── D-MPNN_pt
    │   │   └── (same as for models/triazine_sEH/D-MPNN)
    │   │
    │   ├── FP-FFNN
    │   │   ├── random_seed_0.torch
    │   │   ├── random_seed_1.torch
    │   │   ├── random_seed_2.torch
    │   │   ├── random_seed_3.torch
    │   │   └── random_seed_4.torch
    │   │
    │   ├── FP-FFNN_pt
    │   │   └── (same as for models/triazine_sEH/FP-FFNN)
    │   │
    │   ├── OH-FFNN
    │   │   └── (same as for models/triazine_sEH/FP-FFNN)
    │   │
    │   └── OH-FFNN_pt
    │       └── (same as for models/triazine_sEH/FP-FFNN)
    │
    ├── triazine_SIRT2
    │   └── (same as for models/triazine_sEH)
    │
    └── triazine_sEH_SIRT2_multi-task
        ├── D-MPNN
        │   ├── random_seed_0.torch
        │   ├── random_seed_1.torch
        │   └── random_seed_2.torch
        │
        ├── FP-FFNN
        │   ├── random_seed_0.torch
        │   ├── random_seed_1.torch
        │   ├── random_seed_2.torch
        │   ├── random_seed_3.torch
        │   └── random_seed_4.torch
        │
        └── OH-FFNN
            └── (same as for models/triazine_sEH_SIRT2_multi-task/FP-FFNN)

(2) a csv with the hyperparameter values of the saved models, formatted like the following (example values are shown):

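| dataset        | model type | seed | split  | layer sizes   | dropout | depth | hidden size | FFN num layers |
|----------------|------------|------|--------|---------------|---------|-------|-------------|----------------|
| DD1S_CAIX      | D-MPNN     | 0    | random |               | 0.1     | 6     | 1300        | 3              |
| triazine_sEH   | FP-FFNN_pt | 1    | random | 128, 128      | 0.35    |       |             |                |
| triazine_SIRT2 | OH-FFNN    | 1    | random | 512, 256, 128 | 0.45    |       |             |                |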
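To make the expected file contents concrete, the example rows can be written out and parsed as in the sketch below. This is an illustration only: it assumes the file is comma-delimited, that multi-valued layer sizes are quoted, and that hyperparameters which do not apply to a model type are left empty. `parse_hyperparams` is a hypothetical helper, and the repository's scripts may read the file differently.

```python
import csv
import io

# Assumed on-disk form of the example rows: comma-delimited, with
# multi-valued "layer sizes" quoted and unused fields left empty.
SAMPLE = """\
dataset,model type,seed,split,layer sizes,dropout,depth,hidden size,FFN num layers
DD1S_CAIX,D-MPNN,0,random,,0.1,6,1300,3
triazine_sEH,FP-FFNN_pt,1,random,"128, 128",0.35,,,
triazine_SIRT2,OH-FFNN,1,random,"512, 256, 128",0.45,,,
"""


def parse_hyperparams(text):
    """Parse hyperparameter rows, converting non-empty fields to numbers."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {
            "dataset": raw["dataset"],
            "model type": raw["model type"],
            "seed": int(raw["seed"]),
            "split": raw["split"],
            "dropout": float(raw["dropout"]),
        }
        # Feed-forward models list their hidden layer sizes.
        if raw["layer sizes"]:
            row["layer sizes"] = [int(s) for s in raw["layer sizes"].split(",")]
        # Message passing models use these three fields instead.
        for key in ("depth", "hidden size", "FFN num layers"):
            if raw[key]:
                row[key] = int(raw[key])
        rows.append(row)
    return rows


rows = parse_hyperparams(SAMPLE)
print(rows[1]["layer sizes"])
```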

For each of the above scripts, each run evaluates all regression models (only on random splits for evaluating models as binary classifiers) for a given dataset and model type. bin_eval.py and bin_eval_multiple_thresholds.py evaluate the models as binary classifiers; mse_loss_eval.py evaluates models using MSE loss; rank_corr_coeff_eval.py evaluates models using Spearman's rank correlation coefficient.
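Because each run loads every saved model for a dataset and model type, it can help to confirm beforehand that the models folder follows the layout in (1). The sketch below is a hypothetical pre-flight check, not part of the repository; `expected_model_paths` and `missing_models` are illustrative names, and the split names and seed count must be supplied by the caller.

```python
from pathlib import Path


def expected_model_paths(models_dir, dataset, model_type, splits, num_seeds):
    """List the model files expected for one dataset/model type."""
    folder = Path(models_dir) / dataset / model_type
    return [
        folder / f"{split}_seed_{seed}.torch"
        for split in splits
        for seed in range(num_seeds)
    ]


def missing_models(paths):
    """Return the expected paths that do not exist on disk."""
    return [p for p in paths if not p.is_file()]


# Example: D-MPNN models for triazine_sEH (three seeds in the layout above),
# checked for the random split and one of the cycle splits.
paths = expected_model_paths(
    "models", "triazine_sEH", "D-MPNN", ["random", "cycle1"], num_seeds=3
)
for path in missing_models(paths):
    print("missing:", path)
```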

Additional command-line arguments include:

For D-MPNN and D-MPNN_pt models:

Specifically for bin_eval.py and bin_eval_multiple_thresholds.py:

Specifically for bin_eval_multiple_thresholds.py:

KNN and random baselines

(1) To train and evaluate baseline k-nearest-neighbors (KNN) regression models, run the following scripts for each dataset:

(2) To run random baselines, run the script random_baseline.py, specifying via --hyperparams the filename of a csv (in the experiments folder) with the hyperparameter values of the saved models (for the specific format of the hyperparameters file, see (2) under "Evaluating trained regression models as binary classifiers, or using MSE loss/rank correlation coefficient"). Also specify the data split (--splitter), the type of random baseline (--random_type <'shuffle_preds' or 'predict_all_ones'>), and the metric used for model evaluation (--eval_metric <'NLL', 'MSE', or 'rank_corr_coeff'>).
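The two `--random_type` options can be pictured as follows. This is a sketch of one plausible reading of the option names, not the repository's implementation: it assumes that shuffle_preds randomly permutes a model's predictions across compounds and that predict_all_ones predicts a value of 1 for every compound. `make_baseline` and `mse` are hypothetical names, and the numbers are toy values.

```python
import random


def make_baseline(preds, random_type, seed=0):
    """Return baseline predictions (assumed semantics, see the text above)."""
    if random_type == "shuffle_preds":
        shuffled = list(preds)
        # Break the pairing between compounds and their predictions.
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if random_type == "predict_all_ones":
        return [1.0] * len(preds)
    raise ValueError(f"unknown random_type: {random_type!r}")


def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Toy values, for illustration only.
preds = [0.5, 2.0, 1.5, 3.0]
targets = [0.4, 2.2, 1.0, 3.5]
for random_type in ("shuffle_preds", "predict_all_ones"):
    baseline = make_baseline(preds, random_type)
    print(random_type, mse(baseline, targets))
```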

Visualizations

Scripts and notebooks for visualizations can be found in the experiments/visualizations folder.

Atom-centered Gaussian visualizations for fingerprint-based models

To visualize atomic contributions to the predictions of a trained fingerprint-based model, run:

python visualize_smis.py --model_path </path/to/saved_model.torch from experiments folder> --cpd_ids <compound ID(s) of the compound(s) to visualize> --csv </path/to/data.csv from experiments/datasets folder> --fps_h5 </path/to/file_with_stored_fingerprints.h5 from experiments folder> --out <label for results subfolder> --layer_sizes <hidden layer sizes> --dropout <dropout rate>

where --layer_sizes and --dropout are hyperparameters of the saved model.

Bit and substructure analysis

This visualization requires:

To run the visualization, run all cells in Single substructure analysis.ipynb and Substructure pair analysis.ipynb (note: the substructure pair analysis depends on results from running the single substructure analysis).

Fingerprint and model filenames/paths, etc. may be modified as necessary in the fourth cell from the top in each of the Jupyter notebooks. Information about the bits and substructures is printed out in the notebook; visualizations of the substructures are saved to png files, along with histograms and bar graphs.

For each dataset in the single substructure analysis, a command-line alternative to running the first two cells under "Get and visualize substructures" is to run:

python single_substructure_analysis_get_substructures.py --csv </path/to/data.csv from experiments/datasets folder> --fps_h5 </path/to/file_with_stored_fingerprints.h5 from experiments folder> --dataset_label <'DD1S_CAIX', 'triazine_sEH', or 'triazine_SIRT2'> --seed <random seed for data splitting and weight initialization (only used for file naming in this context)> --bits_of_interest <bits of interest (top 5 and bottom 3 bits)>

After running this script, the rest of the single substructure analysis for the given dataset can be resumed in the Jupyter notebook, starting with the third cell (currently commented-out) under "Get and visualize substructures."

UMAP

  1. Navigate to the experiments/visualizations folder
  2. Generate 4096-bit fingerprints for the PubChem compounds (see pubchem_smiles.npy for the compounds' SMILES strings) by running python generate_pubchem_fps.py. Also generate 4096-bit fingerprints for DOS-DEL-1 and the triazine library, using the script fps_preprocessing.py in the experiments folder (for instructions, see the "Initial set-up" section above). The resulting files with stored fingerprints should be moved to the experiments/visualizations folder.
  3. To train and apply a UMAP embedding, run:
    python UMAP.py --num_threads <number of threads> --pubchem_fps_h5 <name of HDF5 file with stored 4096-bit fingerprints for PubChem> --DD1S_fps_h5 <name of HDF5 file with stored 4096-bit fingerprints for DOS-DEL-1> --triazine_fps_h5 <name of HDF5 file with stored 4096-bit fingerprints for the triazine library>
  4. To generate UMAP plots, run all cells in UMAP plots.ipynb

Various plots

Outliers for the DD1S CAIX dataset

  1. To train an FP-KNN on the entire dataset, run FP-KNN_train_on_all_DD1S_CAIX.py (specify a filename for the results folder with --out; otherwise, leave the command-line arguments at their default values). The script saves the trained model in a subfolder of the results folder.
  2. To walk through the process used to identify example outliers in the DD1S CAIX dataset, see Identifying DD1S CAIX outliers.ipynb in the experiments folder.
  3. Run DD1S_CAIX_outliers_get_nearest_neighbor.py (use --model_path to specify the path, from the experiments folder, to the saved FP-KNN trained on the entire DD1S CAIX dataset) to obtain the index and SMILES string of the nearest neighbor in the DD1S CAIX dataset for each example outlier.
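The last step amounts to a nearest-neighbor lookup in fingerprint space. A minimal sketch of that idea follows, assuming binary fingerprints compared by Tanimoto similarity; the distance metric actually used by the saved FP-KNN is not stated here and may differ. The function names and toy fingerprints (sets of on-bit indices) are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def nearest_neighbor(query_idx, fingerprints):
    """Index of the most similar fingerprint, excluding the query itself."""
    query = fingerprints[query_idx]
    candidates = (i for i in range(len(fingerprints)) if i != query_idx)
    return max(candidates, key=lambda i: tanimoto(query, fingerprints[i]))


# Toy fingerprints, for illustration only; index 0 plays the outlier.
fps = [{1, 5, 9}, {1, 5, 8}, {2, 3}, {1, 5, 9, 12}]
print(nearest_neighbor(0, fps))
```

With the index in hand, the neighbor's SMILES string can be read from the same row of the dataset.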