PatWalters / TS

Thompson Sampling
MIT License

Thompson Sampling for Virtual Screening

This repo accompanies our paper "Thompson Sampling: An Efficient Method for Searching Ultralarge Synthesis on Demand Databases".

Thompson Sampling is an active learning strategy that balances the tradeoff between exploitation and exploration. The code in this repository implements Thompson Sampling as an efficient search algorithm for screening large, un-enumerated libraries such as Enamine REAL Space.

This implementation of Thompson Sampling can be run on any un-enumerated library composed of reactions and reagents. To run a virtual screen using Thompson Sampling, start by selecting an un-enumerated library to search and a screening objective to maximize or minimize, e.g. 2D similarity, 3D similarity (such as OpenEye's ROCS), or docking. Then choose a query molecule (for 2D and 3D shape similarity) or a target protein structure (for docking).

The algorithm begins with a "warmup" period that constructs a prior distribution for the expected value of each reagent in the library. We model the scores produced by any reagent in the library as a normal distribution whose expected value (mean) we are trying to estimate, assuming the standard deviation of the distribution is known. During warmup, each reagent is randomly sampled n times, where sampling a reagent means making and scoring a molecule with that reagent. The prior distribution is then constructed by taking the mean and standard deviation of the scores from the n random samples.
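The warmup step can be illustrated with a short sketch. This is not the repository's Reagent class; the function name warm_up, the score_fn callable, and the choice to credit each score only to the reagent being sampled are assumptions made for illustration.

```python
import random
import statistics


def warm_up(reagent_lists, score_fn, n_warmup=3, seed=0):
    """Estimate a (mean, std) prior for every reagent from random samples.

    reagent_lists: one list of reagent identifiers per reaction component.
    score_fn: hypothetical callable mapping a tuple of reagents (one per
        component) to a float score, standing in for make-and-score.
    Returns a dict keyed by (component_index, reagent).
    """
    rng = random.Random(seed)
    priors = {}
    for c, component in enumerate(reagent_lists):
        for reagent in component:
            scores = []
            for _ in range(n_warmup):
                # Fix this reagent and pick random partners for the
                # other reaction components.
                combo = tuple(
                    reagent if i == c else rng.choice(other)
                    for i, other in enumerate(reagent_lists)
                )
                scores.append(score_fn(combo))
            # The sample standard deviation needs at least two scores.
            std = statistics.stdev(scores) if len(scores) > 1 else 0.0
            priors[(c, reagent)] = (statistics.mean(scores), std)
    return priors
```

With a two-component library, warm_up returns one (mean, std) pair per reagent, and each pair summarizes the n_warmup random products that contained that reagent.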

Next we repeat the following n times:

The scores and SMILES string for each molecule made and scored are saved and (optionally) written to a file.
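For orientation, a generic Thompson Sampling loop over reagents might look like the sketch below. It assumes a normal prior on each reagent's mean score with known observation variance, and it is a textbook form of the technique rather than a transcription of thompson_sampling.py. The names normal_update, thompson_search, score_fn, and obs_var are hypothetical.

```python
import random


def normal_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate update of a normal belief about a mean after one
    observation whose variance (obs_var) is assumed known."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var


def thompson_search(reagent_lists, priors, score_fn,
                    n_iter=100, obs_var=1.0, seed=0):
    """Maximize score_fn over reagent combinations.

    priors: {(component_index, reagent): (mean, variance)}, variance > 0.
    score_fn: hypothetical callable scoring a tuple of reagents.
    Returns the list of (score, combo) pairs and the final beliefs.
    """
    rng = random.Random(seed)
    beliefs = dict(priors)
    results = []
    for _ in range(n_iter):
        combo = []
        for c, component in enumerate(reagent_lists):
            # Draw one plausible mean per reagent from its current
            # belief and keep the reagent with the highest draw.
            draws = {
                r: rng.gauss(beliefs[(c, r)][0], beliefs[(c, r)][1] ** 0.5)
                for r in component
            }
            combo.append(max(draws, key=draws.get))
        combo = tuple(combo)
        score = score_fn(combo)
        results.append((score, combo))
        # Update the belief of every reagent used in this product.
        for c, r in enumerate(combo):
            mean, var = beliefs[(c, r)]
            beliefs[(c, r)] = normal_update(mean, var, score, obs_var)
    return results, beliefs
```

Sampling from each belief instead of always taking the highest mean is what provides exploration: a reagent with few observations has a wide distribution and is occasionally drawn high. To minimize an objective, negate the score. The sketch does not skip products that were already sampled, which the repository tracks in disallow_tracker.py.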

ts_main.py - The main file for running Thompson Sampling via command line.

reagent.py - Contains the Reagent class which constructs and updates the prior distribution.

baseline.py - Generates brute force or random comparisons.

evaluators.py - Contains the evaluation functions.

disallow_tracker.py - Contains the class for keeping track of sampled products.

thompson_sampling.py - Contains the ThompsonSampling class that runs Thompson Sampling.

Setting up the environment for running Thompson Sampling

Create a new conda environment and install rdkit:

conda create -c conda-forge -n <your-env-name> rdkit

Activate your environment and install the rest of the requirements:

conda activate <your-env-name>
pip install -r requirements.txt

Optionally, install the OpenEye toolkits to use the ROCS scoring function.

Construct a JSON file with the desired parameters for your run; see the example JSON files in the examples directory. Required and optional parameters are explained below.

How to run Thompson Sampling

python ts_main.py <path-to-json-params-file>.json

Or try one of the example queries:

python ts_main.py examples/amide_fp_sim.json

or

python ts_main.py examples/quinazoline_fp_sim.json

Parameters

Required params:

Optional params: