broadinstitute / adapt

A package for designing activity-informed nucleic acid diagnostics for viruses.

ADAPT

Activity-informed Design with All-inclusive Patrolling of Targets

ADAPT efficiently designs activity-informed nucleic acid diagnostics for viruses.

In particular, ADAPT designs assays with maximal predicted detection activity, in expectation over a virus's genomic diversity, subject to soft and hard constraints on the assay's complexity and specificity. ADAPT's designs are:


ADAPT outputs a list of assay options ranked by predicted performance. In addition to its objective that maximizes expected activity, ADAPT supports a simpler objective that minimizes the number of probes subject to detecting a specified fraction of diversity.

ADAPT includes a pre-trained model that predicts CRISPR-Cas13a guide detection activity, so ADAPT is directly suited to detection with Cas13a. ADAPT's output also includes amplification primers, e.g., for use with the SHERLOCK platform. The framework and software are compatible with other nucleic acid technologies given appropriate models.

For more information, see our publication that describes ADAPT and evaluates its designs experimentally.


Setting up ADAPT

Dependencies

ADAPT requires:

Using the thermodynamic modules of ADAPT requires:

Using ADAPT with AWS cloud features additionally requires:

Installing ADAPT with pip, as described below, will install NumPy, SciPy, and TensorFlow if they are not already installed. Installing with the thermodynamic modules will also install Primer3-py, and installing with the AWS cloud features will also install Boto3 and Botocore, if they are not already installed.

If using alignment features in subcommands below, ADAPT also requires a path to an executable of MAFFT.

Setting up a conda environment

Note: This section is optional, but may be useful to users who are new to Python.

It is generally useful to install and run Python packages inside of a virtual environment, especially if you have multiple versions of Python installed or use multiple packages. This can prevent problems when upgrading, conflicts between packages with different requirements, installation issues that arise from having different Python versions available, and more.

One option to manage packages and environments is to use conda. A fast way to obtain conda is to install Miniconda: you can download it here and find installation instructions for it here. For example, on Linux you would run:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Once you have conda, you can create an environment for ADAPT with Python 3.8:

conda create -n adapt python=3.8

Then, you can activate the adapt environment:

conda activate adapt

After the environment is created and activated, you can install ADAPT as described below. You will need to activate the environment each time you use ADAPT.

Downloading and installing

ADAPT is available via Bioconda for GNU/Linux and Windows operating systems and via PyPI for all operating systems.

Before installing ADAPT via Bioconda, we suggest you follow the instructions in Setting up a conda environment to install Miniconda and activate the environment. To install via Bioconda, run the following command:

conda install -c bioconda adapt

If you want to be able to use thermodynamic modules of ADAPT, run the following instead:

conda install -c bioconda "adapt[thermo]"

If you want to be able to use AWS cloud features through ADAPT, run the following instead:

conda install -c bioconda "adapt[AWS]"

For both AWS and thermodynamics, run the following instead:

conda install -c bioconda "adapt[AWS,thermo]"

Before installing ADAPT via PyPI, we suggest you follow the instructions in either the Python documentation or Setting up a conda environment to set up and activate a virtual environment for ADAPT. To install via PyPI, run the following command:

pip install adapt-diagnostics

If you want to be able to use thermodynamic modules of ADAPT, run the following instead:

pip install "adapt-diagnostics[thermo]"

If you want to be able to use AWS cloud features through ADAPT, run the following instead:

pip install "adapt-diagnostics[AWS]"

For both AWS and thermodynamics, run the following instead:

pip install "adapt-diagnostics[AWS,thermo]"

If you wish to modify ADAPT's code, ADAPT can be installed by cloning the repository and installing the package with pip:

git clone git@github.com:broadinstitute/adapt.git
cd adapt
pip install -e .

Depending on your setup (e.g., if you do not have write permissions in the installation directory), you may need to supply --user to pip install.

If you want to be able to use thermodynamic modules of ADAPT, replace the last line with the following:

pip install -e ".[thermo]"

If you want to be able to use AWS cloud features through ADAPT, replace the last line with the following:

pip install -e ".[AWS]"

For both AWS and thermodynamics, replace the last line with the following:

pip install -e ".[AWS,thermo]"

Testing

If you clone this repository, you may want to run the tests to confirm that your clone works properly. This package uses Python's unittest framework. To execute all tests, run the following from the root directory of your ADAPT clone:

python -m unittest discover

Running on Docker

Note: This section is optional, but may be useful for more advanced users or developers. You will need to install Docker.

If you would like to run ADAPT using a Docker container rather than installing it, you may use one of our pre-built ADAPT images.

For ADAPT without cloud features, use the image ID quay.io/broadinstitute/adapt.

For ADAPT with cloud features, use the image ID quay.io/broadinstitute/adaptcloud.

To pull our Docker image to your computer, run:

docker pull [IMAGE-ID]

To run ADAPT on a Docker container, run:

docker run --rm [IMAGE-ID] "[COMMAND]"

To run with ADAPT memoizing to a local directory, run:

docker run --rm -v /path/to/memo/on/host:/memo [IMAGE-ID] "[COMMAND]"

To run the container interactively (opening a command line to the container), run:

docker run --rm -it [IMAGE-ID]

Using ADAPT

Overview

The main program for designing assays is design.py.

Below, we refer to guides in reference to our pre-trained model for CRISPR-Cas13a guides and our testing of ADAPT's designs with Cas13a. More generally, guides can be thought of as probes to encompass other diagnostic technologies.

design.py requires two subcommands:

design.py [SEARCH-TYPE] [INPUT-TYPE] ...

Required subcommands

SEARCH-TYPE is one of complete-targets or sliding-window; each is described in the sections below.

INPUT-TYPE is one of fasta, auto-from-args, or auto-from-file; the positional arguments for each are described below.

Positional arguments

The positional arguments — which specify required input to ADAPT — depend on the INPUT-TYPE. These arguments are defined below for each INPUT-TYPE.

If INPUT-TYPE is fasta:
design.py [SEARCH-TYPE] fasta [fasta] [fasta ...] -o [out-tsv] [out-tsv ...]

where [fasta] is a path to an aligned FASTA file for a taxon and [out-tsv] specifies the basename of the output TSV file (without the .tsv suffix). If more than one space-separated FASTA file is given, the same number of output TSV files must be given; the i'th output gives designs for the i'th input FASTA.

If INPUT-TYPE is auto-from-args:
design.py [SEARCH-TYPE] auto-from-args [taxid] [segment] [out-tsv]

where [taxid] is an NCBI taxonomy ID, [segment] is a segment label (e.g., 'S') or 'None' if unsegmented, and [out-tsv] specifies where to write the output TSV file.

If INPUT-TYPE is auto-from-file:
design.py [SEARCH-TYPE] auto-from-file [in-tsv] [out-dir]

where [in-tsv] is a path to a file specifying the input taxonomies (run design.py [SEARCH-TYPE] auto-from-file --help for details) and [out-dir] specifies a directory in which to write the outputs.

Details on all arguments

To see details on all the arguments available, run

design.py [SEARCH-TYPE] [INPUT-TYPE] --help

with the particular choice of subcommands substituted in for [SEARCH-TYPE] and [INPUT-TYPE].

Specifying the objective

ADAPT supports two objective functions, specified using the --obj argument: maximize-activity and minimize-guides.

Details on each are below.

Objective: maximizing activity

Setting --obj maximize-activity tells ADAPT to design sets of guides having maximal activity, in expectation over the input taxon's genomic diversity, subject to soft and hard constraints on the size of the guide set. This is usually our recommended objective, especially with access to a predictive model. With this objective, the following arguments to design.py are relevant:

Note that maximizing activity requires a predictive model of activity, so --predict-activity-model-path or --predict-cas13a-activity-model should be specified (details in Miscellaneous key arguments). If you wish to use this objective but cannot use our pre-trained Cas13a model or another model, see the help message for the argument --use-simple-binary-activity-prediction.

Objective: minimizing complexity

Setting --obj minimize-guides tells ADAPT to minimize the number of guides in an assay subject to constraints on coverage of the input taxon's genomic diversity. With this objective, the following arguments to design.py are relevant:

Enforcing specificity

ADAPT can enforce strict specificity so that designs will distinguish related taxa.

For all INPUT-TYPEs, ADAPT can enforce specificity by parsing the --specific-against-* arguments. When INPUT-TYPE is auto-from-file or fasta, ADAPT will also automatically enforce specificity between taxa/FASTA files using a single specificity index.

To enforce specificity, the following arguments to design.py are important:

Searching for complete targets

When SEARCH-TYPE is complete-targets, ADAPT performs a branch and bound search to find a collection of assay design options. It finds the best N design options for a specified N. Each design option represents a genomic region containing primer pairs and guides between them. There is no set length for the region. The N options are intended to be a diverse (non-overlapping) selection.

Below are key arguments to design.py when SEARCH-TYPE is complete-targets:

Automatically downloading and curating data

When INPUT-TYPE is auto-from-{file,args}, ADAPT will run end-to-end. It fetches and curates genomes, clusters and aligns them, and uses the generated alignment as input for design.

Below are key arguments to design.py when INPUT-TYPE is auto-from-file or auto-from-args:

When using AWS S3 to memoize information across runs (--prep-memoize-dir), the following arguments are also important:

Using custom sequences as input

When INPUT-TYPE is fasta, ADAPT will run on only the sequences specified in the FASTA, without curation.

Below are key arguments to design.py when INPUT-TYPE is fasta:

Weighting sequences

By default, ADAPT bases the "coverage" across a virus's variation on the percent of genome sequences predicted to be detected. Likewise, when maximizing expected (or average) activity across variation, it treats the different genome sequences uniformly. While this works well if the genome sequences represent a random sample of the targeted viral population, that is often not the case owing to sampling biases. We include sequence weighting in ADAPT, allowing the relative importance of sequences to be set.

To manually set sequence weights when INPUT-TYPE is fasta, use --weight-sequences WEIGHT_SEQUENCES. WEIGHT_SEQUENCES should be a file path to a TSV with two columns: (1) a sequence name that matches one in the input FASTA; (2) the weight of that sequence. If more than one input FASTA is given, the same number of input TSVs must be given, and the i'th TSV corresponds to the i'th FASTA. The input weights are normalized to sum to 1 and used when calculating objective scores and summary statistics. Any sequence not listed in the input TSV(s) is assigned, by default, a pre-normalized weight of 1.
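The normalization described above can be illustrated with a short Python sketch. This is not ADAPT's own code: the function name load_normalized_weights and the sequence names are hypothetical, and the sketch assumes a headerless, tab-separated file with the name in the first column and the weight in the second.

```python
import csv
import io


def load_normalized_weights(tsv_text, fasta_names, default_weight=1.0):
    """Return {sequence name: normalized weight} for every name in fasta_names.

    Illustrative only. Names that appear in the TSV but not in fasta_names
    are ignored here; how ADAPT itself treats them is not covered.
    """
    listed = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        listed[row[0]] = float(row[1])

    # Sequences missing from the TSV get the default pre-normalized weight.
    raw = {name: listed.get(name, default_weight) for name in fasta_names}

    # Normalize so that the weights sum to 1.
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# seqC is not listed, so it starts from the default weight of 1.
weights = load_normalized_weights("seqA\t3\nseqB\t1\n", ["seqA", "seqB", "seqC"])
print(weights)
```

Here the pre-normalized weights are 3, 1, and 1, so seqA ends up with three fifths of the total weight and the other two sequences with one fifth each.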

When ADAPT designs an assay across multiple subtaxa, each with very different levels of sampling, ADAPT may design deficient assays that only detect a highly overrepresented subtaxon and no other subtaxa. While the number of sequences in the database often indicates a subtaxon's relative importance, it should typically not cause other subtaxa to be ignored in practice.

As a simple correction for this problem, ADAPT includes the argument --weight-by-log-size-of-subtaxa SUBTAXA for use when INPUT-TYPE is auto-from-args or auto-from-file. SUBTAXA is a taxonomic rank ('genus', 'subgenus', 'species', or 'subspecies') lower than the rank of the taxon being designed for. It works as follows:

  1. Each input sequence is associated with one SUBTAXA group.
  2. Each SUBTAXA group is assigned a weight equal to the log of the number of sequences in that group plus 1.
  3. Each sequence is assigned a weight equal to the weight of its SUBTAXA group divided by the number of sequences in its SUBTAXA group.
  4. Weights are normalized across all sequences to sum to 1.
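The four steps above can be sketched in Python as follows. This is a minimal illustration rather than ADAPT's implementation: the function name log_size_weights and the example data are hypothetical, and step 2 is read here as log(n + 1) with a natural logarithm, which is an assumption about the intended formula.

```python
import math
from collections import Counter


def log_size_weights(seq_to_subtaxon):
    """Return {sequence: weight} from a {sequence: subtaxon group} mapping."""
    # Step 1: each sequence is associated with one group; count group sizes.
    group_sizes = Counter(seq_to_subtaxon.values())

    # Step 2 (assumption: natural log of size + 1).
    group_weight = {g: math.log(n + 1) for g, n in group_sizes.items()}

    # Step 3: split each group's weight evenly among its sequences.
    raw = {
        seq: group_weight[g] / group_sizes[g]
        for seq, g in seq_to_subtaxon.items()
    }

    # Step 4: normalize across all sequences to sum to 1.
    total = sum(raw.values())
    return {seq: w / total for seq, w in raw.items()}


# 98 sequences from an oversampled subtaxon A and 2 from subtaxon B.
seq_to_subtaxon = {f"a{i}": "A" for i in range(98)}
seq_to_subtaxon.update({"b0": "B", "b1": "B"})

weights = log_size_weights(seq_to_subtaxon)
share_b = weights["b0"] + weights["b1"]
print(f"total weight on subtaxon B: {share_b:.3f}")
```

With uniform weighting, subtaxon B would carry 2% of the total weight; under this sketch it carries considerably more, while the larger subtaxon still carries the majority.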

Miscellaneous key arguments

In addition to the arguments above, there are others that are often important when running design.py:

Output

The files output by ADAPT are TSV files, but vary in format depending on SEARCH-TYPE and INPUT-TYPE. There is a separate TSV file for each taxon.

For all cases, run design.py [SEARCH-TYPE] [INPUT-TYPE] --help to see details on the output format and on how to specify paths to the output TSV files.

Complete targets

When SEARCH-TYPE is complete-targets, each row gives an assay design option; there are BEST_N_TARGETS of them. Each design option corresponds to a genomic region (amplicon). The columns give the primer and guide sequences as well as additional information about them. There are about 20 columns; some key ones are:

The rows in the output are sorted by the objective value: better options are on top. Smaller values are better with --obj minimize-guides and larger values are better with --obj maximize-activity.

When INPUT-TYPE is auto-from-file or auto-from-args and ADAPT generates more than one cluster of input sequences, there is a separate TSV file for each cluster; the filenames end in .0, .1, etc.
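As a rough sketch of how per-cluster outputs might be collected for downstream use, the Python below reads every file matching a basename followed by a cluster index. It assumes filenames of the form guides.0.tsv (as in the end-to-end example in this README) and a header row in each TSV; the helper name read_cluster_outputs and the column names in the demo data are placeholders, not ADAPT's actual columns.

```python
import csv
import tempfile
from pathlib import Path


def read_cluster_outputs(out_dir, basename):
    """Map each cluster label ('0', '1', ...) to its list of row dicts."""
    clusters = {}
    for path in sorted(Path(out_dir).glob(f"{basename}.*.tsv")):
        # "guides.0.tsv" -> "0"
        label = path.name[len(basename) + 1:-len(".tsv")]
        with path.open(newline="") as handle:
            clusters[label] = list(csv.DictReader(handle, delimiter="\t"))
    return clusters


# Demo with placeholder files; real output has many more columns.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "guides.0.tsv").write_text(
        "objective-value\tnote\n4.2\tbest\n3.9\tnext\n"
    )
    Path(tmp, "guides.1.tsv").write_text("objective-value\tnote\n2.5\tbest\n")
    clusters = read_cluster_outputs(tmp, "guides")

for label, rows in clusters.items():
    # Rows are already sorted by objective value, so the first row is the top option.
    print(label, rows[0])
```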

Sliding window

When SEARCH-TYPE is sliding-window, each row gives a window in the alignment and the columns give information about the guides designed for that window. The columns are:

By default, when SEARCH-TYPE is sliding-window, the rows in the output are sorted by the position of the window. With the --sort argument to design.py, ADAPT sorts the rows so that the "best" choices of windows are on top. It sorts by count (ascending) followed by score (descending), so that windows with the fewest guides and highest score are on top.
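The ordering applied by --sort can be pictured with a small Python sketch. The rows and the window-start field below are made up for illustration; only the sort order (count ascending, then score descending) comes from the description above.

```python
# Hypothetical rows: one per window, with the number of guides and a score.
rows = [
    {"window-start": 0, "count": 2, "score": 0.91},
    {"window-start": 200, "count": 1, "score": 0.80},
    {"window-start": 400, "count": 1, "score": 0.95},
    {"window-start": 600, "count": 3, "score": 0.99},
]

# Fewest guides first; among ties, highest score first.
best_first = sorted(rows, key=lambda r: (r["count"], -r["score"]))

for row in best_first:
    print(row["window-start"], row["count"], row["score"])
```

In this toy input the two single-guide windows come first, with the higher-scoring one on top, and the three-guide window comes last despite having the highest score.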

Complementarity

Note that output sequences are all in the same sense as the input (target) sequences. Synthesized guide sequences should be reverse complements of the output sequences! Likewise, synthesized primer sequences should account for this.
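For reference, a reverse complement can be computed with a few lines of Python. This is a generic sketch and not part of ADAPT; the function name is hypothetical, and it handles only A, C, G, T, and N, raising an error on any other character rather than guessing.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    try:
        return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))
    except KeyError as err:
        raise ValueError(f"unsupported base: {err.args[0]!r}") from None


print(reverse_complement("ATGCC"))  # GGCAT
```

If the synthesized molecule is RNA, the T bases in the result would be written as U.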

Examples

Basic: designing within sliding window without predictive model

This is the simplest example. It does not download genomes or search for genomic regions to target. It also does not use a predictive model of activity, and it seeks to minimize assay complexity rather than maximize activity, which is our usual objective. For these features, see the next example.

The repository includes an alignment of Lassa virus sequences (S segment) from Sierra Leone in examples/SLE_S.aligned.fasta. If you have installed ADAPT via Bioconda or PyPI, you'll need to download the alignment from here. Run:

design.py sliding-window fasta FASTA_PATH -o probes --obj minimize-guides -w 200 -gl 28 -gm 1 -gp 0.95

From this alignment, ADAPT scans each 200 nt window (-w 200) to find the smallest collection of probes that:

ADAPT outputs a file, probes.tsv, that contains the probe sequences for each window. See Output above for a description of this file.

Designing end-to-end with predictive model

ADAPT can automatically download and curate sequences for its design, and search efficiently across the genome to find primers/amplicons as well as Cas13a guides. It identifies Cas13a guides using a pre-trained predictive model of activity.

Run:

design.py complete-targets auto-from-args 64320 None guides --obj maximize-activity -gl 28 -pl 30 -pm 1 -pp 0.95 --predict-cas13a-activity-model --best-n-targets 5 --mafft-path MAFFT_PATH --sample-seqs 50 --verbose

This downloads genomes of Zika virus (NCBI taxonomy ID 64320) and designs assays to detect them. You must replace MAFFT_PATH with the path to an executable of MAFFT.

ADAPT designs primers and Cas13a guides within the amplicons, such that:

ADAPT outputs a file, guides.0.tsv, that contains the best 5 design options (--best-n-targets 5) as measured by ADAPT's default objective function. See Output above for a description of this file.

This example randomly selects 50 sequences (--sample-seqs 50) prior to design to reduce the runtime; the command should take about 10 minutes to run in full. Using --verbose provides detailed output and is usually recommended, but the output can be extensive.

Note that this example does not enforce specificity.

To find minimal guide sets instead, replace --obj maximize-activity with --obj minimize-guides and set -gm and -gp. With that objective, a Cas13a guide is considered to detect a sequence if it (i) is within the number of mismatches specified with -gm and (ii) is predicted by the model to be highly active in detecting the sequence; -gm can be set sufficiently high to rely entirely on the predictive model. The output guides will detect a desired fraction of all genomes, as specified by -gp.

Support and contributing

Questions

If you have questions about ADAPT, please create an issue.

Contributing

We welcome contributions to ADAPT. This can be in the form of an issue or pull request.

Citation

ADAPT was started by Hayden Metsky, and is developed by Priya Pillai and Hayden.

If you find ADAPT useful to your work, please cite our paper as:

License

ADAPT is licensed under the terms of the MIT license.

Related repositories

There are other repositories on GitHub associated with ADAPT: