aryeelab / guideseq

Analysis pipeline for the GUIDE-seq assay.
GNU Affero General Public License v3.0
76 stars 53 forks source link

Version Python versions Platforms

guideseq: The GUIDE-Seq Analysis Package

Note that an updated version of this package, including Python 3 support, is maintained by the Tsai Lab: https://github.com/tsailabSJ/guideseq

This repo (aryeelab/guideseq) contains experimental features.


The guideseq package implements our data preprocessing and analysis pipeline for GUIDE-Seq data. It takes raw sequencing reads (FASTQ) and a parameter manifest file (.yaml) as input and produces a table of annotated off-target sites as output.

References

The original paper describing the GUIDE-Seq method:

Tsai SQ, Zheng Z, Nguyen NT, Liebers M, Topkar VV, Thapar V, Wyvekens N, Khayter C, Iafrate AJ, Le LP, Aryee MJ, Joung JK. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat Biotechnol. 2015 Feb;33(2):187-197

A description of this analysis package:

Tsai SQ, Topkar VV, Joung JK, Aryee MJ. Open-source guideseq software for analysis of GUIDE-seq data. Nat Biotechnol. 2016 May 6;34(5):483

Table of Contents

Features

The package implements a pipeline consisting of a read preprocessing module followed by an off-target identification module. The preprocessing module takes raw reads (FASTQ) from a pooled multi-sample sequencing run as input. Reads are demultiplexed into sample-specific FASTQs and PCR duplicates are removed using unique molecular index (UMI) barcode information.

guideseq_flowchart

The individual pipeline steps are:

  1. Sample demultiplexing: A pooled multi-sample sequencing run is demultiplexed into sample-specific read files based on sample-specific dual-indexed barcodes
  2. PCR Duplicate Consolidation:Reads that share the same UMI and the same first six bases of genomic sequence are presumed to originate from the same pre-PCR molecule and are thus consolidated into a single consensus read to improve quantitative interpretation of GUIDE-Seq read counts.
  3. Read Alignment: The demultiplexed, consolidated paired end reads are aligned to a reference genome using the BWA-MEM algorithm with default parameters (Li. H, 2009).
  4. Candidate Site Identification: The start mapping positions of the read amplified with the tag-specific primer (second of pair) are tabulated on a genome-wide basis. Start mapping positions are consolidated using a 10-bp sliding window. Windows with reads mapping to both + and - strands, or to the same strand but amplified with both forward and reverse tag-specific primers, are flagged as sites of potential DSBs. 25 bp of reference sequence is retrieved on either side of the most frequently occuring start-mapping position in each flagged window. The retrieved sequence is aligned to the intended target sequence using a Smith-Waterman local-alignment algorithm.
  5. False positive filtering: Off-target cleavage sites with more than six mismatches to the intended target sequence, or that are present in background controls, are filtered out.
  6. Reporting: Identified off-targets, sorted by GUIDE-Seq read count are annotated in a final output table. The GUIDE-Seq read count is expected to scale approximately linearly with cleavage rates (Tsai et al., Nat Biotechnol. 2015).
  7. Visualization: Alignment of detected off-target sites is visualized via a color-coded sequence grid, as seen below:

guideseq_flowchart

Dependencies

Getting Set Up

Installation

# It's recommended (but not essential) to set up a conda environment to manage dependencies
conda create -n guideseq python=3.8
conda activate guideseq

git clone --recursive https://github.com/aryeelab/guideseq
cd guideseq

pip install -r requirements.txt
python setup.py install

guideseq.py -h

## Please install BWA and bedtools if you choose this option

For both bwa and bedtools, make sure you know the path to the respective executables, as they need to be specified in the pipeline manifest file.

Quickstart

guideseq.py all -m test_manifest.yaml

Running the Full Analysis Pipeline

To run the full guideseq analysis pipeline, you must first create a manifest YAML file that describes all pipeline inputs. Once you have done so, you can simply run

guideseq.py all -m /path/to/manifest.yaml

to run the entire pipeline. Below are specific instructions detailing how to write the manifest file.

If you wish to run an example on our abridged test data, you can simply run


cd guideseq/test

guideseq.py all -m test_manifest.yaml

from the guideseq root directory. The test_manifest assumes that both the bwa and bedtoolsexecutables are in your system PATH. You will see the pipeline results outputted to the test/output folder.

Writing A Manifest File

When running the end-to-end analysis functionality of the guideseq package, a number of inputs are required. To simplify the formatting of these inputs and to encourage reproducibility, these parameters are inputted into the pipeline via a manifest formatted as a YAML file. YAML files allow easy-to-read specification of key-value pairs. This allows us to easily specify our parameters. The following fields are required in the manifest:

An example undemultiplexed field:

undemultiplexed:
    forward: ../test/data/undemux.r1.fastq.gz
    reverse: ../test/data/undemux.r2.fastq.gz
    index1: ../test/data/undemux.i1.fastq.gz
    index2: ../test/data/undemux.i2.fastq.gz

An example samples field:

samples:
    control:
        target:
        barcode1: CTCTCTAC
        barcode2: CTCTCTAT
        description: Control

    [SAMPLENAME]:
        target: GAGTCCGAGCAGAAGAAGAANGG
        barcode1: TAGGCATG
        barcode2: TAGATCGC
        description: EMX1

A Full Manifest File Example

Below is an example of a full manifest file. Feel free to copy it and replace the parameters with your own experiment data. Remember that you can input more than just one treatment sample (e.g. the "EMX1" data below).

reference_genome: test/test_genome.fa
output_folder: test/output

bwa: bwa
bedtools: bedtools
PAM: NGG
demultiplex_min_reads: 1000
window_size: 75
max_mismatches: 7

undemultiplexed:
    forward: test/data/undemultiplexed/undemux.r1.fastq.gz
    reverse: test/data/undemultiplexed/undemux.r2.fastq.gz
    index1: test/data/undemultiplexed/undemux.i1.fastq.gz
    index2: test/data/undemultiplexed/undemux.i2.fastq.gz

samples:
    control:
        target:  
        barcode1: CTCTCTAC
        barcode2: CTCTCTAT
        description: Control

    EMX1:
        target: GAGTCCGAGCAGAAGAAGAANGG
        barcode1: TAGGCATG
        barcode2: TAGATCGC
        description: EMX_site1

Pipeline Output

When running the full pipeline, the results of each step are outputted to the output_folder in a separate folder for each step. The output folders and their respective contents are as follows:

Output Folders

The final detected off-target sites are placed in the output_folder/identified folder, with one .txt file for each sample specified in the manifest. The fields that are populated in each row of these off-target files are specified below:

Output Off-Targets .txt Fields:

The key fields for interpreting this output and identifying off-target sites are: BED off-target Chromosome, BED off-target start, BED off-target end, BED off-target name, BED off-target strand, Off-Target Sequence, bi.sum.mi

Output Visualizations

The outputted visualizations are in the .svg vector format, which is an open image standard that can be viewed in any modern web browser (e.g. Google Chrome, Apple Safari, Mozilla Firefox), and can be viewed and edited in any vector editing application (e.g. Adobe Illustrator). Because the output visualizations are vector images, they can be scaled up or down infinitely without a loss in quality, and can also be edited as shapes with ease. This makes the images produced by the guideseq package ideal for posters, presentations, and papers.

Running Analysis Steps Individually

In addition to end-to-end pipeline analysis functionality, the guideseq package also allows for every step fo the analysis to be run individually. Here we have detailed the required inputs and expected outputs of each step. For each step, we have included a "runnable example" command that can be executed from the guideseq root directory to run that step on the included sample data. These "runnable example" snippets put their output in the test/output folder.

demultiplex Pooled Multi-Sample Sequencing (Manifest Required)

umitag Reads

consolidate PCR Duplicates

align Sites to Genome

identify Off-target Site Candidates

filter Background DSB Sites

visualize Detected Off-Target Sites

Frequently Asked Questions

How do I Run the Pipeline with Demultiplexed Data?

If you already have demultiplexed data, you can run the pipeline on the data by running each step after demultiplexing individually, as described in the "Running Analysis Steps Individually" section above. Be sure to run the individual steps in the following orders:

Can I analyze data without UMIs?

Yes. If your reads do not have UMIs, you can run the pipeline on previously demultiplexed data as described in the "Running Analysis Steps Individually" section above, starting with the align step. Note that we have not analyzed such data ourselves! We suspect that PCR duplication bias may affect the quantitative interpretion of GUIDE-Seq read counts, but have not explored this.

Download Reference Genome

The guideseq package requires a reference genome for read mapping. You can use any genome of your choosing, but for all of our testing and original GUIDE-seq analyses (Tsai et al. Nature Biotechnol 2015) we use hg19 (download). Be sure to (g)unzip the FASTA file before use if it is compressed.

Configuring a MiSeq to Output Index Reads

The guideseq package requires index reads from the MiSeq sequencing run for read consolidation. The default MiSeq Reporter settings do not generate index (I1, I2) reads. This feature can be enabled by adding the line

<add key="CreateFastqForIndexReads" value="1"> 

to the Miseq Reporter.exe.config file located in the Miseq Reporter installation folder. The default installation folder is C:\Illumina\MiSeqReporter. After modifying the config file it should look like this:

<appSettings>
    ... [LEAVE EXISTING LINES UNCHANGED] ...
    <add key="CreateFastqForIndexReads" value="1"> 
</appSettings>

The MiSeq Reporter service needs to be restarted for the change to take effect. Future runs of the GenerateFASTQ workflow (and probably other workflows) will generate I1 and I2 reads in addition to R1 and R2. All four of these reads files will be needed for guideseq analysis.

See page 29 of the Miseq Reporter User Guide for further instructions.