bergmanlab / mcclintock

Meta-pipeline to identify transposable element insertions using next generation sequencing data


McClintock: A meta-pipeline to identify transposable element insertions using short-read whole genome sequencing data

Getting Started

# INSTALL (Requires Conda and Mamba to be installed)
git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock
mamba env create -f install/envs/mcclintock.yml --name mcclintock
conda activate mcclintock
python3 mcclintock.py --install
python3 test/download_test_data.py

# RUN
python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -g test/reference_TE_locations.gff \
    -t test/sac_cer_te_families.tsv \
    -1 test/SRR800842_1.fastq.gz \
    -2 test/SRR800842_2.fastq.gz \
    -p 4 \
    -o /path/to/output/directory

Introduction

Many methods have been developed to detect transposable element (TE) insertions from short-read whole genome sequencing (WGS) data, each of which has different dependencies, run interfaces, and output formats. McClintock provides a meta-pipeline to reproducibly install, execute, and evaluate multiple TE detectors, and to generate output in standardized formats. A description of the original McClintock 1 pipeline and evaluation of the original six TE detectors on the yeast genome can be found in Nelson, Linheiro and Bergman (2017) G3 7:2763-2778. A description of the re-implemented McClintock 2 pipeline, the reproducible simulation system, and evaluation of 12 TE detectors on the yeast genome can be found in Chen, Basting, Han, Garfinkel and Bergman (2023) Mobile DNA 14:8. The TE detectors currently included in McClintock 2 are ngs_te_mapper, ngs_te_mapper2, PoPoolationTE, PoPoolationTE2, RelocaTE, RelocaTE2, RetroSeq, TEbreak, TEMP, TEMP2, TE-Locate, and TEFLoN.

Installing Conda/Mamba

McClintock is written in Python 3, uses the Snakemake workflow system, and is designed to run on Linux operating systems. Conda automates the installation of software dependencies for McClintock and its component methods, so a working installation of Conda is required to install McClintock. Conda can be installed via the Miniconda installer.

Installing Miniconda (Python 3.X)

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O $HOME/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda # silent mode
echo "export PATH=\$PATH:\$HOME/miniconda/bin" >> $HOME/.bashrc # add to .bashrc
source $HOME/.bashrc
conda init

Update Conda

conda update -y conda

Install Mamba

conda install -c conda-forge mamba

Installing McClintock

After installing and updating Conda/Mamba, McClintock and its component methods can be installed by: 1. cloning the repository, 2. creating the Conda environment, and 3. running the install script.

Clone McClintock Repository

git clone git@github.com:bergmanlab/mcclintock.git
cd mcclintock

Create McClintock Conda Environment

mamba env create -f install/envs/mcclintock.yml --name mcclintock

Activate McClintock Conda Environment

conda activate mcclintock

Install McClintock Component Methods

python3 mcclintock.py --install

McClintock Usage

Running the complete McClintock pipeline requires a fasta reference genome (option -r), a set of TE consensus/canonical sequences present in the organism (option -c), and fastq paired-end sequencing reads (options -1 and -2). If only single-end fastq sequencing data are available, supply them using option -1 alone; in that case only the TE detectors that handle single-end data will be run. Optionally, if a detailed annotation of TE sequences in the reference genome has been performed, a GFF file with annotated reference TEs (option -g) and a tab-delimited "taxonomy" file linking annotated insertions to their TE family (option -t) can be supplied. Example input files are included in the test directory.
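
The combination of required and optional inputs described above can be sketched as a small Python helper that assembles the argument list for mcclintock.py. This is an illustrative sketch only: the function name build_mcclintock_command and its parameter names are hypothetical and not part of McClintock; only the command-line flags themselves come from this README.

```python
from typing import List, Optional


def build_mcclintock_command(
    reference: str,
    consensus: str,
    fastq1: str,
    fastq2: Optional[str] = None,
    locations: Optional[str] = None,
    taxonomy: Optional[str] = None,
    procs: int = 1,
    out: str = ".",
) -> List[str]:
    """Assemble an argument list for mcclintock.py (hypothetical helper)."""
    # -r, -c and -1 are the required inputs.
    cmd = ["python3", "mcclintock.py", "-r", reference, "-c", consensus, "-1", fastq1]
    if fastq2 is not None:
        # Omit -2 entirely when only single-end reads are available.
        cmd += ["-2", fastq2]
    if locations is not None:
        cmd += ["-g", locations]
    if taxonomy is not None:
        cmd += ["-t", taxonomy]
    cmd += ["-p", str(procs), "-o", out]
    return cmd


# Paired-end run using the example files from the test directory.
paired = build_mcclintock_command(
    reference="test/sacCer2.fasta",
    consensus="test/sac_cer_TE_seqs.fasta",
    fastq1="test/SRR800842_1.fastq.gz",
    fastq2="test/SRR800842_2.fastq.gz",
    locations="test/reference_TE_locations.gff",
    taxonomy="test/sac_cer_te_families.tsv",
    procs=4,
    out="/path/to/output/directory",
)

# Single-end run: only -1 is supplied.
single = build_mcclintock_command(
    reference="test/sacCer2.fasta",
    consensus="test/sac_cer_TE_seqs.fasta",
    fastq1="test/SRR800842_1.fastq.gz",
)

print(" ".join(paired))
print(" ".join(single))
```

The resulting list could be passed to subprocess.run(cmd, check=True) from the McClintock repository directory with the mcclintock Conda environment active.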

##########################
##       Required       ##
##########################
  -r, --reference REFERENCE
                        A reference genome sequence in fasta format
  -c, --consensus CONSENSUS
                        The consensus sequences of the TEs for the species in
                        fasta format
  -1, --first FIRST
                        The path of the first fastq file from paired end read
                        sequencing or the fastq file from single read
                        sequencing

##########################
##       Optional       ##
##########################
  -h, --help            show this help message and exit
  -2, --second SECOND
                        The path of the second fastq file from a paired end
                        read sequencing
  -p, --proc PROC       The number of processors to use for parallel stages of
                        the pipeline [default = 1]
  -o, --out OUT         An output folder for the run. [default = '.']
  -m, --methods METHODS
                        A comma-delimited list containing the software you
                        want the pipeline to use for analysis. e.g. '-m
                        relocate,TEMP,ngs_te_mapper' will launch only those
                        three methods. If this option is not set, all methods
                        will be run [options: ngs_te_mapper, ngs_te_mapper2, 
                        relocate, relocate2, temp, temp2, retroseq, 
                        popoolationte, popoolationte2, te-locate, teflon, 
                        coverage, trimgalore, map_reads, tebreak]

  -g, --locations LOCATIONS
                        The locations of known TEs in the reference genome in
                        GFF 3 format. This must include a unique ID attribute
                        for every entry. If this option is not set, a file of 
                        reference TE locations in GFF format will be produced 
                        using RepeatMasker
  -t, --taxonomy TAXONOMY
                        A tab delimited file with one entry per ID in the GFF
                        file and two columns: the first containing the ID and
                        the second containing the TE family it belongs to. The
                        family should correspond to the names of the sequences
                        in the consensus fasta file. If this option is not set, 
                        a file mapping reference TE instances to TE families 
                        in TSV format will be produced using RepeatMasker
  -s, --coverage_fasta COVERAGE_FASTA
                        A fasta file that will be used for TE-based coverage
                        analysis, if not supplied then the consensus sequences
                        of the TEs set by -c/--consensus will be used for the 
                        analysis
  -a, --augment AUGMENT
                        A fasta file of TE sequences that will be included as
                        extra chromosomes in the reference file (useful if the
                        organism is known to have TEs that are not present in
                        the reference strain)
  -k, --keep_intermediate KEEP_INTERMEDIATE
                        This option determines which intermediate files are 
                        preserved after McClintock completes [default: general]
                        [options: minimal, general, methods, <list,of,methods>, 
                        all]
  -s, --sample_name SAMPLE_NAME
                        The sample name to use for output files [default: 
                        fastq1 name]
  -n, --config CONFIG   This option determines which config files to use for 
                        your McClintock run [default: config in McClintock 
                        Repository]
  -v, --vcf VCF         This option determines which format of VCF output will 
                        be created [default: siteonly][options: siteonly,sample]
  --install             This option will install the dependencies of McClintock
  --resume              This option will attempt to use existing intermediate 
                        files from a previous McClintock run
  --debug               This option will allow snakemake to print progress to 
                        stdout
  --serial              This option runs without attempting to optimize thread 
                        usage to run rules concurrently. Each multithread rule 
                        will use the max processors designated by -p/--proc
  --make_annotations    This option will only run the pipeline up to the 
                        creation of the repeat annotations
  --comments            If this option is specified then fastq comments (e.g.
                        barcode) will be incorporated to SAM output. Warning:
                        do not use this option if the input fastq files do not
                        have comments
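
The help text for -g and -t states three constraints: every GFF entry needs a unique ID attribute, the taxonomy file needs one entry per GFF ID, and each family should match a sequence name in the consensus fasta. A minimal Python sketch of how these constraints could be checked before a run is shown below. The helper names and the toy records (te_1, te_2, FAM_A) are invented for illustration, and this is not the validation McClintock itself performs.

```python
def gff_ids(gff_text):
    """Return the ID attribute of every feature line in GFF3 text."""
    ids = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if "ID" in attributes:
            ids.append(attributes["ID"])
    return ids


def taxonomy_map(tsv_text):
    """Map each TE ID (column 1) to its family (column 2)."""
    mapping = {}
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def fasta_names(fasta_text):
    """Return the first word of every fasta header line."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


# Toy inputs (invented for this example, not real McClintock test data).
gff = (
    "chrI\tdemo\tTE\t100\t600\t.\t+\t.\tID=te_1\n"
    "chrI\tdemo\tTE\t900\t1500\t.\t-\t.\tID=te_2;Name=example\n"
)
tsv = "te_1\tFAM_A\n"
fasta = ">FAM_A\nACGT\n"

ids = gff_ids(gff)
taxonomy = taxonomy_map(tsv)

duplicate_ids = sorted({i for i in ids if ids.count(i) > 1})
missing_taxonomy = [i for i in ids if i not in taxonomy]
unknown_families = set(taxonomy.values()) - fasta_names(fasta)

print("duplicate IDs:", duplicate_ids)
print("IDs without a taxonomy entry:", missing_taxonomy)
print("families absent from the consensus fasta:", unknown_families)
```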

McClintock Input Files

Warning

Required

McClintock Output

The results of McClintock component methods are output to the directory <output>/<sample>/results.

HTML Summary Report: <output>/<sample>/results/summary/

Raw Summary files : <output>/<sample>/results/summary/

TrimGalore : <output>/<sample>/results/trimgalore/

Coverage : <output>/<sample>/results/coverage/

ngs_te_mapper : <output>/<sample>/results/ngs_te_mapper/

ngs_te_mapper2 : <output>/<sample>/results/ngs_te_mapper2/

PoPoolationTE : <output>/<sample>/results/popoolationTE/

PoPoolationTE2 : <output>/<sample>/results/popoolationTE2/

RelocaTE : <output>/<sample>/results/relocaTE/

RelocaTE2 : <output>/<sample>/results/relocaTE2/

RetroSeq : <output>/<sample>/results/retroseq/

TEbreak : <output>/<sample>/results/tebreak/

TEMP : <output>/<sample>/results/TEMP/

TEMP2 : <output>/<sample>/results/temp2/

TE-Locate : <output>/<sample>/results/te-locate/

TEFLoN : <output>/<sample>/results/teflon/
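
For scripts that collect results, the directory layout listed above can be captured in a small lookup. The following Python sketch builds the results path for a given method. The dictionary keys reuse the method names from the -m option, the directory names are copied from the list above, and the function name results_dir and the sample name used in the example are hypothetical.

```python
from pathlib import Path

# Method name (as used with -m) -> results subdirectory from the list above.
RESULT_DIRS = {
    "summary": "summary",
    "trimgalore": "trimgalore",
    "coverage": "coverage",
    "ngs_te_mapper": "ngs_te_mapper",
    "ngs_te_mapper2": "ngs_te_mapper2",
    "popoolationte": "popoolationTE",
    "popoolationte2": "popoolationTE2",
    "relocate": "relocaTE",
    "relocate2": "relocaTE2",
    "retroseq": "retroseq",
    "tebreak": "tebreak",
    "temp": "TEMP",
    "temp2": "temp2",
    "te-locate": "te-locate",
    "teflon": "teflon",
}


def results_dir(output, sample, method):
    """Return <output>/<sample>/results/<method directory> as a Path."""
    return Path(output) / sample / "results" / RESULT_DIRS[method]


# "my_sample" is a placeholder; the actual sample name depends on the run.
for method in ("summary", "temp", "popoolationte2"):
    print(results_dir("/path/to/output/directory", "my_sample", method))
```

Note that the directory names are case-sensitive and not fully uniform (for example TEMP versus temp2), which is why the sketch uses an explicit mapping rather than deriving the directory from the method name.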

Run Examples

Running McClintock with test data

This repository also provides test data to check that your McClintock installation is working. The test/ directory includes a yeast reference genome (UCSC sacCer2) and an annotation of TEs in this version of the yeast genome from Carr et al. (2012). A pair of fastq files can be downloaded from SRA using the test/download_test_data.py script:

python3 test/download_test_data.py

A McClintock run on the test data (for example, the RUN command under Getting Started) should then report a summary similar to the following:

----------------------------------
MAPPED READ INFORMATION
----------------------------------
read1 sequence length:  94
read2 sequence length:  94
read1 reads:            18547818
read2 reads:            18558408
median insert size:     302
avg genome coverage:    268.185
----------------------------------

-----------------------------------------------------
METHOD          ALL       REFERENCE    NON-REFERENCE 
-----------------------------------------------------
ngs_te_mapper   35        21           14            
ngs_te_mapper2  87        49           38            
relocate        80        63           17            
relocate2       139       41           98            
temp            365       311          54            
temp2           367       311          56            
retroseq        58        0            58            
popoolationte   141       130          11            
popoolationte2  186       164          22            
te-locate       713       164          549           
teflon          414       390          24            
tebreak         60        0            60            
-----------------------------------------------------
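
When comparing your own test run against the table above, it can help to parse the summary into numbers. The Python sketch below reads the table text as shown and checks that, for each method, ALL equals REFERENCE plus NON-REFERENCE, which holds for every row above. The function name parse_summary is hypothetical, and the text is embedded as a string only to keep the example self-contained.

```python
SUMMARY = """\
-----------------------------------------------------
METHOD          ALL       REFERENCE    NON-REFERENCE
-----------------------------------------------------
ngs_te_mapper   35        21           14
ngs_te_mapper2  87        49           38
relocate        80        63           17
relocate2       139       41           98
temp            365       311          54
temp2           367       311          56
retroseq        58        0            58
popoolationte   141       130          11
popoolationte2  186       164          22
te-locate       713       164          549
teflon          414       390          24
tebreak         60        0            60
-----------------------------------------------------
"""


def parse_summary(text):
    """Return {method: (all, reference, non_reference)} from the table text."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have four columns and a numeric second column;
        # the header and the dashed separator lines are skipped.
        if len(parts) == 4 and parts[1].isdigit():
            method, total, ref, nonref = parts
            rows[method] = (int(total), int(ref), int(nonref))
    return rows


rows = parse_summary(SUMMARY)

# Rows where ALL does not equal REFERENCE + NON-REFERENCE.
inconsistent = [m for m, (total, ref, nonref) in rows.items() if total != ref + nonref]

# Method reporting the most non-reference insertions.
most_nonref = max(rows, key=lambda m: rows[m][2])

print("methods parsed:", len(rows))
print("inconsistent rows:", inconsistent)
print("most non-reference predictions:", most_nonref)
```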

Running McClintock with specific component methods
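
The -m/--methods option described under McClintock Usage takes a comma-delimited list of method names. As a hedged illustration, the Python sketch below checks such a list against the option names given in the help text before a run is launched. The function name parse_methods is hypothetical, and lowercasing the names is a choice made in this sketch; how McClintock itself treats letter case is not stated here.

```python
# Option names copied from the -m/--methods help text.
KNOWN_METHODS = (
    "ngs_te_mapper", "ngs_te_mapper2", "relocate", "relocate2",
    "temp", "temp2", "retroseq", "popoolationte", "popoolationte2",
    "te-locate", "teflon", "coverage", "trimgalore", "map_reads", "tebreak",
)


def parse_methods(value):
    """Split a comma-delimited method list and reject unknown names."""
    # Lowercasing is an assumption of this sketch, not documented behavior.
    requested = [name.strip().lower() for name in value.split(",") if name.strip()]
    unknown = [name for name in requested if name not in KNOWN_METHODS]
    if unknown:
        raise ValueError("unknown method(s): " + ", ".join(unknown))
    return requested


# The example list from the help text.
selected = parse_methods("relocate,TEMP,ngs_te_mapper")
print("-m", ",".join(selected))
```

The validated list can then be joined with commas and passed to mcclintock.py after -m, alongside the required -r, -c, and -1 options.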

Running McClintock with multiple samples using the same reference genome

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -1 /path/to/sample1_1.fastq.gz \
    -2 /path/to/sample1_2.fastq.gz \
    -p 4 \
    -o <output> \
    --resume

python3 mcclintock.py \
    -r test/sacCer2.fasta \
    -c test/sac_cer_TE_seqs.fasta \
    -1 /path/to/sample2_1.fastq.gz \
    -2 /path/to/sample2_2.fastq.gz \
    -p 4 \
    -o <output> \
    --resume

## etc ##
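
The repeated commands above differ only in the fastq paths, so they can be generated in a loop. The Python sketch below builds one command per sample with the same reference, consensus, output directory, and --resume flag as the commands above. The function name sample_commands and the assumption that fastq files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz under /path/to/ are illustrative only.

```python
import shlex


def sample_commands(samples, reference, consensus, out, procs=4):
    """Build one mcclintock.py argument list per sample (hypothetical helper)."""
    commands = []
    for name in samples:
        commands.append([
            "python3", "mcclintock.py",
            "-r", reference,
            "-c", consensus,
            # Assumed naming scheme for the paired fastq files.
            "-1", f"/path/to/{name}_1.fastq.gz",
            "-2", f"/path/to/{name}_2.fastq.gz",
            "-p", str(procs),
            "-o", out,
            "--resume",
        ])
    return commands


commands = sample_commands(
    ["sample1", "sample2"],
    reference="test/sacCer2.fasta",
    consensus="test/sac_cer_TE_seqs.fasta",
    out="/path/to/output/directory",
)

for cmd in commands:
    # Print each command; replace with subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Running the commands one after another, rather than in parallel, is a choice made in this sketch so that each run with --resume can attempt to reuse intermediate files left by the previous one.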

Citation

To cite McClintock 1, the general TE detector meta-pipeline concept, or the single synthetic insertion simulation framework, please use: Nelson, M.G., R.S. Linheiro & C.M. Bergman (2017) McClintock: An integrated pipeline for detecting transposable element insertions in whole genome shotgun sequencing data. G3. 7:2763-2778.

To cite McClintock 2 or the reproducible simulation system, please use: Chen, J., P.J. Basting, S. Han, D.J. Garfinkel & C.M. Bergman (2023) Reproducible evaluation of transposable element detectors with McClintock 2 guides accurate inference of Ty insertion patterns in yeast. Mobile DNA 14:8.

License


Copyright 2014-2023 Preston Basting, Jingxuan Chen, Shunhua Han, Michael G. Nelson, and Casey M. Bergman

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.