JinfengChen / RelocaTE2

RelocaTE2
MIT License
14 stars 7 forks source link

RelocaTE2: a high resolution transposable element insertion sites mapping tool for population resequencing

Introduction

RelocaTE2 is an improved version of RelocaTE (Robb et al., 2013). RelocaTE2 is highly sensitive and accurate in mapping transposable elements (TE) polymorphisms at single base pair resolution. RelocaTE2 uses the reads associated with TEs as seeds to cluster the read pairs on chromosomes. It automatically detects the target site duplication (TSD) of a TE insertion from alignments in each cluster, which enable high resolution mapping of TE polymorphisms. Unlike parallel searching of multi-TE elements in RelocaTE, RelocaTE2 searches all TEs in one cycle, which enables us find polymorphisms of thousands of TEs in an individual genome or large population in a reasonable timeframe without losing sensitivity and specificity.

RelocaTE2 Workflow

Installation

echo "Installation via conda (Linux as example)"

download miniconda from http://conda.pydata.org/miniconda.html and install

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh bash Miniconda2-latest-Linux-x86_64.sh source ~/.bashrc

install RelocaTE2 into miniconda isolated environment "RelocaTE2"

conda create --name RelocaTE2 -c bioconda relocate2

run test data

source activate RelocaTE2 cd /PATH_TO_miniconda/env/RelocaTE2 cd test_data bash run_test.sh > run_test.sh.log 2>&1 &


+ Troubleshooting
  - Installation of RelocaTE2 using install.sh or conda will install all the tools and packages required to run RelocaTE2. The executables of all tools are RelocaTE2/bin directory and record their paths in RelocaTE2/CONFIG. The main script of RelocaTE2, relocaTE2.py, searches for executables in $PATH; however, the executables from RelocaTE2/CONFIG will supercede $PATH. Users can modify RelocaTE2/CONFIG with paths to tools installed on their specific system to avoid problems. 
  - The Python module "pysam" is installed to RelocaTE2/lib. By setting PYTHONPATH=PATH\_OF\_RelocaTE2/lib/python2.7/site-packages, any other locally-installed versions of pysam are temporarily ignored and the supported version of pysam for RelocaTE2 is used instead.
  - In RelocaTE2, we align trimmed reads to reference genome by bwa v0.6.2, which allows paired-end reads have different names in fastq files. We recommend using install.sh or conda provided in RelocaTE2 to install these dependent tools including bwa v0.6.2.

## Quick Start Quide
  - set environment variables if failed to find executable PATH or pysam libary 
```shell
export PYTHONPATH=`pwd`/lib/python2.7/site-packages:$PYTHONPATH
export PATH=`pwd`/bin:$PATH 

RelocaTE2 Command Line Options

python scripts/relocaTE2.py --help
usage: relocaTE2.py [-h] [-b BAM] [-t TE_FASTA] [-d FQ_DIR] [-g GENOME_FASTA]
                   [-r REFERENCE_INS] [-o OUTDIR] [-s SIZE] [-c CPU]
                   [-1 MATE_1_ID] [-2 MATE_2_ID] [-u UNPAIRED_ID]
                   [--sample SAMPLE] [--aligner ALIGNER]
                   [--len_cut_match LEN_CUT_MATCH]
                   [--len_cut_trim LEN_CUT_TRIM] [--mismatch MISMATCH]
                   [--mismatch_junction MISMATCH_JUNCTION] [--step STEP]
                   [--run] [--split] [-v VERBOSE]

optional arguments:
  -h, --help            show this help message and exit
  -b BAM, --bam BAM     Name of BAM file of reads mapped reference genome
  -t TE_FASTA, --te_fasta TE_FASTA
                        Name of fasta sequence of repeat element
  -d FQ_DIR, --fq_dir FQ_DIR
                        Name of directory of input fastq sequence data
  -g GENOME_FASTA, --genome_fasta GENOME_FASTA
                        Name of fasta file of reference genome sequence
  -r REFERENCE_INS, --reference_ins REFERENCE_INS
                        Name of RepeatMasker TE annotation of reference genome
  -o OUTDIR, --outdir OUTDIR
                        Name of output directory where to put temperary and
                        final results
  -s SIZE, --size SIZE  Insert size of sequence library, default = 500
  -c CPU, --cpu CPU     Number of CPUs to use for multiplex, default = 1
  -1 MATE_1_ID, --mate_1_id MATE_1_ID
                        string define paired-end read1, default = "_1"
  -2 MATE_2_ID, --mate_2_id MATE_2_ID
                        string define paired-end read2, default = "_2"
  -u UNPAIRED_ID, --unpaired_id UNPAIRED_ID
                        string defining single-end reads, default = ".unPaired"
  --sample SAMPLE       string defining sample name which will present in output
                        GFF, default = "not_given"
  --aligner ALIGNER     aligner used to map reads to repeat elements,
                        default=blat
  --len_cut_match LEN_CUT_MATCH
                        length cutoff threshold for match between reads and
                        repeat elements. Large value will lead to less
                        sensitive but more accuracy, default = 10
  --len_cut_trim LEN_CUT_TRIM
                        length cutoff threshold for trimed reads after
                        trimming repeat sequence from reads. Large value will
                        lead to less sensitive but more accuracy, default = 10
  --mismatch MISMATCH   Number of mismatches allowed for matches between reads
                        and repeat elements, default = 2
  --mismatch_junction MISMATCH_JUNCTION
                        Number of mismatches allowed for matches between
                        junction reads and repeat elements, default = 2
  --step STEP           Number to control steps of pipeline, default =
                        "1234567"
  --dry_run             write shell scripts only while this script excute
  --run                 run while this script excute
  --split               split fastq into 1 million reads chunks to run blat/bwa jobs
  -v VERBOSE, --verbose VERBOSE
                        verbose grade to print out information in all scripts:
                        range from 0 to 4, default = 2

RelocaTE2 input

cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_1.fq | head -n 4 @read_500_1/1 TGTAAGAGTGCTATTGATGTTCGTTAATATTGTGCTCATATTTATAACATAATGATTTCTTTCATCTATACGAACAAATAGTACAAAATCCAAATACGAC + GGGGGGFCCFGHEEE:?<8;C:;@<6=@BFGHEEHHHGGEBEEEEFFH?F;<<@HHFECFFHHHGFFFHHHFBE<BCDCDFHFHHHHHGBHHEED3;DHH

cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_2.fq | head -n 4 @read_500_1/2 ATTAATAATTTAAAATCTATATTAACTAATGTACACTTACACTAGAACCGGTGGACCCATTATTTAAATCATCTGATATATTATTTGCTAATAAATAAAG + GGGFHHHHHHHGGGHHHHHHHEHFFEHFHHGHHHDHHHGHFFGEHHFHHHFGDC@=@GFGB4DEHHHECE>>8/@ABHHH?DGHBEFEDAEDHFBDBCC>

+ RepeatMasker results of TE annotation on reference genome (ref_te): default TE annotation of reference genome used to call TE insertions in reference genome and remove possible false positive non-reference TE insertions.
```shell
cat test_data/MSU7.Chr3.fa.RepeatMasker.out | head -n 4
  249   23.8  0.0  4.8  Chr3          1283     1366 (36412453) + CACTA-K             DNA/En-Spm             663    742   (389)     1  
 2250   19.7  0.7  0.9  Chr3          1973     2424 (36411395) + SZ-7_LTR            LTR/Gypsy                1    451   (131)     2  
 4798   12.5  5.6  1.0  Chr3          2425     3618 (36410201) + SZ_LTR              LTR/Gypsy                1   1249     (0)     3  
  820   15.4  0.0  0.0  Chr3          3619     3754 (36410065) + SZ-33_LTR           LTR/Gypsy              440    575     (0)     4

RelocaTE2 output

ID: unique id of TE insertions, repeat_"chromosome"_"start"_"end"

Name: TE family name of this insertion

TSD: target site duplicate predicted from read alignments

Note: definition of TE insertions, including Non-reference, Reference-Only and Shared.

Right_junction_reads: number of reads covering the junction of TE insertion on right side/downstream.

Left_junction_reads: number of reads covering the junction of TE insertion on left side/upstream.

Right_support_reads: number of reads not covering the junction of TE insertion, but supporting TE insertion by paired-end reads on right side/downstream.

Left_support_reads: number of reads not covering the junction of TE insertion, but supporting TE insertion by paired-end reads on left side/downstream.

Publications

  1. Chen, J., Wrightsman T, Wessler SR, Stajich JE. (2017) RelocaTE2: a high resolution transposable element insertion site mapping tool for population resequencing. PeerJ 5:e2942.
  2. Robb S.M., Lu L., Valencia E., Burnette J.M. 3rd., Okumoto Y., Wessler S.R., Stajich J.E. (2013) The use of RelocaTE and unassembled short reads to produce high-resolution snapshots of transposable element generated diversity in rice. G3 2013;3:949-957.