RelocaTE2 is an improved version of RelocaTE (Robb et al., 2013). RelocaTE2 is highly sensitive and accurate in mapping transposable elements (TE) polymorphisms at single base pair resolution. RelocaTE2 uses the reads associated with TEs as seeds to cluster the read pairs on chromosomes. It automatically detects the target site duplication (TSD) of a TE insertion from alignments in each cluster, which enable high resolution mapping of TE polymorphisms. Unlike parallel searching of multi-TE elements in RelocaTE, RelocaTE2 searches all TEs in one cycle, which enables us find polymorphisms of thousands of TEs in an individual genome or large population in a reasonable timeframe without losing sensitivity and specificity.
System requirements
Install
echo "Installation via github"
git clone https://github.com/JinfengChen/RelocaTE2.git
cd RelocaTE2
bash install.sh
cd test_data
bash run_test.sh > run_test.sh.log 2>&1 &
echo "Installation via conda (Linux as example)"
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh bash Miniconda2-latest-Linux-x86_64.sh source ~/.bashrc
conda create --name RelocaTE2 -c bioconda relocate2
source activate RelocaTE2 cd /PATH_TO_miniconda/env/RelocaTE2 cd test_data bash run_test.sh > run_test.sh.log 2>&1 &
+ Troubleshooting
- Installation of RelocaTE2 using install.sh or conda will install all the tools and packages required to run RelocaTE2. The executables of all tools are RelocaTE2/bin directory and record their paths in RelocaTE2/CONFIG. The main script of RelocaTE2, relocaTE2.py, searches for executables in $PATH; however, the executables from RelocaTE2/CONFIG will supercede $PATH. Users can modify RelocaTE2/CONFIG with paths to tools installed on their specific system to avoid problems.
- The Python module "pysam" is installed to RelocaTE2/lib. By setting PYTHONPATH=PATH\_OF\_RelocaTE2/lib/python2.7/site-packages, any other locally-installed versions of pysam are temporarily ignored and the supported version of pysam for RelocaTE2 is used instead.
- In RelocaTE2, we align trimmed reads to reference genome by bwa v0.6.2, which allows paired-end reads have different names in fastq files. We recommend using install.sh or conda provided in RelocaTE2 to install these dependent tools including bwa v0.6.2.
## Quick Start Quide
- set environment variables if failed to find executable PATH or pysam libary
```shell
export PYTHONPATH=`pwd`/lib/python2.7/site-packages:$PYTHONPATH
export PATH=`pwd`/bin:$PATH
cd test_data
#repeat element
repeat=RiceTE.fa
#reference genome
ref=MSU7.Chr3_2M.fa
#repeatmasker results of TE annotation on reference genome
ref_te=MSU7.Chr3_2M.fa.RepeatMasker.out
#directory where the input fastq format reads are located
fq_dir=MSU7.Chr3_2M.ALL_reads/
#output directory where RelocaTE2 write temperary and final output
outdir=MSU7.Chr3_2M.ALL_reads_RelocaTE2_outdir
python scripts/relocaTE2.py --te_fasta $repeat --genome_fasta $ref --fq_dir $fq_dir --outdir $outdir --reference_ins $ref_te --run
wc -l MSU7.Chr3_2M.ALL_reads_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.gff
196
bedtools window -w 10 -a MSU7.Chr3_2M.ALL.gff -b MSU7.Chr3_2M.ALL_reads_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.gff | wc -l
196
python scripts/relocaTE2.py --help
usage: relocaTE2.py [-h] [-b BAM] [-t TE_FASTA] [-d FQ_DIR] [-g GENOME_FASTA]
[-r REFERENCE_INS] [-o OUTDIR] [-s SIZE] [-c CPU]
[-1 MATE_1_ID] [-2 MATE_2_ID] [-u UNPAIRED_ID]
[--sample SAMPLE] [--aligner ALIGNER]
[--len_cut_match LEN_CUT_MATCH]
[--len_cut_trim LEN_CUT_TRIM] [--mismatch MISMATCH]
[--mismatch_junction MISMATCH_JUNCTION] [--step STEP]
[--run] [--split] [-v VERBOSE]
optional arguments:
-h, --help show this help message and exit
-b BAM, --bam BAM Name of BAM file of reads mapped reference genome
-t TE_FASTA, --te_fasta TE_FASTA
Name of fasta sequence of repeat element
-d FQ_DIR, --fq_dir FQ_DIR
Name of directory of input fastq sequence data
-g GENOME_FASTA, --genome_fasta GENOME_FASTA
Name of fasta file of reference genome sequence
-r REFERENCE_INS, --reference_ins REFERENCE_INS
Name of RepeatMasker TE annotation of reference genome
-o OUTDIR, --outdir OUTDIR
Name of output directory where to put temperary and
final results
-s SIZE, --size SIZE Insert size of sequence library, default = 500
-c CPU, --cpu CPU Number of CPUs to use for multiplex, default = 1
-1 MATE_1_ID, --mate_1_id MATE_1_ID
string define paired-end read1, default = "_1"
-2 MATE_2_ID, --mate_2_id MATE_2_ID
string define paired-end read2, default = "_2"
-u UNPAIRED_ID, --unpaired_id UNPAIRED_ID
string defining single-end reads, default = ".unPaired"
--sample SAMPLE string defining sample name which will present in output
GFF, default = "not_given"
--aligner ALIGNER aligner used to map reads to repeat elements,
default=blat
--len_cut_match LEN_CUT_MATCH
length cutoff threshold for match between reads and
repeat elements. Large value will lead to less
sensitive but more accuracy, default = 10
--len_cut_trim LEN_CUT_TRIM
length cutoff threshold for trimed reads after
trimming repeat sequence from reads. Large value will
lead to less sensitive but more accuracy, default = 10
--mismatch MISMATCH Number of mismatches allowed for matches between reads
and repeat elements, default = 2
--mismatch_junction MISMATCH_JUNCTION
Number of mismatches allowed for matches between
junction reads and repeat elements, default = 2
--step STEP Number to control steps of pipeline, default =
"1234567"
--dry_run write shell scripts only while this script excute
--run run while this script excute
--split split fastq into 1 million reads chunks to run blat/bwa jobs
-v VERBOSE, --verbose VERBOSE
verbose grade to print out information in all scripts:
range from 0 to 4, default = 2
cat test_data/MSU7.Chr3.fa | head -n 25
>Chr3
#########Skip lines of NNNNNN in sequence#########
CCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACC
CTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCCTAAACCC
TAAACCCTAAACCCTAAACCCTAAACCCTCAATCCTAACCCCTCAATCCT
AAGCCGTCTGCAGTGGCCATCGCAGGGTGAGGTGGGTGGAGGGTGAGTGA
cat test_data/RiceTE.fa | head -n 4
>mPing
GGCCAGTCACAATGGGGGTTTCACTGGTGTGTCATGCACATTTAATAGGGGTAAGACTGAATAAAAAATG
ATTATTTGCATGAAATGGGGATGAGAGAGAAGGAAAGAGTTTCATCCTGGTGAAACTCGTCAGCGTCGTT
TCCAAGTCCTCGGTAACAGAGTGAAACCCCCGTTGAGGCCGATTCGTTTCATTCACCGGATCTCTTGCGT
ls test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/*.fq
test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_1.fq
test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_2.fq
cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_1.fq | head -n 4 @read_500_1/1 TGTAAGAGTGCTATTGATGTTCGTTAATATTGTGCTCATATTTATAACATAATGATTTCTTTCATCTATACGAACAAATAGTACAAAATCCAAATACGAC + GGGGGGFCCFGHEEE:?<8;C:;@<6=@BFGHEEHHHGGEBEEEEFFH?F;<<@HHFECFFHHHGFFFHHHFBE<BCDCDFHFHHHHHGBHHEED3;DHH
cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500/MSU7.Chr3.ALL.rep1_reads_2X_100_500_2.fq | head -n 4 @read_500_1/2 ATTAATAATTTAAAATCTATATTAACTAATGTACACTTACACTAGAACCGGTGGACCCATTATTTAAATCATCTGATATATTATTTGCTAATAAATAAAG + GGGFHHHHHHHGGGHHHHHHHEHFFEHFHHGHHHDHHHGHFFGEHHFHHHFGDC@=@GFGB4DEHHHECE>>8/@ABHHH?DGHBEFEDAEDHFBDBCC>
+ RepeatMasker results of TE annotation on reference genome (ref_te): default TE annotation of reference genome used to call TE insertions in reference genome and remove possible false positive non-reference TE insertions.
```shell
cat test_data/MSU7.Chr3.fa.RepeatMasker.out | head -n 4
249 23.8 0.0 4.8 Chr3 1283 1366 (36412453) + CACTA-K DNA/En-Spm 663 742 (389) 1
2250 19.7 0.7 0.9 Chr3 1973 2424 (36411395) + SZ-7_LTR LTR/Gypsy 1 451 (131) 2
4798 12.5 5.6 1.0 Chr3 2425 3618 (36410201) + SZ_LTR LTR/Gypsy 1 1249 (0) 3
820 15.4 0.0 0.0 Chr3 3619 3754 (36410065) + SZ-33_LTR LTR/Gypsy 440 575 (0) 4
Structure of output directory
TE insertions
GFF format used in RelocaTE2
cat test_data/MSU7.Chr3.ALL.rep1_reads_2X_100_500_RelocaTE2_outdir/repeat/results/ALL.all_nonref_insert.gff | head -n 4
Chr3 RelocaTE2 rice 85173 85180 . - . ID=repeat_Chr3_85173_85180;Name=nDaiz;TSD=GTGCTTGT;Note=Non-reference, not found in reference;Right_junction_reads=1;Left_junction_reads=2;Right_support_reads=3;Left_support_reads=4;
Chr3 RelocaTE2 rice 239462 239464 . + . ID=repeat_Chr3_239462_239464;Name=Truncator;TSD=supporting_junction;Note=Non-reference, not found in reference;Right_junction_reads=2;Left_junction_reads=0;Right_support_reads=4;Left_support_reads=2;
Chr3 RelocaTE2 rice 429710 429717 . + . ID=repeat_Chr3_429710_429717;Name=nDaiz;TSD=CAAAATTT;Note=Non-reference, not found in reference;Right_junction_reads=2;Left_junction_reads=2;Right_support_reads=2;Left_support_reads=6;
Chr3 RelocaTE2 rice 522905 522907 . + . ID=repeat_Chr3_522905_522907;Name=Dasheng;TSD=supporting_junction;Note=Non-reference, not found in reference;Right_junction_reads=0;Left_junction_reads=3;Right_support_reads=2;Left_support_reads=0;
Attributes in field 8 of GFF:
ID: unique id of TE insertions, repeat_"chromosome"_"start"_"end"
Name: TE family name of this insertion
TSD: target site duplicate predicted from read alignments
Note: definition of TE insertions, including Non-reference, Reference-Only and Shared.
Right_junction_reads: number of reads covering the junction of TE insertion on right side/downstream.
Left_junction_reads: number of reads covering the junction of TE insertion on left side/upstream.
Right_support_reads: number of reads not covering the junction of TE insertion, but supporting TE insertion by paired-end reads on right side/downstream.
Left_support_reads: number of reads not covering the junction of TE insertion, but supporting TE insertion by paired-end reads on left side/downstream.