bergmanlab / ngs_te_mapper2

Software for detecting transposable element insertions from next-generation sequencing data
BSD 2-Clause "Simplified" License

ngs_te_mapper2: A program to identify transposable element insertions using next generation sequencing data

Introduction

ngs_te_mapper2 is a method for detecting transposable element (TE) insertions from short-read next-generation sequencing (NGS) data described in Han et al. (2021) Genetics 219(2):iyab113. ngs_te_mapper2 is a Python re-implementation of the ngs_te_mapper method originally described in Linheiro and Bergman (2012) PLoS ONE 7(2): e30008. ngs_te_mapper2 uses a three-stage procedure to annotate non-reference TEs as the span of target site duplication (TSD), following the framework described in Bergman (2012) Mob Genet Elements. 2:51-54.

ngs_te_mapper2 is written in Python 3 and is designed to run on Linux.

Installation

Install Miniconda

The recommended way to install ngs_te_mapper2 is with conda. If conda is not installed on your system, you can use the following steps to install Miniconda (Python 3.X). For more on conda, see the Conda User Guide.

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O $HOME/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda # silent mode
echo "export PATH=\$PATH:\$HOME/miniconda/bin" >> $HOME/.bashrc # add to .bashrc
source $HOME/.bashrc

conda init # close and reopen your terminal for this step to take effect
conda update conda # update conda

Install ngs_te_mapper2 using conda

ngs_te_mapper2 and all software dependencies can be installed using conda.

# We recommend installing ngs_te_mapper2 in a new conda environment
conda create -n ngs_te_mapper2 --channel bioconda ngs_te_mapper2

# Alternatively, you can install ngs_te_mapper2 in the currently active environment
conda install --channel bioconda ngs_te_mapper2

Run ngs_te_mapper2 on test dataset

A test dataset is provided in the test/ directory. To check that your ngs_te_mapper2 installation works, run ngs_te_mapper2 on this dataset; the run should take less than one minute on a single-threaded machine.

conda activate ngs_te_mapper2
cd test
ngs_te_mapper2 -o test_output -f reads.fastq -r ref_1kb.fasta -l library.fasta

NOTE: Activating a conda environment with conda activate myenv sometimes fails when run from a script submitted to a queueing system. This can be fixed by activating the environment in the script as shown below:

CONDA_BASE=$(conda info --base)
source ${CONDA_BASE}/etc/profile.d/conda.sh
conda activate ngs_te_mapper2

Usage

ngs_te_mapper2 required input files

Command line help page

usage: ngs_te_mapper2 [-h] -f READS -l LIBRARY -r REFERENCE [-a ANNOTATION]
                         [-n REGION] [-w WINDOW] [--min_mapq MIN_MAPQ]
                         [--min_af MIN_AF] [--tsd_max TSD_MAX]
                         [--gap_max GAP_MAX] [-m MAPPER] [-t THREAD] [-o OUT]
                         [-p PREFIX] [-k]

Script to detect non-reference TEs from single end short read data

required arguments:
  -f READS, --reads READS
                        raw reads in fastq or fastq.gz format, separated by
                        comma
  -l LIBRARY, --library LIBRARY
                        TE concensus sequence
  -r REFERENCE, --reference REFERENCE
                        reference genome

optional arguments:
  -h, --help            show this help message and exit
  -a ANNOTATION, --annotation ANNOTATION
                        reference TE annotation in GFF3 format (must have
                        'Target' attribute in the 9th column)
  -w WINDOW, --window WINDOW
                        merge window for identifying TE clusters (default =
                        10)
  --min_mapq MIN_MAPQ   minimum mapping quality of alignment (default = 20)
  --min_af MIN_AF       minimum allele frequency (default = 0.1)
  --tsd_max TSD_MAX     maximum TSD size (default = 25)
  --gap_max GAP_MAX     maximum gap size (default = 5)
  -t THREAD, --thread THREAD
                        thread (default = 1)
  -o OUT, --out OUT     output dir (default = '.')
  -p PREFIX, --prefix PREFIX
                        output prefix
  -k, --keep_files      If provided then all intermediate files will be kept
                        (default: remove intermediate files)
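
When ngs_te_mapper2 is launched from a pipeline rather than typed by hand, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only flags shown in the help page above; the function name build_command and its parameter names are hypothetical and are not part of ngs_te_mapper2.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    reads: Sequence[str],
    library: str,
    reference: str,
    out_dir: str = ".",
    annotation: Optional[str] = None,
    threads: int = 1,
    keep_files: bool = False,
) -> List[str]:
    """Assemble an ngs_te_mapper2 argument list (hypothetical helper)."""
    if not reads:
        raise ValueError("at least one read file is required")
    cmd = [
        "ngs_te_mapper2",
        "-f", ",".join(reads),  # multiple read files are separated by commas
        "-l", library,
        "-r", reference,
        "-o", out_dir,
        "-t", str(threads),
    ]
    if annotation is not None:
        cmd += ["-a", annotation]  # optional reference TE annotation (GFF3)
    if keep_files:
        cmd.append("-k")  # keep intermediate files
    return cmd


if __name__ == "__main__":
    # File names taken from the test dataset command in this README.
    cmd = build_command(
        ["reads.fastq"], "library.fasta", "ref_1kb.fasta", out_dir="test_output"
    )
    print(shlex.join(cmd))
    # prints: ngs_te_mapper2 -f reads.fastq -l library.fasta -r ref_1kb.fasta -o test_output -t 1
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the conda environment is active.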

Note: The optional reference TE annotation should, in theory, speed up the program. ngs_te_mapper2 expects the annotation in GFF3 format, and the 9th column must include a Target attribute that gives the TE family name. If you have a *.out annotation generated by RepeatMasker, you can use this utility script to convert it from *.out to GFF3 format.
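
Before passing an annotation with -a, it may be worth checking that every feature line carries a Target attribute in column 9. The sketch below is not part of ngs_te_mapper2; the function name and the example records (sequence names, coordinates, the family name roo) are made up for illustration, and it only checks that the attribute is present, not what ngs_te_mapper2 expects its value to look like.

```python
from typing import Iterable, List, Tuple


def find_records_missing_target(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for GFF3 feature lines without a Target attribute."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append(
                (number, f"expected 9 tab-separated columns, found {len(columns)}")
            )
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        keys = {
            field.split("=", 1)[0].strip()
            for field in columns[8].split(";")
            if "=" in field
        }
        if "Target" not in keys:
            problems.append((number, "no Target attribute in column 9"))
    return problems


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "chr2L\tRepeatMasker\tdispersed_repeat\t100\t600\t.\t+\t.\tTarget=roo 1 501",
    "chr2L\tRepeatMasker\tdispersed_repeat\t900\t1200\t.\t-\t.\tID=te2",
]

for number, reason in find_records_missing_target(example):
    print(f"line {number}: {reason}")
# prints: line 3: no Target attribute in column 9
```

For a real file, the same function can be called on an open file handle, for example find_records_missing_target(open("annotation.gff3")), where the file name is a placeholder.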

Output

ngs_te_mapper2 outputs reference and non-reference TE insertion predictions in BED format (0-based).
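
Because BED coordinates are 0-based, they need a small adjustment before they are compared with 1-based formats such as GFF3 or VCF. The sketch below assumes the standard BED convention of a 0-based start and an exclusive end; the function names are hypothetical and the example interval is made up.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive coordinates."""
    if start < 0 or end < start:
        raise ValueError(f"invalid BED interval: {start}-{end}")
    # Only the start shifts; the exclusive 0-based end equals the inclusive 1-based end.
    return start + 1, end


def interval_length(start: int, end: int) -> int:
    """Length in bases of a 0-based, half-open BED interval."""
    return end - start


print(bed_to_one_based(1000, 1005))  # (1001, 1005)
print(interval_length(1000, 1005))   # 5
```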

TE insertion annotation in BED format

ngs_te_mapper2 generates standard BED files, <sample>.ref.bed and <sample>.nonref.bed, which contain detailed information for each reference and non-reference TE insertion, respectively.

Column       Description
chromosome   The chromosome name where the TE insertion occurred
position     Starting breakpoint position of the TE insertion
end          Ending breakpoint position of the TE insertion
info         TE family, TSD, allele frequency, 3' support, 5' support and reference reads, separated by '|'
score        '.'
strand       Strand on which the TE insertion occurs
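
The six columns described above can be read with a few lines of Python. This is a minimal sketch, not code from ngs_te_mapper2: it assumes tab-separated columns as in standard BED, the names Insertion and parse_bed_line are hypothetical, and the example line is made up. The info column is split on '|' but left as a list of strings, since the exact order and type of its subfields should be checked against real output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Insertion:
    chromosome: str
    start: int       # 0-based start of the breakpoint interval
    end: int
    info: List[str]  # '|'-separated subfields, kept as strings
    score: str
    strand: str


def parse_bed_line(line: str) -> Insertion:
    """Parse one line of a <sample>.nonref.bed or <sample>.ref.bed file."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 6:
        raise ValueError(f"expected at least 6 columns, found {len(columns)}")
    chromosome, start, end, info, score, strand = columns[:6]
    return Insertion(chromosome, int(start), int(end), info.split("|"), score, strand)


# Hypothetical line, for illustration only.
record = parse_bed_line("chr2L\t1000\t1005\troo|ATCGA|0.5|3|2|5\t.\t+\n")
print(record.chromosome, record.start, record.end, record.strand)
print(record.info)
```

Keeping the info subfields as a list avoids hard-coding names for values whose layout has not been verified here.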

Log file output by ngs_te_mapper2

For each ngs_te_mapper2 run, a log file called <sample>.log is generated that records all the major steps in the program and error messages.

Getting help

Please use the GitHub Issues page if you have questions.

Citation

To cite ngs_te_mapper2 in publications, please use:

S. Han, P.J. Basting, G.B. Dias, A. Luhur, A.C. Zelhof, C.M. Bergman (2021) Transposable element profiles reveal cell line identity and loss of heterozygosity in Drosophila cell culture. Genetics 219(2):iyab113