Read about TEspeX in:
Ansaloni et al., Bioinformatics, 2022: https://doi.org/10.1093/bioinformatics/btac526
Ansaloni et al., BMC Bioinformatics, 2019: https://doi.org/10.1186/s12859-019-3088-7
TEspeX (Transposable Elements SPEcific eXpression) is a tool for the TE expression quantification from Illummina short RNA-seq reads. The rationale of the pipeline is to map the reads against a reference transcriptome composed by i) TE consensus sequences, ii) coding transcripts and iii) non-coding transcripts and to quantify the expression of TEs avoiding counting reads deriving from TE fragments embedded in coding/non-coding non-TE annotated transcripts.
Techincally, TEspeX:
Possible scenarios:
The pipeline has been written in python3 and it has been tested on Ubuntu, CentOS and Mac OS X systems.
0. Prerequisites
Please note how this manual assumes you have already installed and in $PATH:
You can easily check this by typing which <name>
. If they are not installed, please install them before installing TEspeX.
1. conda
which conda
which activate
which deactivate
If paths are printed, you can directly go to 2., otherwise follow the steps below:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda_dir=</path/to/miniconda3/>
echo "export PATH=\${PATH}:${conda_dir}/bin" >> ~/.bashrc
source ~/.bashrc
which conda
2. java (JDK)
java -version
If this prints something like:
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
Skip to 3.; otherwise:
conda install -c anaconda openjdk
3. git
which git
If the git version is printed you can directly go to 4., otherwise type:
conda install git
4. zlib
It is likely that you already have zlib installed and properly configured somewhere. To check this type:
echo $LD_LIBRARY_PATH
echo $CFLAGS
echo $LDFLAGS
If paths to zlib lib and include folders are returned, everything should be OK - skip to 5.
Otherwise:
cd </path/to/miniconda3/>
wget https://www.zlib.net/fossils/zlib-1.2.11.tar.gz
tar -zxvf zlib-1.2.11.tar.gz
cd zlib-1.2.11/
./configure --prefix=$PWD/packages
make
make install
export LD_LIBRARY_PATH=$PWD/packages/lib/:$LD_LIBRARY_PATH
export CFLAGS="-I$PWD/packages/include"
export LDFLAGS="-L$PWD/packages/lib"
5. clone TEspeX and create working directory variable
cd </path/where/install/TEspeX>
git clone https://github.com/fansalon/TEspeX
cd TEspeX/
tespex=$PWD
6. Picard
A file called 'picard.jar' is contained in the 'bin/picard' directory.\ To check whether java is properly installed on your machine and picard properly works, type:
java -jar $tespex/bin/picard/picard.jar
If the picard help is printed everything is fine. If an error rises java is not (properly) installed on your machine. Possible solutions: i) go back to 2. and check conda has successfully installed java, ii) check that the java installed by conda is in $PATH and iii) check that the invoked java is really the one installed by conda.
7. STAR
STAR executables should be in the 'bin/STAR-2.6.0c' directory.\ To check whether STAR is properly working, type:
$tespex/bin/STAR-2.6.0c/STAR --version
this should return:
STAR_2.6.0c
8. samtools
install samtools-1.3.1
cd $tespex/bin
wget -O samtools-1.3.1.tar.bz2 https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
tar xjf samtools-1.3.1.tar.bz2
cd samtools-1.3.1
./configure --prefix=$PWD/
(if during the --configure step you encounter an error like: "configure: error: curses development files not found" please relaunch the command adding the "--without-curses" flag)
make
make install
bin/samtools --version
this should return something like:
samtools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.
9. create a conda environment where the required version of python, pysam and pandas are installed
To create the conda environment and install the required libraries type:
conda create -n TEspeX_deps --override-channels -c bioconda -c defaults python=3.6 pandas=0.23.0 pysam'>=0.15.0,<=0.15.1'
## --> you will be asked to let conda download and install new packages: type Y
# activate the environment - to be done every time TEspeX is used
# **PLEASE DO NOT CHANGE THE ENVIRONMENT NAME AS IT WILL COMPROMISE THE FUNCTIONING OF THE PIPELINE**
source activate TEspeX_deps
# to check everything properly worked
which python3
## --> /path/to/envs/TEspeX_deps/bin/python3
which pip3
## --> /path/to/envs/TEspeX_deps/bin/pip3
python3 --version
## --> Python 3.6.13 :: Anaconda, Inc.
pip3 --version
## --> pip 21.1.3 from /path/to/envs/TEspeX_deps/lib/python3.6/site-packages/pip (python 3.6)
# deactivate the environment
conda deactivate
10. Test TEspeX has been successfully installed
In the folder 'example' a copy of the files used to perform the TE expression analysis on 2 C. elegans embryonic fastq files (Tintori SC, et al. - Dev. Cell - 2016 - https://www.ncbi.nlm.nih.gov/pubmed/27554860) is deposited. Please, please, DO test whether the pipeline is working properly following the steps below:
cd $tespex
ls $tespex/example/*.fastq.gz > $tespex/example/reads.txt
source activate TEspeX_deps
python3 TEspeX.py --TE example/ce.Dfam.fa.gz \
--cdna example/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz \
--ncrna example/Caenorhabditis_elegans.WBcel235.ncrna.fa.gz \
--sample example/reads.txt --paired F --length 50 --out test --strand no
check the output files are as expected:
cd $tespex
diff test/outfile.txt example/outfile.txt
diff test/mapping_stats.txt example/mapping_stats.txt
If nothing is printed it means all went fine. Otherwise, error has rose.
# If single-ended
/path/to/fq1
/path/to/fq2
/path/to/fq3
/path/to/fq4
/path/to/fq1_1 /path/to/fq1_2 /path/to/fq2_1 /path/to/fq2_2 /path/to/fq3_1 /path/to/fq3_2 /path/to/fq4_1 /path/to/fq4_2
# Output files
* outfile.txt: txt file containing the raw counts of reads mapping specifically on TEs. The first column contains the TE names as they are in the TE consensus fasta file, the other columns contain the read counts for each fq input file
* mapping_stats.txt: txt file containing mapping statistics
# How to run TEspeX
TEspeX can be run calling directly the script from the command line. However, for those users disposing a queue managment system we suggest to use a wrapper script that can launch several jobs in parallel saving consistent amount of time. Wrapper file we use on our slurm queue managment system is provided (see the section TEspeX in wrapper mode for more info).
# TEspeX in standard mode
cd $tespex source activate TEspeX_deps python3 TEspeX.py --help
This command shows the help that should be something very similar to:
usage: TEspeX.py [-h] --TE TE --cdna CDNA --ncrna NCRNA --sample SAMPLE --paired PAIRED --length LENGTH --out OUT --strand STRAND [--num_threads NUM_THREADS] [--remove REMOVE] [--index INDEX] [--mask MASK] [--version]
optional arguments: -h, --help show this help message and exit --TE TE fa/fa.gz file containing TE consensus sequences in fasta format [required] --cdna CDNA fa/fa.gz file containing cdna Ensembl sequences in fasta format [required] --ncrna NCRNA fa/fa.gz file containing ncrna Ensembl sequences in fasta format [required] --sample SAMPLE txt file containing fq/fq.gz FULL PATHS. If reads are single end, one path should be written in each line. If reads are paired end the two mates should be written in the same line separated by \t [required] --paired PAIRED T (true) or F (false). T means the reads are paired and consequently the sample file is expected to contain 2 columns. F means the reads are not paired, sample file is expected to contain 1 single column [required] --length LENGTH length of the read given as input. This is used to calculate STAR index parameters. If your fq/fq.gz file contains reads with different length specify the most frequent read length [required] --out OUT directory where the output files will be written. This directory is created by the pipeline, specificy a non- yet-existing directory --strand STRAND strandeness of the RNAseq library. no = unstranded/htseqcount 'no', yes = htseqcount 'yes', reverse = htseqcount 'reverse' --num_threads NUM_THREADS number of threads used by STAR and samtools [2] --remove REMOVE T (true) or F (false). If this parameter is set to T all the bam files are removed. If it is F they are not removed [T] --index INDEX If you want TEspeX to build the index for you, leave the default value [recommended]. Otherwise provide FULL path to a directoray containing STAR indexes [not_recommended] [F] --mask MASK fasta file containing sequences to be masked. If this file is provided, the sequences contained within it are considered as coding/non-coding transcripts and are added to the --cdna and --ncrna fasta files. This might be of help if the users wish to consider some specific regions as belonging to coding/non-coding transcripts even though they are not reported in --cdna and --ncrna fasta files. (e.g., N kb downstream to the transcript TTS for a better handling of readthrough process or non-genic TE-derived sequences known to be passively transcribed from criptic promoters). [F] --multimap MULTIMAP maximum number of loci a read/read pair is allowed to map to. This value will be provided to STAR --outFilterMultimapNmax and --winAnchorMultimapNmax parameters. It is warmly suggested not to change the default value [10] --version show the version number and exit
All the arguments, except fot ```--num_threads```, ```--remove```, ```--index```, ```--mask``` and ```--multimap``` are required.\
We suggest to use as argument of ```--TE``` argument a fasta file containing TE consensus sequences and as arguments of the ```--cdna``` and ```--ncrna``` arguments the transcriptome files containing cdna and ncrna from ensembl (or genecode if working with human or mouse data).\
The ```--length``` of the read is only needed to build the index of the reference transcriptome - in case of trimmed reads just provide the most frequent read lenght.\
As ```--strand``` TEspeX expects the same nomenclature as the one used by htseq-count (if you are unsure about the strandedness of your data please take a look at: https://chipster.csc.fi/manual/library-type-summary.html) .\
By default ```--num_threads``` is set to 2, ```--remove``` is set to T by default (meaning all the bam files are removed) and ```--index``` is set to F (meaning TEspeX will take care about index building).
# TEspeX in wrapper mode
The file wrapper_slurm.py is the **SLURM** wrapper file.
To see the help, type:
cd $tespex source activate TEspeX_deps python3 wrapper_slurm.py -h
This should print to screen:
usage: wrapper_slurm.py [-h] --script SCRIPT --TE TE --cdna CDNA --ncrna NCRNA --sample SAMPLE --paired PAIRED --length LENGTH --out OUT --strand STRAND --job JOB --q Q --walltime WALLTIME [--num_threads NUM_THREADS] [--remove REMOVE] [--mask MASK] [--module MODULE] [--version]
optional arguments: -h, --help show this help message and exit --script SCRIPT path to the TEspeX*.py pipeline for calculation of TE expression [required] --TE TE fa/fa.gz file containing TE consensus sequences in fasta format [required] --cdna CDNA fa/fa.gz file containing cdna Ensembl sequences in fasta format [required] --ncrna NCRNA fa/fa.gz file containing ncrna Ensembl sequences in fasta format [required] --sample SAMPLE txt file containing fq/fq.gz FULL PATHS. If reads are single end, one path should be written in each line. If reads are paired end the two mates should be written in the same line separated by \t [required] --paired PAIRED T (true) or F (false). T means the reads are paired and consequently the sample file is expected to contain 2 columns. F means the reads are not paired, sample file is expected to contain 1 single column [required] --length LENGTH length of the read given as input. This is used to calculateSTAR index parameters. If your fq/fq.gz file contains reads with different length specify the most frequent read length [required] --out OUT directory where the output files will be written. This directory is created by the pipeline, specificy a non- yet-existing directory --strand STRAND strandeness of the RNAseq library. no = unstranded/htseqcount 'no', yes = htseqcount 'yes', reverse = htseqcount 'reverse' --job JOB number of jobs that can be run at the same time --q Q name of the the queue of your SLURM system you want TEspeX to be run (SBATCH -p parameter) [required] --walltime WALLTIME walltime of each job TEspeX will launch on your SLURM system reported in hh:mm:ss format (SBATCH -t parameter) [required] --num_threads NUM_THREADS number of threads used by STAR and samtools. Minimum number of threads: 4 [4] --remove REMOVE T (true) or F (false). If this parameter is set to T all the bam files are removed. If it is F they are not removed [T] --mask MASK fasta file containing sequences to be masked. If this file is provided, the sequences contained within it are considered as coding/non-coding transcripts and are added to the --cdna and --ncrna fasta files. This might be of help if the users wish to consider some specific regions as belonging to coding/non-coding transcripts even though they are not reported in --cdna and --ncrna fasta files. (e.g., N kb downstream to the transcript TTS for a better handling of readthrough process or non-genic TE-derived sequences known to be passively transcribed from criptic promoters). [F] --module MODULE list of modules to be uploaded within each job by 'module load'. If more than one modules need to be loaded provide a comma separated list (e.g., java,samtools) Reported modules should be available through the 'module av' command. [F] --multimap MULTIMAP maximum number of loci a read/read pair is allowed to map to. This value will be provided to STAR --outFilterMultimapNmax and --winAnchorMultimapNmax parameters. It is warmly suggested not to change the default value [10] --version show the version number and exit
The parameters are exactly the same of TEspeX.py script except for 3 parameters:
* --module: it might be that while running job some installed modules need to be upoloaded. If this is the case, provide the list of modules to this parameter. TEspeX will then take care to upload the listed modules in each job by ```module load <module name>```
* --script: it requires the path to TEspeX.py script
* --q: the name of the SLURM queue to be used for the TEspeX job
* --walltime: walltime of each job TEspeX will launch on your SLURM system reported in hh:mm:ss format (SBATCH -t parameter)
* --job: it requires the number of jobs you want to run at the same time. This depends on the settings of your system. If you can run 40 jobs at the same time and you set ```--job 40``` and you have 80 fq/fq.gz written in the txt file given as input to ```--sample``` the wrapper_slurm.py script will:
* subset the ```--sample``` file in 40 sub-files containing 2 (80/40) fq/fq.gz each (named: sample0, sample1, .., sample40)
* create 40 folders (named: 0, 1, .., 40)
* launch 40 different jobs (named: job_0, job_1, .., job_40)
* when all the jobs have finished the cleanup.py job is automatically launched and all the output files are merged together
When all is done you should have in your ```--out``` folder: slurm output files and 4 output files (TE_transc_reference.fa, TE_transc_reference.fai and the 2 output files mapping_stats_total.txt and outfile_total.txt) and 3 directories (index/, mappings/ and tmp/). In the mappings/ directory there is one directory for each fq/fq.gz analyzed containing all the temporary output files.
# How to test for differentially expressed TEs
TEspeX provides a raw-count output file that can be potentially used as input for any of the several tools developed to detect differential expression of genes among two biological conditions (e.g. DESeq2 and edgeR). However, the large majority (all?) of such tools estimate the library size (i.e., sequencing depth) of each analysed sample by summing together the reads mapped on all the genes. While this assumption is perfectly working when analysing gene expression data, this might be unprecise when analysing TE expression data (i.e. the sum of the reads mapping on TE consensus is not a good aproximation of the total library size).
Thus, we **strongly** recommend not to allow the tool to automatically calculate the library size, providing instead as library size the total number of reads TEspeX has succesfully mapped on the reference transcriptome (i.e., TE consensus+cdna+ncrna). This information is contained in the 3rd column of the mapping_stats.txt/mapping_stats_total.txt TEspeX outputs in the working directory.
In edgeR this can be easily done with the following commands:
norm <- read.table("mapping_stats_total.txt",header=T,sep='\t') y <- DGEList(counts=counts,group=group,lib.size = norm$mapped) # assuming conuts is the matrix containing the raw counts and group the factor containing metadata information
It is likely that **the perfect tool** for testing for DE TEs does not exist yet as most of the assumptions made on gene expression data are not necessarily true in the TE scenario. We have internally tested the DE analysis downstream to TEspeX with i) edgeR, ii) DESeq2 and iii) applying a Welch t.test to normalised counts (RPM: raw counts / tot mapped reads \*1M).
Our tests suggest that edgeR works slightly better than both DESeq2 (as it better handles the TEs showing no expression in multiple samples) and t.test (as it is less sensible to the sample size) and we thus suggest to use edgeR. DE testing downstream to TEspeX has been, nevertheless, tested with all the 3 methods providing consistent results in all the tested scenarios.
If you are interested in the development of **the perfect tool** for testing for DE TEs - or at least to try - feel free to contact Federico at (federico.ansaloni@gmail.com).
# Development and help
The TEspeX pipeline has been developed by Federico Ansaloni, former PhD student in the Computational Genomics lab (SISSA/ISAS - Trieste - Italy) of prof. Remo Sanges (https://www.sangeslab.eu). \
Nicolo' Gualandi and Mauro Esposito contributed to the pipeline testing.\
To report bugs or suggestions please feel free to write to the TEspeX supporting group https://groups.google.com/g/tespex-help
# Overview
If you found TEspeX usefull for your research, please cite:
Ansaloni *et al.*, Bioinformatics, 2022: https://doi.org/10.1093/bioinformatics/btac526<br />
Ansaloni *et al.*, BMC Bioinformatics, 2019: https://doi.org/10.1186/s12859-019-3088-7