fansalon / TEspeX

TEspeX - pipeline for Transposable Elements expression quantification
GNU General Public License v3.0
19 stars 2 forks source link

TEspeX

Read about TEspeX in:

Ansaloni et al., Bioinformatics, 2022: https://doi.org/10.1093/bioinformatics/btac526
Ansaloni et al., BMC Bioinformatics, 2019: https://doi.org/10.1186/s12859-019-3088-7

Overview

TEspeX (Transposable Elements SPEcific eXpression) is a tool for the TE expression quantification from Illummina short RNA-seq reads. The rationale of the pipeline is to map the reads against a reference transcriptome composed by i) TE consensus sequences, ii) coding transcripts and iii) non-coding transcripts and to quantify the expression of TEs avoiding counting reads deriving from TE fragments embedded in coding/non-coding non-TE annotated transcripts.

Techincally, TEspeX:

Possible scenarios:

The pipeline has been written in python3 and it has been tested on Ubuntu, CentOS and Mac OS X systems.

How to install TEspeX on Unix systems

Install TEspeX and TEspeX prerequisites

0. Prerequisites

Please note how this manual assumes you have already installed and in $PATH:

You can easily check this by typing which <name>. If they are not installed, please install them before installing TEspeX.

1. conda

If paths are printed, you can directly go to 2., otherwise follow the steps below:

2. java (JDK)

If this prints something like:

openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)

Skip to 3.; otherwise:

3. git

If the git version is printed you can directly go to 4., otherwise type:

4. zlib

It is likely that you already have zlib installed and properly configured somewhere. To check this type:

If paths to zlib lib and include folders are returned, everything should be OK - skip to 5.

Otherwise:

5. clone TEspeX and create working directory variable

6. Picard

A file called 'picard.jar' is contained in the 'bin/picard' directory.\ To check whether java is properly installed on your machine and picard properly works, type:

If the picard help is printed everything is fine. If an error rises java is not (properly) installed on your machine. Possible solutions: i) go back to 2. and check conda has successfully installed java, ii) check that the java installed by conda is in $PATH and iii) check that the invoked java is really the one installed by conda.

7. STAR

STAR executables should be in the 'bin/STAR-2.6.0c' directory.\ To check whether STAR is properly working, type:

this should return: STAR_2.6.0c

8. samtools

install samtools-1.3.1

cd $tespex/bin
wget -O samtools-1.3.1.tar.bz2 https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
tar xjf samtools-1.3.1.tar.bz2
cd samtools-1.3.1
./configure --prefix=$PWD/
(if during the --configure step you encounter an error like: "configure: error: curses development files not found" please relaunch the command adding the "--without-curses" flag)
make
make install
bin/samtools --version

this should return something like:

samtools 1.3.1
Using htslib 1.3.1
Copyright (C) 2016 Genome Research Ltd.

9. create a conda environment where the required version of python, pysam and pandas are installed

To create the conda environment and install the required libraries type:

conda create -n TEspeX_deps --override-channels -c bioconda -c defaults python=3.6 pandas=0.23.0 pysam'>=0.15.0,<=0.15.1'
## --> you will be asked to let conda download and install new packages: type Y

# activate the environment - to be done every time TEspeX is used
# **PLEASE DO NOT CHANGE THE ENVIRONMENT NAME AS IT WILL COMPROMISE THE FUNCTIONING OF THE PIPELINE**
source activate TEspeX_deps

# to check everything properly worked
which python3
## --> /path/to/envs/TEspeX_deps/bin/python3
which pip3
## --> /path/to/envs/TEspeX_deps/bin/pip3
python3 --version
## --> Python 3.6.13 :: Anaconda, Inc.
pip3 --version
## --> pip 21.1.3 from /path/to/envs/TEspeX_deps/lib/python3.6/site-packages/pip (python 3.6)

# deactivate the environment
conda deactivate

10. Test TEspeX has been successfully installed

In the folder 'example' a copy of the files used to perform the TE expression analysis on 2 C. elegans embryonic fastq files (Tintori SC, et al. - Dev. Cell - 2016 - https://www.ncbi.nlm.nih.gov/pubmed/27554860) is deposited. Please, please, DO test whether the pipeline is working properly following the steps below:

Input files

If paired-ended

/path/to/fq1_1 /path/to/fq1_2 /path/to/fq2_1 /path/to/fq2_2 /path/to/fq3_1 /path/to/fq3_2 /path/to/fq4_1 /path/to/fq4_2


# Output files
* outfile.txt: txt file containing the raw counts of reads mapping specifically on TEs. The first column contains the TE names as they are in the TE consensus fasta file, the other columns contain the read counts for each fq input file
* mapping_stats.txt: txt file containing mapping statistics

# How to run TEspeX
TEspeX can be run calling directly the script from the command line. However, for those users disposing a queue managment system we suggest to use a wrapper script that can launch several jobs in parallel saving consistent amount of time. Wrapper file we use on our slurm queue managment system is provided (see the section TEspeX in  wrapper mode for more info).

# TEspeX in standard mode

cd $tespex source activate TEspeX_deps python3 TEspeX.py --help


This command shows the help that should be something very similar to:

usage: TEspeX.py [-h] --TE TE --cdna CDNA --ncrna NCRNA --sample SAMPLE --paired PAIRED --length LENGTH --out OUT --strand STRAND [--num_threads NUM_THREADS] [--remove REMOVE] [--index INDEX] [--mask MASK] [--version]

optional arguments: -h, --help show this help message and exit --TE TE fa/fa.gz file containing TE consensus sequences in fasta format [required] --cdna CDNA fa/fa.gz file containing cdna Ensembl sequences in fasta format [required] --ncrna NCRNA fa/fa.gz file containing ncrna Ensembl sequences in fasta format [required] --sample SAMPLE txt file containing fq/fq.gz FULL PATHS. If reads are single end, one path should be written in each line. If reads are paired end the two mates should be written in the same line separated by \t [required] --paired PAIRED T (true) or F (false). T means the reads are paired and consequently the sample file is expected to contain 2 columns. F means the reads are not paired, sample file is expected to contain 1 single column [required] --length LENGTH length of the read given as input. This is used to calculate STAR index parameters. If your fq/fq.gz file contains reads with different length specify the most frequent read length [required] --out OUT directory where the output files will be written. This directory is created by the pipeline, specificy a non- yet-existing directory --strand STRAND strandeness of the RNAseq library. no = unstranded/htseqcount 'no', yes = htseqcount 'yes', reverse = htseqcount 'reverse' --num_threads NUM_THREADS number of threads used by STAR and samtools [2] --remove REMOVE T (true) or F (false). If this parameter is set to T all the bam files are removed. If it is F they are not removed [T] --index INDEX If you want TEspeX to build the index for you, leave the default value [recommended]. Otherwise provide FULL path to a directoray containing STAR indexes [not_recommended] [F] --mask MASK fasta file containing sequences to be masked. If this file is provided, the sequences contained within it are considered as coding/non-coding transcripts and are added to the --cdna and --ncrna fasta files. This might be of help if the users wish to consider some specific regions as belonging to coding/non-coding transcripts even though they are not reported in --cdna and --ncrna fasta files. (e.g., N kb downstream to the transcript TTS for a better handling of readthrough process or non-genic TE-derived sequences known to be passively transcribed from criptic promoters). [F] --multimap MULTIMAP maximum number of loci a read/read pair is allowed to map to. This value will be provided to STAR --outFilterMultimapNmax and --winAnchorMultimapNmax parameters. It is warmly suggested not to change the default value [10] --version show the version number and exit


All the arguments, except fot ```--num_threads```, ```--remove```, ```--index```, ```--mask``` and ```--multimap``` are required.\
We suggest to use as argument of ```--TE``` argument a fasta file containing TE consensus sequences and as arguments of the ```--cdna``` and ```--ncrna``` arguments the transcriptome files containing cdna and ncrna from ensembl (or genecode if working with human or mouse data).\
The ```--length``` of the read is only needed to build the index of the reference transcriptome - in case of trimmed reads just provide the most frequent read lenght.\
As ```--strand``` TEspeX expects the same nomenclature as the one used by htseq-count (if you are unsure about the strandedness of your data please take a look at: https://chipster.csc.fi/manual/library-type-summary.html) .\
By default ```--num_threads``` is set to 2,  ```--remove``` is set to T by default (meaning all the bam files are removed)  and ```--index``` is set to F (meaning TEspeX will take care about index building).

# TEspeX in wrapper mode

The file wrapper_slurm.py is the **SLURM** wrapper file. 

To see the help, type:

cd $tespex source activate TEspeX_deps python3 wrapper_slurm.py -h


This should print to screen:

usage: wrapper_slurm.py [-h] --script SCRIPT --TE TE --cdna CDNA --ncrna NCRNA --sample SAMPLE --paired PAIRED --length LENGTH --out OUT --strand STRAND --job JOB --q Q --walltime WALLTIME [--num_threads NUM_THREADS] [--remove REMOVE] [--mask MASK] [--module MODULE] [--version]

optional arguments: -h, --help show this help message and exit --script SCRIPT path to the TEspeX*.py pipeline for calculation of TE expression [required] --TE TE fa/fa.gz file containing TE consensus sequences in fasta format [required] --cdna CDNA fa/fa.gz file containing cdna Ensembl sequences in fasta format [required] --ncrna NCRNA fa/fa.gz file containing ncrna Ensembl sequences in fasta format [required] --sample SAMPLE txt file containing fq/fq.gz FULL PATHS. If reads are single end, one path should be written in each line. If reads are paired end the two mates should be written in the same line separated by \t [required] --paired PAIRED T (true) or F (false). T means the reads are paired and consequently the sample file is expected to contain 2 columns. F means the reads are not paired, sample file is expected to contain 1 single column [required] --length LENGTH length of the read given as input. This is used to calculateSTAR index parameters. If your fq/fq.gz file contains reads with different length specify the most frequent read length [required] --out OUT directory where the output files will be written. This directory is created by the pipeline, specificy a non- yet-existing directory --strand STRAND strandeness of the RNAseq library. no = unstranded/htseqcount 'no', yes = htseqcount 'yes', reverse = htseqcount 'reverse' --job JOB number of jobs that can be run at the same time --q Q name of the the queue of your SLURM system you want TEspeX to be run (SBATCH -p parameter) [required] --walltime WALLTIME walltime of each job TEspeX will launch on your SLURM system reported in hh:mm:ss format (SBATCH -t parameter) [required] --num_threads NUM_THREADS number of threads used by STAR and samtools. Minimum number of threads: 4 [4] --remove REMOVE T (true) or F (false). If this parameter is set to T all the bam files are removed. If it is F they are not removed [T] --mask MASK fasta file containing sequences to be masked. If this file is provided, the sequences contained within it are considered as coding/non-coding transcripts and are added to the --cdna and --ncrna fasta files. This might be of help if the users wish to consider some specific regions as belonging to coding/non-coding transcripts even though they are not reported in --cdna and --ncrna fasta files. (e.g., N kb downstream to the transcript TTS for a better handling of readthrough process or non-genic TE-derived sequences known to be passively transcribed from criptic promoters). [F] --module MODULE list of modules to be uploaded within each job by 'module load'. If more than one modules need to be loaded provide a comma separated list (e.g., java,samtools) Reported modules should be available through the 'module av' command. [F] --multimap MULTIMAP maximum number of loci a read/read pair is allowed to map to. This value will be provided to STAR --outFilterMultimapNmax and --winAnchorMultimapNmax parameters. It is warmly suggested not to change the default value [10] --version show the version number and exit


The parameters are exactly the same of TEspeX.py script except for 3 parameters:
* --module: it might be that while running job some installed modules need to be upoloaded. If this is the case, provide the list of modules to this parameter. TEspeX will then take care to upload the listed modules in each job by ```module load <module name>```
* --script: it requires the path to TEspeX.py script
* --q: the name of the SLURM queue to be used for the TEspeX job
* --walltime: walltime of each job TEspeX will launch on your SLURM system reported in hh:mm:ss format (SBATCH -t parameter)
* --job: it requires the number of jobs you want to run at the same time. This depends on the settings of your system. If you can run 40 jobs at the same time and you set ```--job 40``` and you have 80 fq/fq.gz written in the txt file given as input to ```--sample``` the wrapper_slurm.py script will: 
    * subset the ```--sample``` file in 40 sub-files containing 2 (80/40) fq/fq.gz each (named: sample0, sample1, .., sample40)
    * create 40 folders (named: 0, 1, .., 40)
    * launch 40 different jobs (named: job_0, job_1, .., job_40)
    * when all the jobs have finished the cleanup.py job is automatically launched and all the output files are merged together

When all is done you should have in your ```--out``` folder: slurm output files and 4 output files (TE_transc_reference.fa, TE_transc_reference.fai and the 2 output files mapping_stats_total.txt and outfile_total.txt) and 3 directories (index/, mappings/ and tmp/). In the mappings/ directory there is one directory for each fq/fq.gz analyzed containing all the temporary output files.

# How to test for differentially expressed TEs
TEspeX provides a raw-count output file that can be potentially used as input for any of the several tools developed to detect differential expression of genes among two biological conditions (e.g. DESeq2 and edgeR). However, the large majority (all?) of such tools estimate the library size (i.e., sequencing depth) of each analysed sample by summing together the reads mapped on all the genes. While this assumption is perfectly working when analysing gene expression data, this might be unprecise when analysing TE expression data (i.e. the sum of the reads mapping on TE consensus is not a good aproximation of the total library size).

Thus, we **strongly** recommend not to allow the tool to automatically calculate the library size, providing instead as library size the total number of reads TEspeX has succesfully mapped on the reference transcriptome (i.e., TE consensus+cdna+ncrna). This information is contained in the 3rd column of the mapping_stats.txt/mapping_stats_total.txt TEspeX outputs in the working directory.

In edgeR this can be easily done with the following commands:

norm <- read.table("mapping_stats_total.txt",header=T,sep='\t') y <- DGEList(counts=counts,group=group,lib.size = norm$mapped) # assuming conuts is the matrix containing the raw counts and group the factor containing metadata information



It is likely that **the perfect tool** for testing for DE TEs does not exist yet as most of the assumptions made on gene expression data are not necessarily true in the TE scenario. We have internally tested the DE analysis downstream to TEspeX with i) edgeR, ii) DESeq2 and iii) applying a Welch t.test to normalised counts (RPM: raw counts / tot mapped reads \*1M).

Our tests suggest that edgeR works slightly better than both DESeq2 (as it better handles the TEs showing no expression in multiple samples) and t.test (as it is less sensible to the sample size) and we thus suggest to use edgeR. DE testing downstream to TEspeX has been, nevertheless, tested with all the 3 methods providing consistent results in all the tested scenarios.

If you are interested in the development of **the perfect tool** for testing for DE TEs - or at least to try - feel free to contact Federico at (federico.ansaloni@gmail.com).

# Development and help
The TEspeX pipeline has been developed by Federico Ansaloni, former PhD student in the Computational Genomics lab (SISSA/ISAS - Trieste - Italy) of prof. Remo Sanges (https://www.sangeslab.eu). \
Nicolo' Gualandi and Mauro Esposito contributed to the pipeline testing.\
To report bugs or suggestions please feel free to write to the TEspeX supporting group https://groups.google.com/g/tespex-help

# Overview
If you found TEspeX usefull for your research, please cite:

Ansaloni *et al.*, Bioinformatics, 2022: https://doi.org/10.1093/bioinformatics/btac526<br />
Ansaloni *et al.*, BMC Bioinformatics, 2019: https://doi.org/10.1186/s12859-019-3088-7