rTea (RNA Transposable Element Analyzer)

rTea is a computational method to detect transposon-fusion RNA.


Overview

We developed rTea to detect TE-fusion transcripts from short-read RNA-seq data. To filter out false positives caused by nonspecifically mapped reads, rTea uses multiple features of the aligned reads, such as the base quality of clipped sequences, the percentage of multi-mapped reads, and the matching score of reads to TE sequences.

Demo and result files

Users can try rTea on a demo data set and can check the output at https://gitlab.aleelab.net/junseokpark/rTea-results

Installation

rTea runs on a Linux-based operating system with certain prerequisite software. Here is a list of the software you should install before you start using rTea.

R -e "BiocManager::install(c( \
        'GenomicAlignments', \
        'BSgenome.Hsapiens.UCSC.hg19', \
        'BSgenome.Hsapiens.UCSC.hg38', \
        'EnsDb.Hsapiens.v75', \
        'EnsDb.Hsapiens.v86' \
        ))"

* Download GRCh38 [genome_snp_tran](https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz)
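
The download step can be scripted. The following is a minimal sketch, assuming `curl` and `tar` are installed; the function names `download_index` and `unpack_index` are illustrative and not part of rTea. The layout inside the archive is not described here, so inspect the listed files before pointing rTea at a directory.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of rTea) for fetching the GRCh38 index.

INDEX_URL="https://genome-idx.s3.amazonaws.com/hisat/grch38_snptran.tar.gz"

# Download the archive to the given path; -f makes curl fail on HTTP errors,
# -L follows redirects.
download_index() {
    local archive="$1"
    curl -fL -o "$archive" "$INDEX_URL"
}

# Unpack an archive into a destination directory and list the extracted files.
unpack_index() {
    local archive="$1" dest="$2"
    mkdir -p "$dest" || return 1
    tar -xzf "$archive" -C "$dest" || return 1
    find "$dest" -type f | sort
}
```

For example, `download_index grch38_snptran.tar.gz && unpack_index grch38_snptran.tar.gz ./index` downloads the archive and prints the extracted file paths, which helps you choose the value for `GENOME_SNP_TRAN_DIR`.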

## Use Docker for Installation
Build the Docker image and run ``rTea`` in the Docker container.
```bash
DOCKER_BUILDKIT=1 docker build -t rtea .
```

Use Singularity for Installation

After creating a Docker image for rTea, convert it to Singularity.

docker save -o rTea.tar rtea:latest
singularity build rTea.simg docker-archive://rTea.tar

Running rTea

If you are using Docker as your runtime environment, run the Docker image to execute rTea.

docker run -it -v ${GENOME_SNP_TRAN_DIR}:/app/rTea/hg38/genome_snp_tran rtea bash

If the runtime environment is Singularity, execute the Singularity image to run rTea.

singularity shell -B ${GENOME_SNP_TRAN_DIR}:/app/rTea/hg38/genome_snp_tran \
    rTea.simg

rTea supports paired-end FASTQ files and a BAM file as input. For FASTQ file input, use the following command:

rTea.sh \
        ${R1_FQ}.gz \
        ${R2_FQ}.gz \
        $SAMPLE_NAME \
        $GENOME_SNP_TRAN_DIR \
        $NUMBER_OF_CORES \
        $OUT_DIR \
        hg38 \
        resume
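
Because a run can take a long time, it may help to check the arguments first. The following is a minimal sketch of such a pre-flight check; `check_rtea_inputs` is a hypothetical helper, not part of rTea, and the checks it performs are assumptions about what a run needs rather than documented requirements.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of rTea). Prints a message to
# stderr and returns 1 on the first problem found; returns 0 otherwise.
check_rtea_inputs() {
    local r1="$1" r2="$2" index_dir="$3" cores="$4" out_dir="$5"
    local fq

    # Both FASTQ files must exist and be non-empty.
    for fq in "$r1" "$r2"; do
        if [ ! -s "$fq" ]; then
            echo "missing or empty FASTQ: $fq" >&2
            return 1
        fi
    done

    # The index directory must exist.
    if [ ! -d "$index_dir" ]; then
        echo "index directory not found: $index_dir" >&2
        return 1
    fi

    # The core count must be a positive integer (digits only, no leading zero).
    case "$cores" in
        ''|*[!0-9]*|0*)
            echo "invalid number of cores: $cores" >&2
            return 1
            ;;
    esac

    # Create the output directory if needed and confirm it is writable.
    if ! mkdir -p "$out_dir" || [ ! -w "$out_dir" ]; then
        echo "cannot write to output directory: $out_dir" >&2
        return 1
    fi
    return 0
}
```

Call it with the same FASTQ paths, index directory, core count, and output directory that you pass to `rTea.sh`, and start the run only if it returns 0.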

For BAM file input, please use the following command:

rnatea_pipeline_from_bam \
        ${BAM} + \
        $SAMPLE_NAME \
        $GENOME_SNP_TRAN_DIR \
        $NUMBER_OF_CORES \
        $OUT_DIR \
        hg38

Output file

After running rTea, the user can find a .rTea.txt file in the rTea directory, which contains information about TEs and other supporting data.

| Column | Description |
| --- | --- |
| chr | Chromosome name |
| pos | Fusion breakpoint position on the chromosome |
| ori | Fusion direction on the chromosome (f, TE\|gene; r, gene\|TE) |
| class | TE class |
| seq | Proximal portion of fusion sequence |
| isPolyA | Whether it is a fusion with polyA sequence |
| posRepFamily | Repeat masked repeat family on the breakpoint position |
| posRep | Repeat masked repeat element on the breakpoint position |
| TEfamily | TE family with highest alignment score when fusion sequence is aligned with consensus TE sequence |
| TEscore | Alignment score of fusion sequence with the consensus TE sequence |
| TEside | Fusion direction on the consensus TE sequence (5, TE\|gene; 3, gene\|TE) |
| TEbreak | Fusion breakpoint position on the consensus TE sequence |
| depth | Number of RNA-seq reads on the breakpoint position |
| matchCnt | Number of fusion-supporting RNA-seq reads |
| polyAcnt | Number of polyA reads |
| baseQual | Median base quality of supporting reads |
| lowMapQual | Number of supporting reads that have low mapping quality |
| mateDist | Minimum distance of mate reads |
| overhang | Distance of breakpoint from splice site |
| gap | Length of nearby intron |
| secondary | Proportion of supporting reads that are from secondary alignment |
| nonspecificTE | Mean alignment score of supporting reads to consensus TE sequence |
| r1pstrand | Proportion of supporting reads that are from positive strand of chromosome |
| fusion_tx_id | Transcript ID of the fusion transcript |
| tx_support_exon | Number of read fragments spanning exonic region of the fusion transcript ID |
| tx_support_intron | Number of read gaps matching the fusion transcript ID |
| strand | Strand of fusion transcript |
| pos_type | Genomic region of breakpoint |
| polyTE | Known non-reference TE on the breakpoint position |
| hardstart | Start position of nearby reference genome where fusion sequence came from |
| hardend | End position of nearby reference genome where fusion sequence came from |
| hardTE | Repeat masked TE subfamily of nearby reference genome where fusion sequence came from |
| hardDist | Distance from fusion breakpoint to nearby reference genome where fusion sequence came from |
| fusion_type | Type of TE fusion |
| fusion_tx_biotype | Biotype of fusion transcript |
| fusion_gene_id | Gene ID of fusion transcript |
| fusion_gene_name | Gene symbol of fusion transcript |
| Filter | Filter reason of low confidence fusion |
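
The columns above can be used to subset the calls from the command line. The following is a minimal sketch, assuming the .rTea.txt file is tab-delimited with a header row that uses the column names listed above; that file layout is an assumption, and `rtea_filter_min` is a hypothetical helper, not part of rTea. Any threshold passed to it is illustrative rather than a recommended cutoff.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of rTea). Assumes a tab-delimited file with
# a header row whose names match the column list above.

# Print the header plus every row whose <column> value is >= <min>.
rtea_filter_min() {
    local file="$1" column="$2" min="$3"
    awk -F'\t' -v col="$column" -v min="$min" '
        NR == 1 {
            # Locate the requested column by name in the header.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) {
                print "column not found: " col > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        # Keep rows whose value in that column is numerically >= min.
        ($idx + 0) >= (min + 0)
    ' "$file"
}
```

For example, `rtea_filter_min sample.rTea.txt matchCnt 5` would keep calls supported by at least five fusion-supporting reads, where `sample.rTea.txt` stands in for your own output file. Looking columns up by name rather than by position keeps the helper working if the column order differs from the list above.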

Licenses

* MIT
* CC BY-NC 4.0
* GPL v2
* GPL v3

Contacts

Junseok Park, Boram Lee