bioturing / hera

MIT License
71 stars 9 forks source link

Developed by BioTuring (www.bioturing.com), hera is a bioinformatics tool that helps analyze RNA-seq data. With a single command line, hera provides:

Each process in hera was carefully organized and optimized in order to maximize the performance in term of time and accuracy. Hera quantification algorithm obtained the best ranking in a recent round of the SMC-RNA DREAM challenge: https://www.synapse.org/#!Synapse:syn2813589/wiki/423306

Example data

We using Sim41 dataset from Synapse Dream Challenge SMC-RNA round 4 (https://www.synapse.org/SMC_RNA), which contains 60 million read pairs with length is 100bp (gzipped input). The test was done on a 32-core machine running Ubuntu 14.04. The result is shown in the table below:

Transcriptome Transcriptome + Genome
Alignment
Mapped read 93.3821% 93.3851%
Memory 6.2GB 23.7GB
Abundance estimation results
Spearman 0.92454 0.92537
Pearson 0.94428 0.94428

Core algorithm

Alignment

In hera, alignment starts with a hash-based approach that is applied on all the reads to anchor gene fragments. Then, Needleman–Wunsch algorithm is used to fill in the gaps between these anchored seeds. With this approach, the mapping time is reduced without hurting the precision. An additional conversion of transcriptome takes place to generate genome coordinates from the original transcriptome coordinates. This step provides a much better accuracy for splicing detection than mapping data onto a reference genome.

In another hand, hera is still able to perform the common genome mapping due to the incompletion of available transcriptomics. Any reads that cannot be mapped properly on the transcriptome will be remapped to the genome later. The procedure for this case is the same as the transcriptome mapping except the hash-based method is replaced with the Burrow-Wheeler Transform algorithm.

Abundance estimation

Expectation–maximization algorithm is optimized with the SQUAREM procedure (Varadhan, R. & Roland, C. Scand. J. Stat. 35, 335–353 (2008)).

Build requirements:

Install:

  1. git clone https://github.com/bioturing/hera.git
  2. cd hera/
  3. chmod +x build.sh
  4. ./build.sh

For Linux users, hera can be easily installed by using bioconda:

  conda config --add channels bioconda
  conda install hera

Usage:

INDEX:

  ./hera_build
          --fasta genome_sequence.fa (text file only)
          --gtf annotation_file.gtf
          --outdir path/to/output_directory
 [OPTIONAL]
          --full_index 0: None, 1: index full genome
          --grch38     0: No, 1: Yes

By default, hera needs ~8GB for transcriptome indexing only. Full genome indexing needs ~30GB. You also can download indexed human genome file here: GRCh37.75, GRCh38.82

If you are running with GRCh38 human genome included ALT contigs, you should let hera know by defining --grch38 1.

RUN:

  ./hera quant [arguments]

  Required arguments:
    -1 <read-files>    Input left read-files, separated by space
    -2 <read-files>    Input right read-files, separated by space
                       (using -1 only if quantify for single-end reads)
    -i <STRING>        path to hera index directory

  Optional arguments:
    -o <STRING>        Output directory (default: ./)
    -t <INT>           Number of threads (default: 1)
    -b <INT>           Number of bootstrap samples (default: 0)
    -f <genome-file>   Genome mapping, need full index to use this option
                       (if not define, genome mapping will be ignore)
    -p <STRING>        Output prefix (default: '')
    -w                 Output bam file
    -z <INT>           Bam compress level (1 - 9) (default: -1)
    -v                 Verbose mode
    -h                 Print help

  Example: 
  ./hera quant -i hera_index/ -w -t 32 -b 100 -o hera_output/
    -1 read_1.fq.gz -2 read_2.fq.gz
  (Output bam file, 32 threads, 100 bootstrap samples, paired-end mode)

  ./hera quant -i hera_index/ -t 32 -f GRCh37_75_homo_sapiens.fa
    -o hera_output/ -1 read_lane_1.fq.gz read_lane_2.fq.gz
  (No bam, 32 threads, genome mapping, single-end mode with multiple file)
  1. Hera index directory: Directory contain index file from previous index step

  2. Genome fasta file: If not defined, genome mapping will be ignore. Mapping on transcriptome needs ~8BG, but mapping with genome needs ~30GB.

  3. Output file include:

    • abundance.tsv : Transcripts abundance estimation (tsv file)
    • abundance.h5 : Transcripts abundance estimation and boostrapping result (hdf5 file)
    • transcript.bam : Alignment result
  4. In the built-from-source version, reading from multiple read files from multiple lanes is supported, files in different lanes are separated by space. Example: hera quant -i index/ -t 32 -1 read_lane1_1.fastq read_lane2_1 -2 read_lane1_2 fastq,read_lane2_2

Third-party

hera includes some third-patry software:

Contacts

Please report any issues directly to the github issue tracker. Also, you can send your feedback to sonpham@bioturing.com

Contributions

BioTuring Algorithm Team & Thao Truong, Khoa Nguyen, Tuan Tran, Thang Tran and Son Pham

License

MIT license

Copyright (c) BioTuring Inc. 2017 All rights reserved. This Hera 1.0 version is freely accessible for both academic and industry users.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.