DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
480 stars 119 forks source link

hisat2 does not handle rRNA reads well #43

Open bwlang opened 8 years ago

bwlang commented 8 years ago

I think hisat is not handling ribosomal RNA reads very well (see image). Neither are Star or Tophat2. Tophat also finds a bunch of strange splice junctions (top) Hisat is finding a bunch of splice junctions that are probably not there (middle) Star does better on the splice junctions (bottom), but it only seems to map to the first locus it sees.

Hisat is particularly problematic because many more reads are split across the two 45S loci than with the other mappers. Feeding a hisat bam file to Picard's RNASeq Metrics tool produces about 50% rRNA, while Tophat, Star, and k-mer classifier produce ~80% rRNA.

This may be compounded by the fact that the genome reference is not complete for rRNA regions.

Are there settings I should be changing to handle this situation better? I could disable splicing, but that doesn't make much sense since it would screw up the mapping of the mRNA.

Right now I'm going to try filtering rRNA as pre-mapping step, then mapping with bowtie and merging the bams. I'll attach some downsampled fastq files and my rRNA interval list in case they're of use to you.

spliced_mappers_rrna

bwlang commented 8 years ago

Here is a small set of reads that illustrates the problem. I baited pairs using mirabait and http://hgdownload.cse.ucsc.edu/goldenpath/hg19/snp138Mask/chrUn_gl000220.fa (requiring 6 31mers) in either read, then sampled 50k pairs with ngs-tools Then I aligned the reads to hg19 using the attached refseq gtf as a guide and in directional mode where i could specify these.

undepleted_chrUn_gl000220.1.fastq.gz undepleted_chrUn_gl000220.2.fastq.gz

encode_and_RM_rRNA.merged.interval.gz hg19_genes.gtf.gz

Compared with the larger dataset above, the Tophat "false" splice junctions are less obvious, and star more evenly distributes the reads across the two loci (don't understand this... but that's a different rabbit hole). spliced_mappers_rrna_100k

infphilo commented 8 years ago

Hi @bwlang,

Thank you for your detailed information (and sorry for the late response). It looks like chrUn_gl000220.fa contains many repeats (as shown in lowercase) and sequences of low complexity, which could mainly explain why HISAT2 reports many spliced alignments. Does this issue generally happen across the whole human genome or on this particular sequence (chrUn_gl000220)?

With HISAT2, we can be more conservative regarding spliced alignment using --no-temp-splicesite and --dta-cufflinks options, but they could have negative impacts on the alignment of non-ribosome RNA reads.

Thanks, Daehwan

bwlang commented 8 years ago

Hi Daehwan:

This does occur when mapping reads to the entire human genome, but I don't observe this strange splicing pattern in other loci (though I'm not sure I'd know). ChrUn_gl000220 contains the 45S pre-ribosomal RNA loci. These can be up to 90% of all transcripts in the cell, and individuals can have many repeats of these loci. Both GRCh38 and 37 only represent 45S rRNA loci in an unplaced contig containing 2 repeats. Reads from many other loci are probably compressed onto these two repeats. I will try the --no-temp-splicesite and --dta-cufflinks options to see what I get. Right now I'm happy with mapping these reads via bowtie and doing the rest via Hisat, but it's a bit of a cumbersome workflow. I wonder if Hisat could do a better job by increasing the cost of gap extension (or whatever is similar for hisat). That might get it to use the nearest possible splice site instead of the most closely matching.