DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
472 stars 116 forks source link

Non-random assignment of multi-mapping reads #59

Closed FelixKrueger closed 8 years ago

FelixKrueger commented 8 years ago

Hi Daewhan,

We have encountered a scenario where reads aligning to two tandemly repeated genes (here in the yeast genome) get preferentially placed to only one of the copies (top data track, left), see here:

tandem_repeats In this example it would appear that one gene is 4x higher expressed than the other which is probably wrong, Tophat seems to get a 50/50 split right (bottom data track). We just checked the genome sequence again and the regions really share 100% identity.

This imbalance may also be found at other (repetitive) places, so we were wondering if this is intentional or whether there was any fix for that?

Many thanks, Felix

infphilo commented 8 years ago

Thank you for bringing this issue to our attention. This is not the way HISAT2 is supposed to work. We want 50/50 split of reads. It looks like it may involve something other than just two repetitive sequences in the perspective of the alignment algorithms of HISAT2. Would it be possible for you to provide me a set of reads so that I can reproduce and fix this non-random issue?

FelixKrueger commented 8 years ago

Hi Daewhan,

I have now extracted the reads from the regions interest using samtools viewfull_R64-2-1_hisat2_sorted.bam chrVIII:212100-212200 and then converted them to FastQ using Picard's SamToFastq. After re-alignment of the FastQ file interestingly the aligned reads were mostly shifted to the right gene which further shows that something weird must be going on (also some of the extracted reads e.g. in the middle section do no longer align with HISAT2 but that might be a different issue). Tophat got the 50/50 split right again.

region_extracted

I have set up an FTP site for you to play with under http://ftp2.babraham.ac.uk/ftpusr63/ (do email me if you need FTP access rather than download only). The site contains the yeast reference genome as .fa, the BAM that was extracted from the previous HISAT2 alignment (region.bam), the extracted FastQ file (region.fastq), and the subsequent alignments of this FastQ file using our HISAT2 or Tophat Clusterflow pipeline (please note that HISAT2 was run with --sp 1000,1000, the command line should be in the @PG header). Let me know if there is anything else I can assist you with. Best wishes, Felix

infphilo commented 8 years ago

Thank you again for providing the test data set. I have fixed this issue, with the fix available in the master branch.

This has to do with some routines I implemented to deal with a mapping ambiguity problem due to pseudogenes. When a read is equally mapped to two distinct locations, HISAT2 prefers the one that is close to splice sites to the other that has no splice sites nearby. There was a minor bug that accounts for this non-randomness assignment, which I fixed.

Please let me know if you still encounter this issue.

FelixKrueger commented 8 years ago

Great stuff, this worked like a charm! Great turn-around time as well, many thanks! Best, Felix