DaehwanKimLab / tophat

Spliced read mapper for RNA-Seq
http://ccb.jhu.edu/software/tophat
Boost Software License 1.0
91 stars 46 forks source link

Missing alignments #31

Open yevshin opened 8 years ago

yevshin commented 8 years ago

When the read maps to several places equally well and one of these mappings span exon/intron junction only one alignment is reported (alignment spanning exon/intron junction). I will provide test case: file genome.fa:

1 AAAAAAAAAAGGGGCCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGTTTTTTTTTCCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

file test.gtf:

1 test exon 1 50 . + . exon_id "E1"; exon_number "1"; gene_id "G1"; transcript_id "T1"; tss_id "TSS1"; 1 test exon 60 100 . + . exon_id "E2"; exon_number "2"; gene_id "G1"; transcript_id "T1"; tss_id "TSS1"; 1 test transcript 1 100 . + . gene_id "G1"; transcript_id "T1"; tss_id "TSS1";

file test.fq:

@r1 AAAAAAAAAAGGGGCCCCAAAAAAAAAA + IIIIIIIIIIIIIIIIIIIIIIIIIIII

commands to run:

bowtie2-build genome.fa genome tophat2 --transcriptome-only -G test.gtf genome test.fq

this will produce file tophat_out/accepted_hits.bam containing single alignment:

samtools view tophat_out/accepted_hits.bam r1 0 1 37 50 14M9N14M * 0 0 AAAAAAAAAAGGGGCCCCAAAAAAAAAA IIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:28 YT:Z:UU XS:A:+ NH:i:1

but there should be two alignments. Note also that alignment is reported as unique (score 50 and NH:i:1), and when the user will filter uniquely mapped reads it will get many reads spanning exon/intron junctions that are actually not unique.

I tracked the problem to tophat/src/align_status.cpp:138 where score for alignments spanning exon/intron junctions is increased by 2, subsequently all other alignments will be considered as secondary since they have less score (by 2). I don't understand the reason for re-weighting scores of these alignments, but commenting out this line solves the problem.