COMBINE-lab / RapMap

Rapid sensitive and accurate read mapping via quasi-mapping
GNU General Public License v3.0
89 stars 23 forks source link

quasimapping, second read always gets not primary flag #21

Closed mikelove closed 8 years ago

mikelove commented 8 years ago

hi,

I ran the following, using RapMap 0.2.0:

rapmap quasiindex -i quasiindex -t transcripts.fa
rapmap quasimap -o maps.sam -1 <(zcat ERR188021_1.fastq.gz | head -16000000) -2 <(zcat ERR188021_2.fastq.gz | head -16000000) -i quasiindex

And get the following output, where it seems that the second read in the pair always has the "not primary alignment" flag:

ERR188021.24    99      ENSG00000254244 790     1       75M     =       845     130     TTGATGGGCTCTACTTCTGATCTTGGTCACTGTGAAAAAATCAAGAAGGCCTGTGGAAATTTTGGCATTCCATGT     @CCFDFFFHHHHDIJJIJJJIGJIJJHIIJIJHGGIJJJJCGIIJGHGGIHIJDGIJIEIIIIJFHIJJHHHHHE     NH:i:1
ERR188021.24    2451    ENSG00000254244 845     1       75M     =       790     130     AAATTTTGGCATTCCATGTGAACTTCGAGTAACATCTGCGCATAAAGGACCAGATGAAACTCTGAGGATTAAAGA     HHGIGGJIGGJIJIGGIJIIHHFCGJGIGFJIIJIIHHGIIJIJJJJIHJHGGJIJJIJIJHHHHFFFFFFCCC~     NH:i:1
ERR188021.25    99      ENSG00000198211 2205    1       75M     =       2354    224     CGCCCCGCGGCCTGAAGATGTCGGCCACCTTCATCGGCAACAGCACGGCCATCCAGGAGCTGTTCAAGCGCATCT     @@CFFDDDHFDHHEGAEEGGFCDEIGIJCFHHD@DH8BHFBFC@@@?>@B:>AACDCB@8AC::CDDD:8;555>     NH:i:1
ERR188021.25    2451    ENSG00000198211 2354    1       75M     =       2205    224     AGTTCACCGAGGCCGAGAGCAACATGAACGACCTGGTGTCCGAGTACCAGCAGTACCAGGACGCCACGGCCGACA     C@CD@?DBADDBDCCC@CB@DDCCDEB:;ECHGGFF@ABBFBBBB?FAHEHEFD:DDEEDFF<DC8D?DDD?@@~     NH:i:1
ERR188021.33    83      ENSG00000198727 952     1       75M     =       916     111     GCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAGTCGGAGGACAACCAA     BA9@@>@ACFDBEHEC=;EECHAE@7>GEIIIHFIGIIIFBFCJIIGHGHGHHEHGGIHGGGHFFHDDDDD<@@~     NH:i:1
ERR188021.33    2467    ENSG00000198727 916     1       75M     =       952     111     ATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCA     B@@FFDFFGDHAFHIGBGIGGIGGHEGIIJDHJJJJHEIIJIFIGIGJIJGHGGGIGGGGGGHGHGGGIJIHHFF     NH:i:1
...

Maybe I'm missing something obvious!

rob-p commented 8 years ago

Hi @mikelove,

Indeed; this seems wrong. Since we don't compute a full alignment here, the notion of which alignment is "primary" for multimapping reads is somewhat arbitrary. However, both mates in a non-orphaned alignment (as these seem to be) should be marked as primary. I think this slipped detection thus far since we don't really end up making use of the SAM flags in Sailfish / Salmon (since we interface with RapMap directly and the relevant information is coded in the QuasiMapping structure). I'm guessing that 147 should be the right flag for the first two of these alignments, right? I can fix this and cut a new version.

--Rob

mikelove commented 8 years ago

I should confess I'm no expert on what the SAM flags should be. I always have to go to https://broadinstitute.github.io/picard/explain-flags.html

I only know when I get something discordant between mates because the Bioconductor pairing function (within GenomicAlignments's readGAlignmentPairs) gives me some indication. This function is picky about what it considers a "GAlignmentPair":

2 records rec1 and rec2 are considered mates iff all the following conditions are satisfied: – (A) QNAME(rec1) == QNAME(rec2) – (B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1) – (C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1) – (D) Flag bit 0x20 of rec1 == Flag bit 0x10 of rec2 and Flag bit 0x20 of rec2 == Flag bit 0x10 of rec1 – (E) rec1 corresponds to the first segment in the template and rec2 corresponds to the last segment in the template, OR, rec2 corresponds to the first segment in the template and rec1 corresponds to the last segment in the template – (F) rec1 and rec2 have same flag bit 0x2 – (G) rec1 and rec2 have same flag bit 0x100

where for these mates above G is not true.

rob-p commented 8 years ago

I think my confusion / the problem is coming from this requirement in the SAM spec:

For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read.

I was taking this as a mandate that 0x900 must be set for all but one line for each read. However, the problem is that if the mate of the primary alignment has flag 147 (i.e. 0x93 in hex), then 0x93 & 0x900 == 0, so that the above condition is not fulfilled. Perhaps the document means that one and only one alignment, rather than one and only one line?

lh3 commented 8 years ago

The spec is intended. Note that one read pair has two reads. Each read should have one and only one line with '(FLAG&0x900)==0'.

rob-p commented 8 years ago

@lh3 — Great; thanks for clearing this up! I'm so used to thinking of reads from the same fragment as "one thing", this distinction has escaped me.

rob-p commented 8 years ago

Hi @mikelove,

Can you check the new release v0.2.1 and see if this resolves the problem you're encountering? I think I'm not interpreting the spec (more) correctly, thanks to @lh3! I also fixed another small bug that affected the reversal of the quality strings.

mikelove commented 8 years ago

With v0.2.1, the lines are:

ERR188021.24    99      ENSG00000254244 790     1       75M     =       845     130     TTGATGGGCTCTACTTCTGATCTTGGTCACTGTGAAAAAATCAAGAAGGCCTGTGGAAATTTTGGCATTCCATGT     @CCFDFFFHHHHDIJJIJJJIGJIJJHIIJIJHGGIJJJJCGIIJGHGGIHIJDGIJIEIIIIJFHIJJHHHHHE     NH:i:1
ERR188021.24    147     ENSG00000254244 845     1       75M     =       790     130     GAAATTTTGGCATTCCATGTGAACTTCGAGTAACATCTGCGCATAAAGGACCAGATGAAACTCTGAGGATTAAAG     HHHGIGGJIGGJIJIGGIJIIHHFCGJGIGFJIIJII~HGIIJIJJJJIHJHGGJIJJIJIJHHHHFFFFFFCCC     NH:i:1
ERR188021.25    99      ENSG00000198211 2205    1       75M     =       2354    224     CGCCCCGCGGCCTGAAGATGTCGGCCACCTTCATCGGCAACAGCACGGCCATCCAGGAGCTGTTCAAGCGCATCT     @@CFFDDDHFDHHEGAEEGGFCDEIGIJCFHHD@DH8BHFBFC@@@?>@B:>AACDCB@8AC::CDDD:8;555>     NH:i:1
ERR188021.25    147     ENSG00000198211 2354    1       75M     =       2205    224     GAGTTCACCGAGGCCGAGAGCAACATGAACGACCTGGTGTCCGAGTACCAGCAGTACCAGGACGCCACGGCCGAC     CC@CD@?DBADDBDCCC@CB@DDCCDEB:;ECHGGFF~ABBFBBBB?FAHEHEFD:DDEEDFF<DC8D?DDD?@@     NH:i:1
ERR188021.33    83      ENSG00000198727 952     1       75M     =       916     111     CGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAGTCGGAGGACAACCA     DBA9@@>@ACFDBEHEC=;EECHAE@7>GEIIIHFIG~IIFBFCJIIGHGHGHHEHGGIHGGGHFFHDDDDD<@@     NH:i:1
ERR188021.33    163     ENSG00000198727 916     1       75M     =       952     111     ATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCA     B@@FFDFFGDHAFHIGBGIGGIGGHEGIIJDHJJJJHEIIJIFIGIGJIJGHGGGIGGGGGGHGHGGGIJIHHFF     NH:i:1

Looks more like it. And the mate pairing function in Bioconductor is happy.

Thanks for the quick fix!

mikelove commented 8 years ago

Also I just noticed in v0.2.1 the sequence is fixed (-1 of previous position) for the minus strand reads.

 > subseq(dna[["ENSG00000254244"]], 845, 845+74)
   75-letter "DNAString" instance
 seq: GAAATTTTGGCATTCCATGTGAACTTTGAGTAACATCTGCGCATAAAGGACCAGATGAAACTCTGAGGGTTAAAG
 > subseq(dna[["ENSG00000198211"]], 2354, 2354+74)
   75-letter "DNAString" instance
 seq: GAGTTCACCGAGGCCGAGAGCAACATGAACGACCTGGTGTCCGAGTACCAGCAGTACCAGGACGCCACGGCCGAG
 > subseq(dna[["ENSG00000198727"]], 952, 952+74)
   75-letter "DNAString" instance
 seq: CGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAATCGGAGGACAACCA
rob-p commented 8 years ago

This is good, right?

mikelove commented 8 years ago

Yes, all good on my end!