Closed mikelove closed 8 years ago
Hi @mikelove,
Indeed; this seems wrong. Since we don't compute a full alignment here, the notion of which alignment is "primary" for multimapping reads is somewhat arbitrary. However, both mates in a non-orphaned alignment (as these seem to be) should be marked as primary. I think this slipped detection thus far since we don't really end up making use of the SAM flags in Sailfish / Salmon (since we interface with RapMap directly and the relevant information is coded in the QuasiMapping
structure). I'm guessing that 147
should be the right flag for the first two of these alignments, right? I can fix this and cut a new version.
--Rob
I should confess I'm no expert on what the SAM flags should be. I always have to go to https://broadinstitute.github.io/picard/explain-flags.html
I only know when I get something discordant between mates because the Bioconductor pairing function (within GenomicAlignments's readGAlignmentPairs) gives me some indication. This function is picky about what it considers a "GAlignmentPair":
2 records rec1 and rec2 are considered mates iff all the following conditions are satisfied: – (A) QNAME(rec1) == QNAME(rec2) – (B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1) – (C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1) – (D) Flag bit 0x20 of rec1 == Flag bit 0x10 of rec2 and Flag bit 0x20 of rec2 == Flag bit 0x10 of rec1 – (E) rec1 corresponds to the first segment in the template and rec2 corresponds to the last segment in the template, OR, rec2 corresponds to the first segment in the template and rec1 corresponds to the last segment in the template – (F) rec1 and rec2 have same flag bit 0x2 – (G) rec1 and rec2 have same flag bit 0x100
where for these mates above G is not true.
I think my confusion / the problem is coming from this requirement in the SAM spec:
For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’. This line is called the primary line of the read.
I was taking this as a mandate that 0x900
must be set for all but one line for each read. However, the problem is that if the mate of the primary alignment has flag 147
(i.e. 0x93
in hex), then 0x93 & 0x900 == 0
, so that the above condition is not fulfilled. Perhaps the document means that one and only one alignment, rather than one and only one line?
The spec is intended. Note that one read pair has two reads. Each read should have one and only one line with '(FLAG&0x900)==0'.
@lh3 — Great; thanks for clearing this up! I'm so used to thinking of reads from the same fragment as "one thing", this distinction has escaped me.
Hi @mikelove,
Can you check the new release v0.2.1 and see if this resolves the problem you're encountering? I think I'm not interpreting the spec (more) correctly, thanks to @lh3! I also fixed another small bug that affected the reversal of the quality strings.
With v0.2.1, the lines are:
ERR188021.24 99 ENSG00000254244 790 1 75M = 845 130 TTGATGGGCTCTACTTCTGATCTTGGTCACTGTGAAAAAATCAAGAAGGCCTGTGGAAATTTTGGCATTCCATGT @CCFDFFFHHHHDIJJIJJJIGJIJJHIIJIJHGGIJJJJCGIIJGHGGIHIJDGIJIEIIIIJFHIJJHHHHHE NH:i:1
ERR188021.24 147 ENSG00000254244 845 1 75M = 790 130 GAAATTTTGGCATTCCATGTGAACTTCGAGTAACATCTGCGCATAAAGGACCAGATGAAACTCTGAGGATTAAAG HHHGIGGJIGGJIJIGGIJIIHHFCGJGIGFJIIJII~HGIIJIJJJJIHJHGGJIJJIJIJHHHHFFFFFFCCC NH:i:1
ERR188021.25 99 ENSG00000198211 2205 1 75M = 2354 224 CGCCCCGCGGCCTGAAGATGTCGGCCACCTTCATCGGCAACAGCACGGCCATCCAGGAGCTGTTCAAGCGCATCT @@CFFDDDHFDHHEGAEEGGFCDEIGIJCFHHD@DH8BHFBFC@@@?>@B:>AACDCB@8AC::CDDD:8;555> NH:i:1
ERR188021.25 147 ENSG00000198211 2354 1 75M = 2205 224 GAGTTCACCGAGGCCGAGAGCAACATGAACGACCTGGTGTCCGAGTACCAGCAGTACCAGGACGCCACGGCCGAC CC@CD@?DBADDBDCCC@CB@DDCCDEB:;ECHGGFF~ABBFBBBB?FAHEHEFD:DDEEDFF<DC8D?DDD?@@ NH:i:1
ERR188021.33 83 ENSG00000198727 952 1 75M = 916 111 CGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAGTCGGAGGACAACCA DBA9@@>@ACFDBEHEC=;EECHAE@7>GEIIIHFIG~IIFBFCJIIGHGHGHHEHGGIHGGGHFFHDDDDD<@@ NH:i:1
ERR188021.33 163 ENSG00000198727 916 1 75M = 952 111 ATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCA B@@FFDFFGDHAFHIGBGIGGIGGHEGIIJDHJJJJHEIIJIFIGIGJIJGHGGGIGGGGGGHGHGGGIJIHHFF NH:i:1
Looks more like it. And the mate pairing function in Bioconductor is happy.
Thanks for the quick fix!
Also I just noticed in v0.2.1 the sequence is fixed (-1 of previous position) for the minus strand reads.
> subseq(dna[["ENSG00000254244"]], 845, 845+74)
75-letter "DNAString" instance
seq: GAAATTTTGGCATTCCATGTGAACTTTGAGTAACATCTGCGCATAAAGGACCAGATGAAACTCTGAGGGTTAAAG
> subseq(dna[["ENSG00000198211"]], 2354, 2354+74)
75-letter "DNAString" instance
seq: GAGTTCACCGAGGCCGAGAGCAACATGAACGACCTGGTGTCCGAGTACCAGCAGTACCAGGACGCCACGGCCGAG
> subseq(dna[["ENSG00000198727"]], 952, 952+74)
75-letter "DNAString" instance
seq: CGCCCACTAAGCCAATCACTTTATTGACTCCTAGCCGCAGACCTCCTCATTCTAACCTGAATCGGAGGACAACCA
This is good, right?
Yes, all good on my end!
hi,
I ran the following, using RapMap 0.2.0:
And get the following output, where it seems that the second read in the pair always has the "not primary alignment" flag:
Maybe I'm missing something obvious!