dcjones / quip

Compressing next-generation sequencing data with extreme prejudice.
http://www.cs.washington.edu/homes/dcjones/quip/
BSD 3-Clause "New" or "Revised" License

Fix mate sequence when reading sam/bam #31

Open jbedo opened 2 years ago

jbedo commented 2 years ago

Previously, when reading sam/bam, the case tid == mtid was detected and the mate sequence name was mapped to "=". This caused missing sequence name errors upon decompression. Since the tid == mtid case is already handled when writing sam/bam, this patch simply records the full mate sequence name on reading, which resolves the matching issues.
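To illustrate the idea, here is a minimal Python sketch of the post-patch behaviour: the reader stores the full mate sequence name, and the "=" shorthand is produced only by the writer. This is not quip's actual code (quip is written in C); the names `REF_NAMES`, `mate_name_on_read` and `rnext_on_write`, and the second reference name `plasmid1`, are hypothetical and exist only for this example.

```python
# Hypothetical reference table mapping a target id (tid) to a sequence name,
# in the spirit of a BAM header. "plasmid1" is made up for illustration.
REF_NAMES = ["SL1344", "plasmid1"]


def mate_name_on_read(mtid):
    """Reader side: always keep the full mate sequence name."""
    if mtid < 0:
        return "*"  # mate reference is unavailable
    return REF_NAMES[mtid]


def rnext_on_write(rname, mate_name):
    """Writer side: collapse to "=" only when the mate shares the read's reference."""
    if mate_name != "*" and mate_name == rname:
        return "="
    return mate_name


stored = mate_name_on_read(0)            # the full name "SL1344" is stored
print(rnext_on_write("SL1344", stored))  # prints "="
```

Keeping the shorthand out of the stored data means the decompressor never has to resolve "=" back into a sequence name.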

Example read after decompression, before the patch:

SL1344_1_530_0:0:0_0:0:0_6c9    163     SL1344  1       60      70M     *       461     530     AGAGATTACGTCTGGTTGCAAGAGATCATGACAGGGGGAATTGGTTGAAAATAAATATATCGCCAGCAGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       MQ:i:60 AS:i:70 RG:Z:mysample1  NM:i:0  MC:Z:70M        MD:Z:70 ms:i:2800       XS:i:0

and after the patch:

SL1344_1_530_0:0:0_0:0:0_6c9    163     SL1344  1       60      70M     =       461     530     AGAGATTACGTCTGGTTGCAAGAGATCATGACAGGGGGAATTGGTTGAAAATAAATATATCGCCAGCAGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII       MQ:i:60 AS:i:70 RG:Z:mysample1  NM:i:0  MC:Z:70M        MD:Z:70 ms:i:2800       XS:i:0
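The difference between the two records is the seventh column (RNEXT). One plausible way to spot affected reads is sketched below, assuming whitespace-separated SAM fields; the helper `missing_mate_name` is hypothetical, and the two inputs are the first nine mandatory fields of the records above, with SEQ, QUAL and the optional tags omitted for brevity.

```python
def missing_mate_name(sam_line):
    """Return True if a paired read with a mapped mate has RNEXT set to "*"."""
    fields = sam_line.split()
    flag = int(fields[1])
    rnext = fields[6]                 # seventh mandatory SAM field
    paired = bool(flag & 0x1)         # template has multiple segments
    mate_unmapped = bool(flag & 0x8)  # next segment is unmapped
    return paired and not mate_unmapped and rnext == "*"


# First nine fields of the records shown above (SEQ, QUAL and tags omitted).
pre_patch = "SL1344_1_530_0:0:0_0:0:0_6c9 163 SL1344 1 60 70M * 461 530"
post_patch = "SL1344_1_530_0:0:0_0:0:0_6c9 163 SL1344 1 60 70M = 461 530"

print(missing_mate_name(pre_patch))   # prints True
print(missing_mate_name(post_patch))  # prints False
```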