COMBINE-lab / pufferfish

An efficient index for the colored, compacted, de Bruijn graph
GNU General Public License v3.0

Puffaligner doesn't map read pairs to different references #23

Open apcamargo opened 3 years ago

apcamargo commented 3 years ago

Hi,

There are some applications where it's important to identify read pairs whose mates map to different references. Even though Puffaligner maps reads independently ("(…) we consider the chaining and chain filtering for each end of the read separately."), I couldn't find any pair consisting of mates that map to different references.

In comparison, Bowtie2 maps ≈ 1.6% of the read pairs to different references with the same inputs.

fataltes commented 3 years ago

Hi @apcamargo,

Thank you for your post. However, I am not sure if I understand the request clearly. Would you mind explaining a little bit more?

apcamargo commented 3 years ago

Sure, @fataltes!

Here's Puffaligner's (using --bestStrata) samtools flagstat output:

214688504 + 0 in total (QC-passed reads + QC-failed reads)
50488220 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
125913389 + 0 mapped (58.65% : N/A)
164200284 + 0 paired in sequencing
82100142 + 0 read1
82100142 + 0 read2
83360444 + 0 properly paired (50.77% : N/A)
83360444 + 0 with itself and mate mapped
6101721 + 0 singletons (3.72% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Here's Bowtie2's (using -k 15):

241492571 + 0 in total (QC-passed reads + QC-failed reads)
77292287 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
159016623 + 0 mapped (65.85% : N/A)
164200284 + 0 paired in sequencing
82100142 + 0 read1
82100142 + 0 read2
74243714 + 0 properly paired (45.22% : N/A)
77436030 + 0 with itself and mate mapped
4288306 + 0 singletons (2.61% : N/A)
2489036 + 0 with mate mapped to a different chr
2027014 + 0 with mate mapped to a different chr (mapQ>=5)

Puffaligner's "with mate mapped to a different chr" count is 0, meaning that there are no pairs whose reads mapped to different references.

Essentially, I'm interested in alignments where the 7th field (RNEXT) is not "=", for example:

HISEQ13:355:CBN0FANXX:7:1101:17319:1971 97  k147_2000503    17  38  150M    k147_584177 66  0   CGGCGGACTAAGGCTCTATAATTTCAATTTTTCACCAGACTAAGTAATCCATGAAGAAACTCATTGCAGCACTGGCTTCCAGTGTTCTGGTGATGTCCGCCGCCGTCGCCCAGACGCTGCCGGCGCCGACCATCGCCGCCAAATCGTGGC  =ABBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGFGG>GEGGGGGGDGFGGCGDGDGGGGG<DGGGGGGGBGGGGGGGGGGGGGGGGGGGGGGGGGG@  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:150    YT:Z:UP
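
To make the criterion concrete, here is a minimal Python sketch of the filter I have in mind. It assumes plain tab-separated SAM text as input; the function names `mate_on_other_reference` and `discordant_reference_records` are hypothetical helpers written for this example and are not part of Puffaligner, Bowtie2, or samtools.

```python
def mate_on_other_reference(sam_line: str) -> bool:
    """Return True if this SAM record and its mate are both mapped,
    but to different reference sequences."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        # Not a complete alignment record (SAM has 11 mandatory fields).
        return False
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    # 0x1: read is paired, 0x4: read unmapped, 0x8: mate unmapped
    if not flag & 0x1 or flag & 0x4 or flag & 0x8:
        return False
    # "=" means the mate is on the same reference; "*" means no information.
    return rnext not in ("=", "*") and rnext != rname


def discordant_reference_records(lines):
    """Yield the SAM records whose mate maps to a different reference."""
    for line in lines:
        if line.startswith("@"):
            continue  # skip header lines
        if mate_on_other_reference(line):
            yield line
```

Applied to the record above (flag 97, RNAME `k147_2000503`, RNEXT `k147_584177`), `mate_on_other_reference` returns `True`. Note that this sketch does not exclude secondary alignments, so its counts may not match the samtools flagstat numbers exactly.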