amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

bug (?): unexpected results with enforced interval size greater than default #162

Open eboyden opened 1 year ago

eboyden commented 1 year ago

When I aligned a pair of fastqs using BWA-MEM2, Bowtie2, or SNAP, and an enforced interval size of 0-500, I got similar results. When I aligned with BWA-MEM2 or Bowtie2 and an enforced interval size of 50-500, the results didn't change much; very few on-target alignments had intervals <50 bp. But when I aligned with SNAP and an enforced (using -fs) interval size of 50-500, many fewer reads aligned. Furthermore, here is an example of a read pair that no longer aligned with a minimum interval of 50 bp, despite the fact that it's a 120 bp properly paired interval:

MN01688:12:000H3MG5N:1:23106:10622:10676        99      chr1    215878739       70      119M    =       215878739       120     AAAAATCATAGTCACCTTCTCTTACCTCAAATTAGGTCCATTTGGCTTGGATGGTGGTTGCCAAGAAATCACAACATATGATTCACTTAGTGGAATCACAGACAATGGGCCAACATTCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF LB:Z:lb        MC:Z:120M       MD:Z:119        PG:Z:SNAP       RG:Z:rg   PL:Z:ILLUMINA   NM:i:0  SM:Z:sm        MQ:i:70 UQ:i:0  QS:i:4389       PU:Z:pu   ms:i:4389
MN01688:12:000H3MG5N:1:23106:10622:10676        147     chr1    215878739       70      120M    =       215878739       -120    AAAAATCATAGTCACCTTCTCTTACCTCAAATTAGGTCCATTTGGCTTGGATGGTGGTTGCCAAGAAATCACAACATATGATTCACTTAGTGGAATCACAGACAATGGGCCAACATTCTG        FFFFFFFFFFFFFFFFFFFF=FFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF        LB:Z:lb        MC:Z:119M       MD:Z:120        PG:Z:SNAP       RG:Z:rg   PL:Z:ILLUMINA   NM:i:0  SM:Z:sm        MQ:i:70 UQ:i:0  QS:i:4388       PU:Z:pu   ms:i:4388

I also recapitulated this on the NA12878 data NIST7035_TAAGGCGA; the alignment rate fell from 97% to 59%. And it only seems to happen with the -fs option.

samtools flagstat reports: -s 0 500:

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
3114693 + 0 duplicates
3114693 + 0 primary duplicates
39079698 + 0 mapped (98.36% : N/A)
39079698 + 0 primary mapped (98.36% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
38521349 + 0 properly paired (96.96% : N/A)
38615594 + 0 with itself and mate mapped
464104 + 0 singletons (1.17% : N/A)
19728 + 0 with mate mapped to a different chr
14333 + 0 with mate mapped to a different chr (mapQ>=5)

-s 0 500 -fs (similar results):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2899310 + 0 duplicates
2899310 + 0 primary duplicates
38627431 + 0 mapped (97.22% : N/A)
38627431 + 0 primary mapped (97.22% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
38533349 + 0 properly paired (96.99% : N/A)
38556972 + 0 with itself and mate mapped
70459 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

-s 50 500 (similar results):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
3144369 + 0 duplicates
3144369 + 0 primary duplicates
39074551 + 0 mapped (98.35% : N/A)
39074551 + 0 primary mapped (98.35% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
23181753 + 0 properly paired (58.35% : N/A)
38607952 + 0 with itself and mate mapped
466599 + 0 singletons (1.17% : N/A)
333828 + 0 with mate mapped to a different chr
47673 + 0 with mate mapped to a different chr (mapQ>=5)

-s 50 500 -fs (abnormally low alignment rate):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
1658040 + 0 duplicates
1658040 + 0 primary duplicates
23583000 + 0 mapped (59.36% : N/A)
23583000 + 0 primary mapped (59.36% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
23344882 + 0 properly paired (58.76% : N/A)
23512538 + 0 with itself and mate mapped
70462 + 0 singletons (0.18% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

For comparison: Bowtie2 --very-fast-local -I 0 --no-discordant (default):

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2509954 + 0 duplicates
2509954 + 0 primary duplicates
39685464 + 0 mapped (99.89% : N/A)
39685464 + 0 primary mapped (99.89% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
39514079 + 0 properly paired (99.45% : N/A)
39653236 + 0 with itself and mate mapped
32228 + 0 singletons (0.08% : N/A)
13340 + 0 with mate mapped to a different chr
8318 + 0 with mate mapped to a different chr (mapQ>=5)

Bowtie2 --very-fast-local -I 50 --no-discordant:

39731146 + 0 in total (QC-passed reads + QC-failed reads)
39731146 + 0 primary
0 + 0 secondary
0 + 0 supplementary
2510277 + 0 duplicates
2510277 + 0 primary duplicates
39685267 + 0 mapped (99.88% : N/A)
39685267 + 0 primary mapped (99.88% : N/A)
39731146 + 0 paired in sequencing
19865573 + 0 read1
19865573 + 0 read2
39382600 + 0 properly paired (99.12% : N/A)
39652842 + 0 with itself and mate mapped
32425 + 0 singletons (0.08% : N/A)
19408 + 0 with mate mapped to a different chr
8593 + 0 with mate mapped to a different chr (mapQ>=5)
eboyden commented 1 year ago

Reviewing this a few months later, I noticed that although the alignment rate with -s 50 500 is high (98.35%), the "properly paired" rate is low (only 58.35%), which suggests that nearly half of the alignments have short intervals. So adding -fs to the command is simply preventing those "improperly paired" alignments from occurring. So the problem seems to be with the minimum insert size, not the -fs option.

bolosky commented 1 year ago

I wonder if SNAP computes the interval differently than the other aligners. I'll take a look at it and see what's up.

eboyden commented 1 year ago

Yeah I think you're right - I just took another look at alignments with either -s 0 500 -fs or -s 50 500 -fs, and in the latter case not only did the number of alignments with intervals >=50 sharply decrease, I also still observed plenty of alignments with intervals <50. Note that I'm basing interval size on column 9 of the sam file (TLEN).

bolosky commented 1 year ago

I'll figure it out and try to harmonize what we're doing with the other aligners.