amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
288 stars 66 forks source link

Low % of properly paired reads #125

Closed carlosmag closed 4 years ago

carlosmag commented 4 years ago

Hi,

I am getting low % of properly paired reads in snap paired comparatively to bwa mem (24.19% vs 99.34%). These are values for trimmed input. Without trimming, I get 0% properly paired reads in samtools flagstat after snap paired mapping.

test genome reference sequence

bolosky commented 4 years ago

Sorry I took so long to get back to you.

What’s going on is that lots of these reads align pretty poorly, and BWA is soft clipping off most of the read to get something that aligns. So, for instance, BWA produces alignments like this (and I didn’t go searching for a bad example, this was just the first place I looked):

ERR2653038.1773924 147 MTB_anc 2939577 60 122S29M = 2939577 -29

Where it clipped 122 of 151 bases to get the read to align.

The current version of SNAP won’t do this. It only soft clips where it’s supposed to do so based on the base call quality string. So, only half of the reads align, and so only about a quarter of the pairs have both halves aligning.

--Bill

From: Carlos Magalhães notifications@github.com Sent: Saturday, June 6, 2020 6:32 PM To: amplab/snap snap@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [amplab/snap] Low % of properly paired reads (#125)

Hi,

I am getting low % of properly paired reads in snap paired comparatively to bwa mem (24.19% vs 99.34%). These are values for trimmed input. Without trimming, I get 0% properly paired reads in samtools flagstat.

test genomehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ebi.ac.uk%2Fena%2Fdata%2Fview%2FERR2653038&data=02%7C01%7Cbolosky%40microsoft.com%7C16975a9f0938476a816008d80a829366%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637270903218099786&sdata=P2Wsgb18XcEjVOkmebI8mZPJrg6DRb80crICsfIVNOs%3D&reserved=0 reference sequencehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Ffigshare.com%2Farticles%2FGenome_of_the_inferred_most_recent_common_ancestor_of_the_Mycobacterium_tuberculosis_complex%2F11500980%2F1&data=02%7C01%7Cbolosky%40microsoft.com%7C16975a9f0938476a816008d80a829366%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637270903218099786&sdata=uQXzA%2F71nr94d4SunXGMN2n7CCYIU0CYs%2BwGDRMF6bM%3D&reserved=0

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F125&data=02%7C01%7Cbolosky%40microsoft.com%7C16975a9f0938476a816008d80a829366%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637270903218099786&sdata=GbBji01m%2BKB7OnNzW8MHJRl0j1IDll%2BJGJ7EH4lZneM%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWOVGRWB3XMXEYWQ2YLRVLUZBANCNFSM4NWTETEA&data=02%7C01%7Cbolosky%40microsoft.com%7C16975a9f0938476a816008d80a829366%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637270903218109780&sdata=FAWbK3j98m%2Bde2G8lG9JDm4RxXTJ%2FakQyYnZ0xfAXtA%3D&reserved=0.

carlosmag commented 4 years ago

Still, considering that there is a low % of genetic diversity between Mycobacterium tuberculosis strains, I would expect the great majority of the reads to align without soft clipping.

For instance, for input without sequencing adapters, bbmap managed to map 80% of the reads (73% properly paired) in its perfect mode (no substitutions or indels allowed relatively to the reference) and using global alignment only. Parameters were local=f perfectmode=t fast=t. If you want, I can provide a link for trimmed fastq files.

Thanks for your time.

bolosky commented 4 years ago

I looked at a random section of output from bwa-mem2 output, and 10/10 reads had significant clipping: 122S29M 87M64S 64S87M 85M66S 77M74S 53M98S 91M60S 66S85M 74S77M 98S53M

Some of these have a bunch of obvious garbage in the reads. For instance, the read for the first one is CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTAGGCCCGCGGCCAGCGCCGCGTCCAGGTG, the first part of which is clearly fishy, especially for a prokaryote, but BWA clipped much more than just the repeated bases. BWA actually aligned AGGCCCGCGGCCAGCGCCGCGTCCAGGTG.

Looking at a couple of other randomly selected sections, it’s not actually 100% with lots of the read clipped, but it’s a large fraction of them.

I’d be happy to look at the trimmed FASTQ.

Maybe we should think about adding aggressive soft clipping for reads that don’t align, but I’ve been somewhat reluctant to do that because it just seems so arbitrary.

Do you have an opinion, Arun?

--Bill

From: Carlos Magalhães notifications@github.com Sent: Thursday, June 18, 2020 5:04 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Low % of properly paired reads (#125)

Still, considering that there is a low % of genetic diversity between Mycobacterium tuberculosis strains, I would expect the great majority of the reads to align without soft clipping.

For instance, for trimmed input, bbmaphttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fjgi.doe.gov%2Fdata-and-tools%2Fbbtools%2Fbb-tools-user-guide%2Fbbmap-guide%2F&data=02%7C01%7Cbolosky%40microsoft.com%7Ce425695869a94e65161f08d813e436dc%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281218189129180&sdata=VZiaU%2FiqOYI63ClRpPJ5m1%2FpcoiimAmtbLZrrirMPwA%3D&reserved=0 managed to map 80% of the reads (73% properly paired) in its perfect mode (no substitutions or indels allowed relatively to the reference) and using global alignment only. Parameters were local=f perfectmode=t fast=t. If you want, I can provide a link for trimmed fastq files.

Thanks for your time.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F125%23issuecomment-646363919&data=02%7C01%7Cbolosky%40microsoft.com%7Ce425695869a94e65161f08d813e436dc%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281218189139173&sdata=W9BVT%2Bz1acBSP7SosqKn7XNZZxntzc7VVeMdqcqyyfk%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWPPDUUUUKATVWJZXPLRXKTNRANCNFSM4NWTETEA&data=02%7C01%7Cbolosky%40microsoft.com%7Ce425695869a94e65161f08d813e436dc%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281218189139173&sdata=4OVgRgT0xBN3ioGiMudorx66dB9WSzc9IgPU9JSy%2BH0%3D&reserved=0.

carlosmag commented 4 years ago

Thanks for your insights! I might need to look for another genome to test.

Trimmed fastq files are here. You might have to right click them for regular download...

bolosky commented 4 years ago

I ran the trimmed version and SNAP mapped about 90% of the reads with MAPQ >= 10 and 63.10% paired (which to SNAP means it used the paired-end aligner and didn’t have to fall back to aligning the reads using the single-end aligner; SNAP is really two completely different aligners).

6% of the total reads were rejected because they’re too short and/or had too many Ns in them. Also, SNAP has a default hard min/max for paired-end spacing of 1 to 1000 bases.

I told SNAP to align reads down to 30 bases in length (-mrl 30) and to use a paired-end spacing range of 0 to 5000 (-s 0 5000) bases and it aligned 94.5% of reads at MAPQ >= 10, and only skipped 0.98% for being too short or having too many Ns. It did 71.67% with the paired-end aligner.

I’m running a newer version of SNAP than you have, it’s a version of a new release that we’re working on with tons of improvements, so you won’t get these exact results, but yours shouldn’t be that different.

d:\temp>D:\sequence\snap-affineGap3\obj\bin\Release\x64\snap.exe paired mtb_ancestor-20 -o ERR2653038_trimmed.bam -so -sm 60 ERR2653038_R1_bbduk.fastq.gz ERR2653038_R2_bbduk.fastq.gz -s 0 5000 -mrl 30 Welcome to SNAP version 1.0dev.101 (affine).3.

Loading index from directory... 0s. 4412532 bases, seed size 20 Aligning. sorting...125: MergeSortThread 33228, deleteSortBlockVector 0 12782: Thread 33228 read 1394ms, write 8109ms sorted 7368336 reads in 16 blocks, 13 s read wait align 0.000 s + merge 0.001 s, read release align 0.000 s + merge 0.089 s write wait 3.439 s align + 0.189 s merge, write filter 18.230 s align + 3.677 s merge Total Reads Aligned, MAPQ >= 10 Aligned, MAPQ < 10 Unaligned Too Short/Too Many Ns %Pairs Reads/s Time in Aligner (s) 7,368,336 6,962,840 (94.50%) 86,969 (1.18%) 246,376 (3.34%) 72,151 (0.98%) 71.67% 344,218 21

--Bill

From: Carlos Magalhães notifications@github.com Sent: Thursday, June 18, 2020 6:04 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Low % of properly paired reads (#125)

Thanks for your insights! I might need to look for another genome to test.

Trimmed fastq files are herehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmega.nz%2Ffolder%2FdERWDI6I%2337xE-kQd_NjtP06-xEyVAg&data=02%7C01%7Cbolosky%40microsoft.com%7Cc21bb27bc091486f7b8b08d813ec94fa%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281254115118575&sdata=7pc27k1DD1SYU%2FyGMZE55yzBDDgUZAqExb0BTBZtumk%3D&reserved=0. You might have to right click them for regular download...

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F125%23issuecomment-646378966&data=02%7C01%7Cbolosky%40microsoft.com%7Cc21bb27bc091486f7b8b08d813ec94fa%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281254115128566&sdata=%2FGp7sdhb0T%2BYqsyg2eZU2qtYbUF0JoC3xf40OmRl19k%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWN5YF3DUODMXRZXSTLRXK2OFANCNFSM4NWTETEA&data=02%7C01%7Cbolosky%40microsoft.com%7Cc21bb27bc091486f7b8b08d813ec94fa%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281254115138566&sdata=xyeIUq4ERn5yojrzonQXC%2FxZJVfNcN3loKlHn8romM4%3D&reserved=0.

carlosmag commented 4 years ago

SNAP has a default hard min/max for paired-end spacing of 1 to 1000 bases

Maybe the problem is that the genome was sequenced on Illumina MiSeq and that reads actually overlap?

In the meantime, I passed bwa mem output to samclip --max 0 to remove clipped alignments and got ≃ 90% mapped reads, but only ≃ 25% properly paired with SNAP version 1.0beta.18. This is roughly the same value as with clipped SAM input.

bolosky commented 4 years ago

That’s a really old version, from 2015, though looking at the changelog it’s not clear why it should be any different for the stuff you’re doing, mostly there have been just bugfixes since then.

Probably your best bet is to run the dev version, which you’ll have to build from source:

git clone https://github.com/amplab/snap.git cd snap git checkout dev make

You should get 1.0dev.102.

I just ran that version to make sure that the difference wasn’t in the new code that we’re working on, and got about the same results:

d:\temp> D:\sequence\snap-dev\obj\bin\Release\x64\snap.exe paired mtb_ancestor-20 -o ERR2653038_trimmed_dev102.bam -so -sm 60 ERR2653038_R1_bbduk.fastq.gz ERR2653038_R2_bbduk.fastq.gz -s 0 5000 -mrl 30 Welcome to SNAP version 1.0dev.102.

Loading index from directory... 0s. 4412532 bases, seed size 20 Aligning. sorting...sorted 7368336 reads in 16 blocks, 12 s read wait align 0.000 s + merge 0.001 s, read release align 0.000 s + merge 0.088 s write wait 2.030 s align + 0.188 s merge, write filter 19.527 s align + 3.285 s merge Total Reads Aligned, MAPQ >= 10 Aligned, MAPQ < 10 Unaligned Too Short/Too Many Ns %Pairs Reads/s Time in Aligner (s) 7,368,336 6,995,647 (94.94%) 92,596 (1.26%) 207,942 (2.82%) 72,151 (0.98%) 71.19% 381,541 19

So I’m not sure what’s happening with you. I didn’t do anything fancy for the index, either, it’s just the default parameters.

Did you use -mrl to tell it not to ignore short reads?

--Bill

From: Carlos Magalhães notifications@github.com Sent: Thursday, June 18, 2020 7:13 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Low % of properly paired reads (#125)

SNAP has a default hard min/max for paired-end spacing of 1 to 1000 bases

Maybe the problem is that the genome was sequenced on Illumina MiSeq and that reads actually overlap?

In the meantime, I passed bwa mem output to samcliphttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ftseemann%2Fsamclip&data=02%7C01%7Cbolosky%40microsoft.com%7C46157316478d4754b20a08d813f640ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281295653990378&sdata=Qgw%2B5cmyHLYYRFzh932RV4hfZnQbx0n5kE0wlkczS5A%3D&reserved=0 --max 0 to remove clipped alignments and got ≃ 90% mapped reads, but only ≃ 25% properly paired. This is roughly the same value as the untrimmed input.

SNAP version 1.0beta.18.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F125%23issuecomment-646395691&data=02%7C01%7Cbolosky%40microsoft.com%7C46157316478d4754b20a08d813f640ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281295653995368&sdata=PDWs%2B1F8DkA2JNKFJhGAgJwUiinTk4vD0Va8WiSLLeY%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWKAACZ5LGILW45C6L3RXLCRZANCNFSM4NWTETEA&data=02%7C01%7Cbolosky%40microsoft.com%7C46157316478d4754b20a08d813f640ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281295654000359&sdata=wjEBEK1gnB1%2F54z8tqHXK9fYi5JSjVkYTE3N0aTEzk0%3D&reserved=0.

carlosmag commented 4 years ago

Thanks for the detailed feedback!

I found the key parameter was spacing between paired-end reads. Setting -s 0 5000 increased % properly paired reads to ≃ 71% (≃ 97% mapped reads). -mrl 30 had negligible effect (≃2%). Still, there are about a quarter less properly paired reads comparatively to bwa mem.

Will test latest dev version and consider it for large scale analyses.

bolosky commented 4 years ago

I’m not sure focusing on properly paired is necessarily the right metric.

SNAP has a paired-end aligner that works by looking for places where there are seed hits for both ends of the read pair near to one another (which it does by intersecting the hit sets for the seeds it selects). When it finds such a place, it evaluates the possible alignments around there. If it can’t find a good (or any) alignment that way, then it feeds both ends of the pair to the single-end aligner and uses what it finds. SNAP marks a read pair as properly paired if and only if it uses the alignment from the paired-end aligner and doesn’t have to fall back on the single-end aligner. One thing that can affect this is that SNAP’s paired-end aligner will never find alignments that are FF or RR, it only finds FR and RF pairs. For a bacterial genome, especially one that’s not exactly from the same organism, you might expect to have inversions that will result in FF or RR pairs, which SNAP will never mark as “properly paired” but can still align correctly.

AFAIK other aligners pretty much always use a process like SNAP’s fallback, where they find the top few single-end alignments for each half of the pair using a single-end aligner and then choose among them, and then mark them as properly paired if they’re close to one another (or something; I’m not sure exactly what they do).

--B

From: Carlos Magalhães notifications@github.com Sent: Thursday, June 18, 2020 8:00 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Low % of properly paired reads (#125)

Thanks for the detailed feedback!

I found the key parameter was spacing between paired-end reads. Setting -s 0 5000 increased % properly paired reads to ≃ 71% (≃ 97% mapped reads). -mrl 30 had negligible effect (≃2%). Still, there are about a quarter less properly paired reads comparatively to bwa mem.

Will test latest dev version and consider it for large scale analyses.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F125%23issuecomment-646407596&data=02%7C01%7Cbolosky%40microsoft.com%7Cd0d38646f9a7421949b908d813fcd486%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281323900721293&sdata=nEuP5JhOwudl89UHL1sfmRhoOski0TTcvbcw0YO%2Fbto%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWKE22L7GJVC43QVPCTRXLICJANCNFSM4NWTETEA&data=02%7C01%7Cbolosky%40microsoft.com%7Cd0d38646f9a7421949b908d813fcd486%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637281323900731286&sdata=mxe6ZOMCBBXNF71Kfza4RSacm20LrKe6Nb4ItLBnj5A%3D&reserved=0.

carlosmag commented 4 years ago

Thanks for the algorithmic description!

I am not aware of inversions in M. tuberculosis. Considering that some variant calling tools (such as Pilon) discard reads without 'properly paired flag', I was afraid of some sort of impact. I guess only further testing might tell the difference...