amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

Documentation out of date #84

Closed MrOlm closed 7 years ago

MrOlm commented 7 years ago

Hello,

I'm trying to understand the output created by SNAP after mapping. For example:

Welcome to SNAP version 1.0beta.23.

Loading index from directory... 11s.  3106195646 bases, seed size 20
Aligning.
Total Reads    Aligned, MAPQ >= 10    Aligned, MAPQ < 10     Unaligned              Too Short/Too Many Ns  Filtered               %Pairs    Reads/s   Time in Aligner (s)
33,913,768     25,741,236 (75.90%)    6,184,244 (18.24%)     1,513,736 (4.46%)      0 (0.00%)              474,552 (1.40%)        63.82%    153,476   221

The documentation does describe the output statistics (at the very end of the document), but they are different columns than are used in current program.

Specifically I am wondering what the "Filtered" column means in the current output statistics?

Thank you, -Matt

bolosky commented 7 years ago

It’s the number of reads that didn’t get written because they failed the output filter, which you get if you specify –F or –E. Judging from the rest of your output, which shows 0 too short/too many Ns, you probably specified –F l (Or equivalently –E smu).

And you’re right, I should update the documentation. This is far from the only thing that’s out of date in it.

--Bill

From: Matt Olm [mailto:notifications@github.com] Sent: Thursday, February 23, 2017 10:08 AM To: amplab/snap snap@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [amplab/snap] Documentation out of date (#84)

Hello,

I'm trying to understand the output created by SNAP after mapping. For example:

Welcome to SNAP version 1.0beta.23.

Loading index from directory... 11s. 3106195646 bases, seed size 20

Aligning.

Total Reads Aligned, MAPQ >= 10 Aligned, MAPQ < 10 Unaligned Too Short/Too Many Ns Filtered %Pairs Reads/s Time in Aligner (s)

33,913,768 25,741,236 (75.90%) 6,184,244 (18.24%) 1,513,736 (4.46%) 0 (0.00%) 474,552 (1.40%) 63.82% 153,476 221

The documentation does describe the output statistics (at the very end of the document), but they are different columns than are used in current program.

Specifically I am wondering what the "Filtered" column means in the current output statistics?

Thank you, -Matt

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F84&data=02%7C01%7Cbolosky%40microsoft.com%7C17d66e38a36447109c9b08d45c16dab8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636234700608461050&sdata=RJA6uaIBBR5qv9uOTRqo%2FY6U84pmUGWrfMzrPwKBvaE%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752fYodx59LR6_ZKIq72UGUDxCtwZ3ks5rfcrogaJpZM4MKTcr&data=02%7C01%7Cbolosky%40microsoft.com%7C17d66e38a36447109c9b08d45c16dab8%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636234700608461050&sdata=04XyrItParaTQZjU5kJmufFfFuvnBJy7snRDn18V2wM%3D&reserved=0.

MrOlm commented 7 years ago

Hi Bill,

Thank you for the quick reply. My command was as follows:

snap-aligner paired all_pangenomes_alpha.fasta.fa b003-d011.r1.fq.gz b003-d011.r2.fq.gz -t 48 -F a -o -sam -

So I was using the -F flag, but only filtering out those that don't align. So in that case I would expect the "Filtered" and the "Unaligned" to be the same- no?

Thanks again, -Matt

bolosky commented 7 years ago

No. The total of the different classes always add up to the total number of reads, so no read will ever be in two different columns. What’s happening in your case is that the reads in the “unaligned” column have mates that aligned, so they’re not getting filtered, while the reads in the “filtered” column had both ends unmapped. The “unaligned” ones get written into the output file while the “filtered” ones don’t.

If SNAP’s paired-end aligner can’t find an alignment for a read pair, then it sends both ends to the single-end aligner. If only one of those aligns, then it’ll write them both with one unaligned, but it won’t trip the –F a filter because that would cause you to lose the one aligned read.

You can also say that the filter condition must apply to both ends of a read by adding –F b along with –F a (or doing the equivalent thing with –E, which would be –E smxb or –E smb depending on whether you want to keep reads that were too short or had too many Ns, but your input doesn’t seem to have any of those so it doesn’t matter for you).

There’s no way to tell SNAP to emit only one end of a mate pair when the other end doesn’t align, which I guess there probably should be.

--B

From: Matt Olm [mailto:notifications@github.com] Sent: Thursday, February 23, 2017 5:48 PM To: amplab/snap snap@noreply.github.com Cc: Bill Bolosky bolosky@microsoft.com; Comment comment@noreply.github.com Subject: Re: [amplab/snap] Documentation out of date (#84)

Hi Bill,

Thank you for the quick reply. My command was as follows:

snap-aligner paired all_pangenomes_alpha.fasta.fa b003-d011.r1.fq.gz b003-d011.r2.fq.gz -t 48 -F a -o -sam -

So I was using the -F flag, but only filtering out those that don't align. So in that case I would expect the "Filtered" and the "Unaligned" to be the same- no?

Thanks again, -Matt

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F84%23issuecomment-282179728&data=02%7C01%7Cbolosky%40microsoft.com%7Cfb5d198966384e67e92708d45c57390e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636234977058967688&sdata=BVGpGV2yp9m5eEU7CEgde4I1DyTuPZi1lAevmXxJV3c%3D&reserved=0, or mute the threadhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA752VAEp7rGzOqyUacG23kh38iGEmTDks5rfjbogaJpZM4MKTcr&data=02%7C01%7Cbolosky%40microsoft.com%7Cfb5d198966384e67e92708d45c57390e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636234977058967688&sdata=dgq8Zl4wcuTyOgpkLdJFv9YOESobBOwIXRGoGAJQBfw%3D&reserved=0.

MrOlm commented 7 years ago

This has answered my question completely. Thanks again!