amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
288 stars 66 forks source link

feature request: customizable quality threshold for -C #161

Closed eboyden closed 1 year ago

eboyden commented 2 years ago

Self-explanatory. Being able to trim low-quality bases from the front and back of reads is nice, but ideally the user would be able to determine what "low-quality" means, especially since the manner in which different sequencing platforms bin quality scores may affect the behavior.

bolosky commented 2 years ago

This is 2.0.2.dev.8. Please try it out and see if it does what you wanted.

eboyden commented 2 years ago

For the most part it does, thanks. Note that I'm running SNAP on a Linux server, so specifying Phred qualities as ASCII characters was a bit tricky, but using backslashes to interpret them literally worked (e.g. -cc \#\/) - it might be worth mentioning this in the documentation.

I did see a few cases where it didn't seem to take, not sure why. When using -C-+ -cc \#\/, most alignments soft-clipped the trailing / as so:

MN01688:12:000H3MG5N:1:13102:11696:12762 163 chr1 185721 29 119M1S = 186012 410 GAGACAGCGGCGGTTTGAGGAGCCACCTCCCAGCCACCTCGGGGCCAGGGCCAGGGTGTGCAGCACCACTGTACTATGGGGAAACTGGCCCAGAGAGGTGAGGCAGCTTGCCTGGGGTCA FFFAFFFFFFFFFFFFFFFFAFFAAFFFFFAFFAFFFFFAFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFF/

But occasionally they did not:

MN01688:12:000H3MG5N:1:11106:26456:13326 83 chr1 377521 3 120M = 377400 -241 TTGCTGAATGTTAATTCAGAAATGAAATTAAAATTTTAAATTAACAACAAGCAACTTTACAAGAGGAAAAAAAAAAACCTCATTTCCTCCCCACAAAGCCACCTCATGAGCCTGGGTGGT FFFF/FFF/A//FAAFFFA/FFF/FFAF/FF/FFFAFFFAFFFFA//FFAFFFF6/FFAF//FAFFFFF66F6FF/F/F/A6FFFFFF/FFFFFFA/AFF/FFFFFFAFFFFFFFFAFF/

bolosky commented 2 years ago

It clips on input processing. The read that you show as an example is mapped RC (relative to the input), so it didn't clip because the base call score at the end of the read was F before it reversed and you did -C-+

I'll update the documentation to mention the need to quote the characters in Unix. You can also just put them in single quotes, so '#/' instead of #\/.

--Bill

From: Eric Boyden @.> Sent: Tuesday, October 11, 2022 9:41 PM To: amplab/snap @.> Cc: Bill Bolosky @.>; Comment @.> Subject: Re: [amplab/snap] feature request: customizable quality threshold for -C (Issue #161)

For the most part it does, thanks. Note that I'm running SNAP on a Linux server, so specifying Phred qualities as ASCII characters was a bit tricky, but using backslashes to interpret them literally worked (e.g. -cc #\/) - it might be worth mentioning this in the documentation.

I did see a few cases where it didn't seem to take, not sure why. When using -C-+ -cc #\/, most alignments soft-clipped the trailing / as so:

MN01688:12:000H3MG5N:1:13102:11696:12762 163 chr1 185721 29 119M1S = 186012 410 GAGACAGCGGCGGTTTGAGGAGCCACCTCCCAGCCACCTCGGGGCCAGGGCCAGGGTGTGCAGCACCACTGTACTATGGGGAAACTGGCCCAGAGAGGTGAGGCAGCTTGCCTGGGGTCA FFFAFFFFFFFFFFFFFFFFAFFAAFFFFFAFFAFFFFFAFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFF/

But occasionally they did not:

MN01688:12:000H3MG5N:1:11106:26456:13326 83 chr1 377521 3 120M = 377400 -241 TTGCTGAATGTTAATTCAGAAATGAAATTAAAATTTTAAATTAACAACAAGCAACTTTACAAGAGGAAAAAAAAAAACCTCATTTCCTCCCCACAAAGCCACCTCATGAGCCTGGGTGGT FFFF/FFF/A//FAAFFFA/FFF/FFAF/FF/FFFAFFFAFFFFA//FFAFFFF6/FFAF//FAFFFFF66F6FF/F/F/A6FFFFFF/FFFFFFA/AFF/FFFFFFAFFFFFFFFAFF/

- Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F161%23issuecomment-1275579268&data=05%7C01%7Cbolosky%40microsoft.com%7C131ba596aa594809eae408daac0bf124%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638011464534889707%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=GrawEg5ZTurTvUG1wp7Cw28nvqAISnWS555cmHnnMBM%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWOFSIMQDQBA6YFVDWTWCY6NHANCNFSM6AAAAAAQ5BHJHM&data=05%7C01%7Cbolosky%40microsoft.com%7C131ba596aa594809eae408daac0bf124%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638011464534889707%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=8OYHM6aeX%2Bm%2FgUtY6aYQ0IlbxvT8iOp2bgLJSc6LOgo%3D&reserved=0. You are receiving this because you commented.Message ID: @.**@.>>

eboyden commented 2 years ago

Of course, silly me, thanks for clarifying. Incidentally, a single quote is Phred score 6; presumably that would need to be backslashed?

bolosky commented 2 years ago

Yes, you have to use a backslash to get a single quote.

From: Eric Boyden @.> Sent: Tuesday, October 11, 2022 10:06 PM To: amplab/snap @.> Cc: Bill Bolosky @.>; Comment @.> Subject: Re: [amplab/snap] feature request: customizable quality threshold for -C (Issue #161)

Of course, silly me, thanks for clarifying. Incidentally, a single quote is Phred score 6; presumably that would need to be backslashed?

- Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F161%23issuecomment-1275593574&data=05%7C01%7Cbolosky%40microsoft.com%7C5cd729cb534f4ee86b0508daac0f69e4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638011479444198055%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=hBBrCoUmua4ZSZ%2FLVD7tG39E%2FZl4TkrOie72j4Hy7%2B4%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWIBSKXTKNCDHINL3VLWCZBKNANCNFSM6AAAAAAQ5BHJHM&data=05%7C01%7Cbolosky%40microsoft.com%7C5cd729cb534f4ee86b0508daac0f69e4%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638011479444198055%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Y9BHarWKMTZmPocTZxTBNSK2EdD5FSRF9t11pkOzhP8%3D&reserved=0. You are receiving this because you commented.Message ID: @.**@.>>

bolosky commented 1 year ago

In 2.0.2