amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

-so option seems to sort in reverse order #131

Closed haiqu closed 2 years ago

haiqu commented 3 years ago

Version 1.0.0 code, Windows 64-bit.

Today I created a GRCh38 file from Dante Labs .fasta input, and I applied the sort option. On checking the files at https://qual.iobio.io/ I find that there are two strange issues compared to the hs37d5 files supplied by Dante.

1) Chromosomes appear in the reverse order, with ChrY at top and Chr1 at bottom. 2) The first column of the report is full of rubbish.

Not sure whether the second issue is at their end, but the first almost certainly isn't.

WGStoHG38-GRCh38-Qual-iobio

haiqu commented 3 years ago

Additional data:

1) Attempting to examine the files at https://bam.iobio.io/ failed unexpectedly. 2) Attempting to use the files in WGSExtract failed unexpectedly.

bolosky commented 3 years ago

The "rubbush" is because there are spaces or tabs in the headers in the FASTA file that you built the index from, and SNAP v1.0.0 converts them to underscores and includes them in the contig name by default. When you build the index you need to tell it to treat spaces as separators by using the -bSpace option. This is probably also the cause of the unexpected failures of the tools to process the results, though I can't say for sure until you try it.

Or rebuild the index with v1.0.2. I changed the default for index build to have -bSpace on, because I kept doing what you did and the default behavior was confusing.

In terms of the chromosomes coming out in a different order from the FASTA, there's no requirement in the spec that order of chromosomes be preserved, at least as far as I've seen. Do you know of downstream tools that really need it? The SAM/BAM spec just says that the order needs to be the same as the order of the @SQ lines in the header.

From: Rob Judd notifications@github.com Sent: Thursday, December 17, 2020 5:55 AM To: amplab/snap snap@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [amplab/snap] -so option seems to sort in reverse order (#131)

Version 1.0.0 code, Windows 64-bit.

Today I created a GRCh38 file from Dante Labs .fasta input, and I applied the sort option. On cheking the files at https://qual.iobio.io/https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fqual.iobio.io%2F&data=04%7C01%7Cbolosky%40microsoft.com%7Cdd56a88103d3425a866d08d8a293530f%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637438100931313719%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=uNxGyegTEOYkCSMc2outQoXQXrcVnCJOwnzK4Ei3eGs%3D&reserved=0 I find that there are two strange issues compared to the hs37d5 files supplied by Dante.

  1. Chromosomes appear in the reverse order, with ChrY at top and Chr1 at bottom.
  2. The first column of the report is full of rubbish.

Not sure whether the second issue is at their end, but the first almost certainly isn't.

[WGStoHG38-GRCh38-Qual-iobio]https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fuser-images.githubusercontent.com%2F13131219%2F102496213-3bc16180-40cb-11eb-91b6-632d5ad66b7d.jpg&data=04%7C01%7Cbolosky%40microsoft.com%7Cdd56a88103d3425a866d08d8a293530f%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637438100931323715%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=RDplgxPmA58vBBg7KkhkNhtKE6%2FS4HBYp0qMGXK3uMA%3D&reserved=0

- You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F131&data=04%7C01%7Cbolosky%40microsoft.com%7Cdd56a88103d3425a866d08d8a293530f%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637438100931323715%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=zWr8DVDQGoBFaCmSGXtW5ceSbMCN5GDg1NIm1zYGAMM%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWOJGURX5EC62VICYILSVIESVANCNFSM4U7WJY6A&data=04%7C01%7Cbolosky%40microsoft.com%7Cdd56a88103d3425a866d08d8a293530f%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637438100931333709%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Y%2F8v3PpDb2o1uy8xqpydThfkRCZERc%2FkbQ6D73nZ9Bs%3D&reserved=0.

bolosky commented 3 years ago

I looked at the code, and it's not all that hard to make the contigs come out in the same order as the FASTA with one exception. The exception is that all the ALT contigs have to come after all of the non-ALT contigs.

Internally, SNAP treats the genome as a single namespace. It keeps one 64 bit number for a genome location rather than contig:pos, and then converts it to contig:pos on output. To tell if a location is in an ALT contig it just compares its value to the beginning of the ALT contigs. This is much faster than having to go and look up a data structure ever time it's trying to make the determination if something is an ALT. This wouldn't work if the ALTs came in an arbitrary order.

So...if there's a good reason why the contigs need to be in the same order as in the FASTA, but it will still work if the ALTs are reordered to the end, let me know and I'll fix it. It'll take a half hour to do. If they really need to be in the same order as the FASTA including the ALTs, let me know that, too, and I'll think about it, but that's way more work to change.

--Bill

From: Rob Judd notifications@github.com Sent: Thursday, December 17, 2020 6:01 AM To: amplab/snap snap@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [amplab/snap] -so option seems to sort in reverse order (#131)

Additional data:

  1. Attempting to examine the files at https://bam.iobio.io/https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbam.iobio.io%2F&data=04%7C01%7Cbolosky%40microsoft.com%7C7b1f8ec7db9b4e7f950808d8a29435ed%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637438104737558030%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=k5BWTkDf4JA%2F%2Fhd7TFKV%2FU7VfpWrXaNR%2FqYxvWAJoIs%3D&reserved=0 failed unexpectedly.
  2. Attempting to use the files in WGSExtract failed unexpectedly.

- You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F131%23issuecomment-747455841&data=04%7C01%7Cbolosky%40microsoft.com%7C7b1f8ec7db9b4e7f950808d8a29435ed%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637438104737558030%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=I%2FFmieVzkQ3n%2F%2Faf1dTDOJDMLz3172ezIhIYOQmWTpA%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWLWEWNNCBGYAUK2QNLSVIFKPANCNFSM4U7WJY6A&data=04%7C01%7Cbolosky%40microsoft.com%7C7b1f8ec7db9b4e7f950808d8a29435ed%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637438104737568024%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=tkXDjynhNGNFUlbeYrJOetVPEXsBrFiVnvnpwUV1I38%3D&reserved=0.

haiqu commented 3 years ago

I've built 1.0.2 and will test the output in both qual.iobio and WGSExtract. If there are any further issues I'll let you know. Thanks for the tip about -bSpace I seem to have missed that in the docs.

Rob

haiqu commented 3 years ago

Hi Bill,

image

Ah, that's more like it. Testing in WGSExtract was also successful, so this issue is technically resolved. I'd like to change the sorting issue to a feature request though, since having them in the reverse order triggers my CDO.[1]

I have no opinion about the alt contigs, because I don't use them. As Heng Li wrote[2] in 2017:

"Inclusion of ALT contigs. ALT contigs are large variations with very long flanking sequences nearly identical to the primary human assembly. Most read mappers will give mapping quality zero to reads mapped in the flanking sequences. This will reduce the sensitivity of variant calling and many other analyses. You can resolve this issue with an ALT-aware mapper, but no mainstream variant callers or other tools can take the advantage of ALT-aware mapping."

Rob

[1] It's like OCD, but in the correct alphabetical order. [2] https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

bolosky commented 3 years ago

Turns out that spec notwithstanding the Dragen variant caller does require that the @SQ lines in the BAM file be in the same order as in the reference FASTA, so I implemented it. It'll be in the next release. So, your OCD can relax the smallest amount. :-)

bolosky commented 2 years ago

Fixed in 2.0.