amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

building index #140

Closed biolxy closed 2 years ago

biolxy commented 2 years ago

How can I build an index for a 130 GB nt.fa file on a Linux machine with only 300 GB of RAM? Thanks.

bolosky commented 2 years ago

You can try the -sm (small memory) option when building the index. It may be that you don’t have enough RAM, though.

sfederman commented 2 years ago

You can try Bill’s method - but as he mentioned, I suspect you will not have enough RAM to make a single nt index, nor perform an alignment to all of nt with 300GB of RAM.

Alternatively, you can split the nt FASTA into several chunks (you’ll need to play with this on your system to figure out the optimal number), and make a separate index for each chunk. When you perform the alignment, you’ll then want to have some code to collate the alignments from the individual index chunks and come up with hits across all of the chunks.

As for splitting the nt FASTA into chunks, this can be done taxonomically (e.g. virus, bacteria, etc…), or randomly depending on what makes the most sense for your project.
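
As a rough illustration of the size-based split, here is a minimal Python sketch. It assumes a plain-text FASTA file; the function name `split_fasta`, the `chunk_NNNN.fa` naming, and the `max_bases_per_chunk` parameter are hypothetical and not part of SNAP. Records are kept whole, so a chunk can exceed the limit by up to one record.

```python
import os


def split_fasta(fasta_path, out_dir, max_bases_per_chunk):
    """Split a FASTA file into chunk files without breaking any record.

    A new chunk is started at the next record header once the current
    chunk already holds at least max_bases_per_chunk bases.
    Returns the list of chunk file paths, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    out = None
    bases_in_chunk = 0
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Only switch files at a record boundary.
                if out is None or bases_in_chunk >= max_bases_per_chunk:
                    if out is not None:
                        out.close()
                    name = f"chunk_{len(chunk_paths):04d}.fa"
                    path = os.path.join(out_dir, name)
                    chunk_paths.append(path)
                    out = open(path, "w")
                    bases_in_chunk = 0
            elif out is None:
                # Ignore any text that precedes the first header.
                continue
            else:
                bases_in_chunk += len(line.strip())
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Each chunk file would then be indexed on its own with SNAP's index command. The right chunk size depends on your RAM and has to be found by trial. A taxonomic split would need a mapping from sequence ID to taxon, which this sketch does not cover.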

biolxy commented 2 years ago

Thanks for your reply. Yes, I don't have enough memory, and even with the -sm parameter I still can't create the index.

biolxy commented 2 years ago

Thank you for your reply. I think your suggestion is feasible. I have a question: SNAP's default is to produce a single best alignment for each read that it maps. If I split the nt FASTA and make a separate index for each chunk, the same read may get a best alignment against each index. In that case, which of these alignment results is the most reliable?

bolosky commented 2 years ago

Yes, it will try to find the best alignment against each index. I suspect that most of the time it will still find only one. When it does find more, probably the easiest thing to look for is which one is the closest match. Look at the NM tag for a quick idea of that (it shows edit distance, which doesn't correspond exactly to the best match because it scores indels differently, but it's close). If you find more than one with the same distance, then the read probably would have aligned with low mapping quality.
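
The NM comparison could be sketched roughly as follows in Python. This assumes each chunk alignment was written as an uncompressed SAM file and that mapped records carry an `NM:i:` tag; the function name `best_hits` and the returned structure are hypothetical, not part of SNAP. Ties on edit distance are all kept so they can be treated as ambiguous.

```python
def best_hits(sam_paths):
    """Collate per-chunk SAM files and keep the lowest-NM hit(s) per read.

    Returns a dict mapping (read_name, mate_bits) to a tuple
    (nm, hits), where hits is a list of (sam_path, sam_line) pairs that
    share the lowest edit distance. More than one hit means a tie.
    """
    best = {}
    for path in sam_paths:
        with open(path) as sam:
            for line in sam:
                if line.startswith("@"):
                    continue  # header line
                record = line.rstrip("\n")
                fields = record.split("\t")
                if len(fields) < 11:
                    continue  # not a complete SAM record
                flag = int(fields[1])
                if flag & 0x4:
                    continue  # read is unmapped against this chunk
                if flag & 0x900:
                    continue  # skip secondary and supplementary records
                nm = None
                for tag in fields[11:]:
                    if tag.startswith("NM:i:"):
                        nm = int(tag[5:])
                        break
                if nm is None:
                    continue  # no edit distance to compare
                # 0x40 / 0x80 distinguish the two mates of a pair.
                key = (fields[0], flag & 0xC0)
                current = best.get(key)
                if current is None or nm < current[0]:
                    best[key] = (nm, [(path, record)])
                elif nm == current[0]:
                    current[1].append((path, record))
    return best
```

Reads whose entry holds more than one hit are the equal-distance case described above and are probably best treated as low-confidence. For paired-end data you may prefer to compare the summed NM of both mates rather than each mate independently; this sketch does not do that.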

biolxy commented 2 years ago

Thank you for your help. I'll try it @bolosky