Parallelization - Githubissues

mesamuels commented 2 years ago

I am trying to run RepeatMasker on the ComputeCanada supercomputer system. With mostly default parameters it is taking a very long time: it has timed out several times, and I am now up to 24 hours with it only halfway through the first stage of working through the 'batches' (whatever those are). I have access to a lot of CPU power and storage, so I can easily parallelize, but the instructions are not clear on how to do this. I suppose the relevant parameter is the -pa switch, but it is not clear exactly what it does or what values are good to use. I also have to allocate the appropriate amount of memory and threads from the system; otherwise telling the program to parallelize won't do any good. Currently it's running on 4 of our CPUs with a total of 16 GB of memory. I can ask for a lot more, but it is a waste of resources to request tons of memory that the program won't use anyway.

My dataset is a new assembly of short-read data from a native plant species, built with Platanus. The N50 is around 6 kb, although a handful of scaffolds are tens of kilobases long. Nothing is bigger than 150 kb, I think.

Can you clarify how to speed things up by running more processes/threads/etc?

Thanks! Mark Samuels, Associate Professor in Medicine, University of Montreal

rmhubley commented 2 years ago

Did you get this to work on your system? I have a Nextflow workflow that works with SLURM (and other schedulers), if that might be useful. I typically break the genome up into 50MB non-overlapping batches and then run these as parallel jobs on 12-core nodes (using '-pa 12' for each RepeatMasker run). The workflow then rejoins the batches, correcting the coordinates, etc. Let me know if you still need any help with this.
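For readers without the Nextflow workflow, the batching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow's actual code: the function names (`read_fasta`, `split_into_batches`, `repeatmasker_command`) and the `batch_NNNN.fa` file naming are made up for this example, the batch size is counted in bases, and sequences are kept whole rather than cut at exact batch boundaries. The only RepeatMasker option used is `-pa`, with the value 12 taken from the comment above.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_into_batches(fasta: Path, out_dir: Path,
                       batch_bp: int = 50_000_000) -> List[Path]:
    """Write whole sequences into batch files of roughly batch_bp bases.

    Assumption: sequences are never split, so a single sequence longer
    than batch_bp ends up alone in an oversized batch.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle, size = None, 0
    for header, seq in read_fasta(fasta):
        # Open a new batch file when adding this sequence would exceed the target.
        if handle is None or (size > 0 and size + len(seq) > batch_bp):
            if handle is not None:
                handle.close()
            path = out_dir / f"batch_{len(paths):04d}.fa"  # hypothetical naming
            paths.append(path)
            handle, size = open(path, "w"), 0
        handle.write(f">{header}\n")
        for i in range(0, len(seq), 60):  # wrap sequence lines at 60 columns
            handle.write(seq[i:i + 60] + "\n")
        size += len(seq)
    if handle is not None:
        handle.close()
    return paths


def repeatmasker_command(batch: Path, pa: int = 12) -> List[str]:
    """Build the argument list for one RepeatMasker run on one batch file."""
    return ["RepeatMasker", "-pa", str(pa), str(batch)]
```

Each returned command would then be submitted as its own scheduler job, with the job's core request matched to the `-pa` value; how many cores one unit of `-pa` actually occupies can depend on the search engine, so check the RepeatMasker help for your version. Because this sketch keeps every scaffold whole, the per-batch results keep their original coordinates. A workflow that cuts long sequences into pieces needs the coordinate correction mentioned above when rejoining.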

mesamuels commented 2 years ago

Robert, sorry I did not reply right away, I'm just about recovered from a bout of COVID (no idea how I got it, I had my fourth shot just over a month before). I did get things working ok eventually.

thanks, Mark

Dfam-consortium / RepeatMasker

Parallelization #154