superstr taking ~6 hours to process an 80GB BAM file

chrisclarkson commented 1 year ago

Hello, thank you for making this software available! I downloaded it a couple of months ago and have been trying it out. I have ~8000 WGS BAM files that I would like to process, but each one currently takes 6-8 hours with the following command:

superstr mode=bam -o ${BAM}_out -t 0.64 ${path}

Each genome is ~80GB. I saw that you have some recommendations for parallelisation. However, the xargs options are not available on the cluster that I use. Do you have any recommendations for how to parallelise or speed up the process?

Is there a later version of this software that might be faster? I am working on a SLURM HPC. Thanks again!

lfearnley commented 1 year ago

That does seem to be taking a lot longer than I'd expect, although I've encountered some delays on some HPC configurations.

It's a bit hard to offer immediate recommendations without knowing a little more about your configuration. One thing that can be done very simply is to increase the -t threshold; this will reduce the number of reads processed during repeat checking.

Are you able to share a bit more about your HPC? Is the data on HDD, SSD? Is there possibly a tape operation slowing things down a bit up front?

chrisclarkson commented 1 year ago

Hi, thank you for getting back to me! I'm not actually sure whether the data are stored on HDD or SSD. I looked in our documentation, but it does not say clearly anywhere. I will send our admin a link to this post and hopefully clarify later. The operating system is CentOS. The command lsblk --output NAME,TYPE,ROTA indicates that the HPC has a mixture of both. I am trying to parallelize across my BAM files as follows:

Does this help?

Thanks again

lfearnley commented 1 year ago

No problem. The thing with superSTR is that it needs to read through the BAM file completely, so the first port of call is to check the performance of the read operation. I've seen some spiky performance on network-attached storage under heavy load, so that's always a possibility.
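One rough way to check raw read performance is to stream a single BAM file to /dev/null and time it. This is a minimal sketch rather than a proper benchmark: the function name read_throughput is made up for illustration, timing is only to the nearest second, and a repeat run on the same file may be served from the page cache and look much faster than the storage really is.

```shell
# Rough sequential-read check: stream one file to /dev/null and report MiB/s.
# read_throughput is a hypothetical helper name, not part of superSTR.
read_throughput() {
    local file="$1"
    local bytes start end elapsed
    bytes=$(wc -c < "$file")          # file size in bytes
    start=$(date +%s)
    cat "$file" > /dev/null           # force a full sequential read
    end=$(date +%s)
    elapsed=$(( end - start ))
    # Avoid dividing by zero for files that read in under a second.
    if [ "$elapsed" -lt 1 ]; then
        elapsed=1
    fi
    echo "$(( bytes / elapsed / 1024 / 1024 )) MiB/s over ${elapsed}s"
}
```

Usage would be something like read_throughput "${path}" on a compute node, ideally on a file that has not been read recently. For scale, 80 GB in 6 hours works out to roughly 3.7 MB/s, so a result far above that would suggest raw read speed is not the whole story.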

You'd need to cd to the directory, run pwd -P to get the physical path, then check that path against the lsblk output. If the data is on network-attached storage in an HPC, the lsblk check is less likely to be useful.
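The check described above could be wrapped up roughly as follows. The df -P step is an addition not mentioned in the thread: it reports which filesystem backs the path, and network filesystems typically show up there as a remote export rather than as a local device listed by lsblk. The function name storage_info is hypothetical.

```shell
# Show the physical path of a directory, the filesystem backing it,
# and the rotational flag of local block devices.
# storage_info is a hypothetical helper name.
storage_info() {
    local dir="${1:-.}"
    local physical
    physical=$(cd "$dir" && pwd -P) || return 1
    echo "Physical path: $physical"
    # Filesystem and mount point for that path (assumption: df is available).
    df -P "$physical"
    # ROTA=1 means a rotational disk (HDD); ROTA=0 means non-rotational (SSD/NVMe).
    if command -v lsblk > /dev/null 2>&1; then
        lsblk --output NAME,TYPE,ROTA,MOUNTPOINT || true
    fi
}
```

Running storage_info on the directory holding the BAM files and matching the df mount point against the MOUNTPOINT column should show whether the data sits on a local HDD or SSD. If the mount point has no matching lsblk row, the data is probably on network storage.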

I'm less familiar with bsub - that's an IBM LSF scheduler command, rather than SLURM? I'll have a look at the manual and see what I can work out from here, but my initial impression is that if you're running one superSTR command per job, then the resource specification there is too high - you only need 1 CPU (-n 1). This should allow the scheduler to run more jobs if you're subject to a CPU limit, and increase your total throughput on your system by a factor of 6.

From what I can tell, the -M option is specifying 72 MB of memory, which should be enough but can probably be bumped a bit higher (say to 100000), because you're unlikely to be impacted by memory demand on the scheduler; I'm less certain about the rusage setting.
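As a sketch of the one-job-per-BAM approach with a single CPU each, something like the following could generate the submissions. It is untested against a real LSF cluster. The superstr arguments are copied from the command quoted earlier in this thread; the function name submit_bams, the job names, the log paths, and the directory layout are assumptions for illustration; and the unit of -M depends on how the site's LSF is configured. By default it only prints the bsub commands so they can be checked first.

```shell
# Submit (or, by default, just print) one single-CPU LSF job per BAM file.
# submit_bams, the job names and the log paths are hypothetical choices.
# Paths containing spaces are not handled in this sketch.
submit_bams() {
    local bam_dir="$1"
    local out_dir="$2"
    local bam sample job
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue     # skip if the glob matched nothing
        sample=$(basename "$bam" .bam)
        # superstr arguments as quoted in this thread.
        job="superstr mode=bam -o ${out_dir}/${sample}_out -t 0.64 ${bam}"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show what would be submitted.
            echo "bsub -n 1 -M 100000 -J superstr_${sample} -o ${out_dir}/${sample}.log \"${job}\""
        else
            bsub -n 1 -M 100000 -J "superstr_${sample}" -o "${out_dir}/${sample}.log" "$job"
        fi
    done
}
```

Calling submit_bams /path/to/bams /path/to/output prints the commands; setting DRY_RUN=0 submits them. Requesting one CPU per job leaves it to the scheduler to run as many jobs concurrently as the user's CPU limit allows.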

bahlolab / superSTR

superstr taking ~6 hours to process a 80GB BAM file #20