MonashBioinformaticsPlatform / RNAsik-pipe

RNAsik - more than just a pipeline
https://monashbioinformaticsplatform.github.io/RNAsik-pipe/
Apache License 2.0

Memory requirements for makeSTARindex should be explicitly specified #7

Closed pansapiens closed 6 years ago

pansapiens commented 6 years ago

This is important for running on HPC queues (e.g. SLURM). Setting it to the default value used for alignment might be sufficient.

A quick look suggested index generation on GRCm38 consumed ~26 GB RAM, but may have peaked higher.

i.e., for 64 GB RAM (over-allocating, but safe), specify:

task(!fastaRef.isEmpty(), genomeIdxFiles <- fastaRef, cpus := threads, mem := 68719476736, taskName := "Making STAR index")
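For reference, the mem value above is 64 GiB written out in bytes. A minimal Python sketch of the arithmetic follows; the helper name gib_to_bytes is made up for illustration and is not part of RNAsik or BDS.

```python
def gib_to_bytes(gib: int) -> int:
    """Convert gibibytes to bytes (1 GiB = 1024**3 bytes)."""
    return gib * 1024 ** 3


# 64 GiB expressed in bytes, matching the mem := value in the task above
print(gib_to_bytes(64))  # 68719476736
```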

This probably needs to be generalized in sik.config to allow memory settings for each task that might consume more than ~4 GB of RAM (index generation, alignment, mark duplicates), with sensible defaults.

serine commented 6 years ago

Not sure I understand this one. Shouldn't the user be able to specify any amount of memory through the -memory flag on the command line?

pansapiens commented 6 years ago

For BDS to correctly set the memory for an sbatch/qsub job, it needs to be specified for the BDS task, not the whole pipeline. Each task is its own job on the queue, so it makes sense to explicitly set memory for any of the tasks that will require significant memory. This is independent of any command line memory options STAR, picard/JVM etc. might have (ideally the BDS task mem setting should be slightly higher than any memory setting the tool itself uses).
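The "slightly higher than the tool's own limit" rule can be sketched as below. This is only an illustration in Python: the function name task_mem_bytes and the 10% headroom are assumptions, not values taken from RNAsik or BDS.

```python
GIB = 1024 ** 3


def task_mem_bytes(tool_mem_bytes: int, headroom_percent: int = 10) -> int:
    """Memory to request for the queue job: the tool's own limit plus headroom.

    headroom_percent is an assumed safety margin; integer arithmetic keeps
    the result exact.
    """
    if tool_mem_bytes <= 0:
        raise ValueError("tool_mem_bytes must be positive")
    return tool_mem_bytes + tool_mem_bytes * headroom_percent // 100


# Example: the tool is limited to 32 GiB, so the job asks for a bit more
print(task_mem_bytes(32 * GIB))
```

The point is only that the scheduler limit should sit above the tool limit, so the job is not killed when the tool reaches its own ceiling.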

serine commented 6 years ago

@pansapiens sorry, I've been busy and it took a while to get back to this issue. Is this not what you need: https://github.com/MonashBioinformaticsPlatform/RNAsik-pipe/blob/2c19a5dfd8c17238a75f036a821e5a16835bcbfc/src/sikSTARaligner.bds#L89 ?

Because STAR memory gets set exactly here..

Or have I fixed this because of this comment and forgot to close this issue?

pansapiens commented 6 years ago

Ah, I see - just setting -memory would solve the immediate issue with STAR index generation, but this exposes a set of related issues (especially with regard to running on an HPC job queue).

This -memory setting is used for both STAR and BWA, which have fairly different RAM requirements. I guess with a name like -memory I'd expected this to be a setting that somehow applies to the whole run. Since it really only applies to STAR and BWA, I think -memory should be renamed -alignerMemory, or even better split into -starMemory and -bwaMemory (so you can't accidentally switch aligners but forget to change the -alignerMemory setting).
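To illustrate why splitting the flag helps, here is a small Python sketch using argparse. It is not RNAsik code: the -aligner flag, the parser, and the helper aligner_memory are hypothetical, and only the -starMemory and -bwaMemory names come from the proposal above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical command line with one memory flag per aligner."""
    parser = argparse.ArgumentParser(prog="rnasik-sketch")
    parser.add_argument("-aligner", choices=["star", "bwa"], default="star")
    parser.add_argument("-starMemory", type=int, default=None,
                        help="memory for STAR, in bytes")
    parser.add_argument("-bwaMemory", type=int, default=None,
                        help="memory for BWA, in bytes")
    return parser


def aligner_memory(args: argparse.Namespace):
    """Return the memory setting that belongs to the selected aligner."""
    return args.starMemory if args.aligner == "star" else args.bwaMemory


# Switching to BWA while only -starMemory is set yields None, so the stale
# STAR value is never silently reused for BWA.
args = build_parser().parse_args(["-aligner", "bwa", "-starMemory", "34359738368"])
print(aligner_memory(args))
```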

For proper utilisation on a cluster, every task should have its mem := option set explicitly (with sensible defaults), otherwise any task that consumes more RAM than the default job limit on that cluster will be killed (e.g. SLURM's default value for the --mem setting might be only 4 GB). Also, small tasks may needlessly wait in the queue for more RAM than they really need when there is a node with a few cores and a small amount of RAM actually available. The bds.config mem setting can be used as a fallback default value for tasks without mem := specified, but it's better not to rely on this.
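The lookup order described here (explicit per-task setting, then a per-task default, then a global fallback) might look roughly like the Python sketch below. The task names, the resolve_mem helper, and all numbers other than the 64 GiB index figure mentioned earlier in this thread are placeholders, not measured or recommended values.

```python
GIB = 1024 ** 3

# Hypothetical per-task defaults; only the 64 GiB index figure comes from
# this thread, the rest are placeholders.
TASK_MEM_DEFAULTS = {
    "star_index": 64 * GIB,
    "star_align": 32 * GIB,
    "mark_duplicates": 8 * GIB,
}

# Stands in for the global bds.config mem setting (assumed value).
FALLBACK_MEM = 4 * GIB


def resolve_mem(task_name: str, overrides=None) -> int:
    """Pick memory for a task: user override, then task default, then fallback."""
    overrides = overrides or {}
    if task_name in overrides:
        return overrides[task_name]
    return TASK_MEM_DEFAULTS.get(task_name, FALLBACK_MEM)


print(resolve_mem("star_index"))                       # per-task default
print(resolve_mem("star_index", {"star_index": GIB}))  # explicit override wins
print(resolve_mem("fastqc"))                           # unknown task uses fallback
```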

serine commented 6 years ago

Just want to add a couple of general comments here. @pansapiens and I have spoken offline and online about memory allocation issues. I'll prioritise this now and attempt to implement the mem setting for every task, or at least for every "major" task.

Now it turns out that if you attempt to run RNAsik on a machine with too few resources to start the STAR aligner, then BDS behaviour is somewhat unexpected, or at least this is how I'm interpreting it. If there isn't enough compute resource for a particular task, RNAsik (bds) will skip that task and go to the next available task that meets the resource and dependency requirements.

This is a "grey" area of RNAsik where not every task has dependencies set up, mainly because it was either hard to do OR there were no real dependencies.

This shouldn't affect those who run RNAsik with the right system requirements, that is at least 30 GB of RAM and 4 cpus (this is based on human/mouse; for species with smaller genomes, less RAM will be required).