bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

option to run controller job on different partition #3604

Open vdejager opened 2 years ago

vdejager commented 2 years ago

As described in the docs: https://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#ipython-parallel

This would submit a launcher job to queue "priority", the controller job to "medium", and the worker jobs to "medium" as well. Is it possible to run the controller job on a different partition than the worker jobs? Our cluster has a specific queue for small core-count jobs like controller jobs.
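For reference, the submission pattern from the linked docs looks roughly like the sketch below (a minimal sketch, not the docs' exact example: partition names, walltime, core count, and memory values are site-specific placeholders):

```bash
#!/bin/bash
# Launcher job: submitted to the small "priority" partition. The bcbio command
# it runs then submits the controller and the worker jobs to the "medium"
# partition via -q; there is no separate flag to send the controller elsewhere.
#SBATCH --partition=priority
#SBATCH --job-name=bcbio-launcher
#SBATCH --time=5-00:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G

# -n is the total core count across all workers; -r conmem sets the controller
# memory in GB (see the reply below); timelimit/conmem values are placeholders.
bcbio_nextgen.py ../config/project.yaml -t ipython -s slurm -q medium \
    -n 64 -r conmem=20 -r timelimit=2-00:00:00 --timeout 300 --retries 1
```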

naumenko-sa commented 2 years ago

No, unfortunately, there is no such option.

However, only the launcher (the job that goes to the priority queue) is small; the controller job can require 20 GB of RAM or more in some cases (-r conmem 20).

vdejager commented 2 years ago

Thanks. It turns out that I could use the --local_controller option with -r conmem 16 for the launcher job. Apparently the node on which the launcher/controller job lands reserves 16 GB per core. This was enough for my job.

I'm charged per 0.25 node (one node is 128 cores/256 GB). For the workers I use the cores=32 option in combination with -n 256 to get two full "nodes" (possibly on different physical machines). Each worker job has 64 GB available. This gives a higher chance that jobs will start right away.
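In case it is useful to anyone else, the launcher script for this setup looks roughly like the sketch below (paths, partition names, and walltime are placeholders; the 32-cores-per-worker setting comes from my bcbio/cluster configuration rather than from this command line):

```bash
#!/bin/bash
# Launcher submitted to the priority partition; with --local_controller the
# controller runs within this allocation (16 GB via -r conmem=16), so only
# the worker jobs are submitted to the "medium" partition.
#SBATCH --partition=priority
#SBATCH --job-name=bcbio-launcher
#SBATCH --time=5-00:00:00
#SBATCH --cpus-per-task=1

# -n 256 with 32 cores per worker -> 8 worker jobs of 64 GB each,
# i.e. two full 128-core/256 GB nodes split into quarter-node allocations.
bcbio_nextgen.py ../config/project.yaml -t ipython -s slurm -q medium \
    -n 256 -r conmem=16 --local_controller --timeout 300 --retries 1
```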

Would there be an advantage to using the --exclusive SLURM option to claim full physical nodes and bumping the cores setting to 128?

naumenko-sa commented 2 years ago

Thanks for digging out the --local_controller option! I've now documented it here: https://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#ipython-parallel

I think for a typical analysis, which is I/O-intensive rather than compute-intensive, using fewer physical nodes would restrict your total I/O bandwidth.

Of course, it depends on your particular environment. On a cluster with very good storage and high per-node I/O bandwidth to the storage system, you would rather have 8 independent workers (256/32 = 8) than 2 (256/128 = 2), i.e. 8 writing threads instead of 2. On a cluster with limited I/O bandwidth you may end up sharing a node with somebody else's I/O-demanding job, and your job chokes on the shared channel; there you might want a node exclusive to your job to guarantee the I/O bandwidth.

So it needs some profiling in your particular use case + data type + cluster + cluster load.

Not sure if the admins would allow you to use SAR, but at the least you can measure the runtime for 8 workers vs 2 workers. Please share the results of such a benchmark if you get to do it!
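If SAR is off the table, SLURM accounting can give a rough comparison, e.g. something like the following (job IDs are placeholders; the disk read/write columns are only filled if the accounting gather plugin collects I/O):

```bash
# Compare wall-clock time, peak memory and (if collected) disk I/O for the
# 8-worker and 2-worker runs; replace the job IDs with the real ones.
sacct -j 1234567,1234568 \
      --format=JobID,JobName,Partition,Elapsed,MaxRSS,AveDiskRead,AveDiskWrite
```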

Also, I think most aligners/callers don't scale well to high core counts: 32 cores may give you performance similar to 128 cores, which is another argument in favour of 8 workers.

If you need high processing speed for many samples, go for Dragen, which is also available on AWS.

vdejager commented 2 years ago

Running SAR is not an option. I'll try to profile the workers. Another option to test will be targeting the jobs at high-memory nodes and adjusting the Java parameters. For now, time is not really an issue, and since we only have a limited number of samples, running Dragen on AWS would only eat the budget (it's hard to justify when subsidized computing time is available).

One more thing: is there a script to dissect the log file to get individual timings for alignment, variant calling, and the other steps?

naumenko-sa commented 2 years ago

No, I just parsed it on the command line when I needed to. Roughly 50% of the variant calling pipeline is bwa alignment, about 10% is variant calling per se (mutect2), and the rest is annotation etc.
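For anyone who wants to script it, here is a minimal sketch (it assumes the "Timing:" checkpoint lines and the [YYYY-MM-DDTHH:MMZ] timestamp prefix that bcbio writes to log/bcbio-nextgen.log, and GNU date; adjust the path and parsing to your log):

```bash
# Print each pipeline checkpoint with the minutes elapsed since the previous
# one; the final checkpoint has no duration because it marks the endpoint.
grep "Timing:" log/bcbio-nextgen.log \
  | sed 's/^\[\(.*\)Z\].*Timing: */\1 /' \
  | while read -r ts stage; do
      now=$(date -d "$ts" +%s)            # requires GNU date
      if [ -n "$prev" ]; then
        printf '%4d min  %s\n' $(( (now - prev) / 60 )) "$prev_stage"
      fi
      prev=$now
      prev_stage=$stage
    done
```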