bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

bcbio jobs on SGE crash because of running into default SGE job resource limits #2227

Closed WimSpee closed 4 years ago

WimSpee commented 6 years ago

Hi,

Recently my bcbio jobs have started crashing on an SGE cluster because they run into the default job memory limits.

By default all jobs on our cluster, unless overridden, have these memory limits: -l vf=2G,h_rss=6G
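I believe such site-wide defaults usually come from the cluster's sge_request file; they can be checked with something along these lines (the exact path depends on the SGE installation):

grep -E 'vf|h_rss' $SGE_ROOT/$SGE_CELL/common/sge_request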

In the bcbio_system.yaml configuration I use the default of 3G per CPU and set the default job size to 10 CPUs:

# default options, used if other items below are not present
# avoids needing to configure/adjust for every program
default:
  memory: 3G
  cores: 10

According to the bcbio log, this is also what is requested on the cluster:

[2018-01-17T09:44Z] clustermaster: Resource requests: bwa, sambamba, samtools; memory: 3.00, 3.00, 3.00; cores: 10, 10, 10
[2018-01-17T09:44Z] clustermaster: Configuring 10 jobs to run, using 10 cores each with 30.1g of memory reserved for each job

But the qstat -j output for the job shows that only mem_free was correctly requested by bcbio; virtual_free and h_rss still have the default limits:

hard resource_list: virtual_free=2G,h_rss=6G,mem_free=30822M
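This can be checked on any running bcbio engine job with something like the following, using the job id from the SGE messages below:

qstat -j 16956975 | grep resource_list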

This causes bcbio jobs to crash the moment the BWA + samtools sort pipe hits the 6G h_rss limit (6442450944 bytes in the log below):

01/17/2018 14:37:57.135|            main|noden24|W|could not find pid 18425 in job list
01/17/2018 16:58:47.027|            main|noden24|W|job 16956975.1 exceeds job master hard limit "h_rss" (6466166784.00000 > limit:6442450944.00000) - initiate terminate method
01/17/2018 16:58:47.860|            main|noden24|W|could not find pid 28912 in job list
01/17/2018 16:59:06.570|            main|noden24|W|job 16956975.2 exceeds job master hard limit "h_rss" (6462726144.00000 > limit:6442450944.00000) - initiate terminate method
01/17/2018 16:59:07.853|            main|noden24|W|could not find pid 29026 in job list
01/17/2018 17:00:11.975|            main|noden24|W|job 16956975.3 exceeds job master hard limit "h_rss" (6570049536.00000 > limit:6442450944.00000) - initiate terminate method
01/17/2018 17:00:12.740|            main|noden24|W|could not find pid 29210 in job list
01/17/2018 17:00:36.882|            main|noden24|W|job 16956975.4 exceeds job master hard limit "h_rss" (6569807872.00000 > limit:6442450944.00000) - initiate terminate method

I can manually prevent this issue by providing higher resource limits for all bcbio jobs at the invocation of the main bcbio script:

bcbio_nextgen.py ../config/DA_1052_01_15_samples-merged.yaml -t ipython -n 101 -s sge -q main -r vf=3G,h_rss=30G
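As far as I understand, the -r options are passed through to SGE as extra native resource requests, so each engine job should end up being submitted with roughly the equivalent of this (a sketch only; the real submission script is generated by ipython-cluster-helper, and the parallel environment name and script name here are placeholders):

qsub -q main -pe smp 10 -l mem_free=30822M,vf=3G,h_rss=30G engine_job.sh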

This wasn't necessary in the past. I don't know what changed, i.e. whether it was an update in bcbio 1.0.7 or an update to my cluster.

Would it be possible to have bcbio always override the virtual_free and h_rss limits, using the configuration from the bcbio_system.yaml file? This would be more consistent and user-friendly than having to provide the extra resource requests when invoking the main script.

Thank you.

roryk commented 6 years ago

Hi @WimSpee,

Thanks for the super nice bug report and the suggestion. Am I right in thinking that h_rss hard-caps the memory usage for a job, while virtual_free soft-caps the usage on a per-core basis? Is the idea behind these two that the entire job on a node gets its memory usage hard-capped via h_rss, but virtual_free is used instead of h_vmem because we don't care if a single core uses more memory, as long as the whole job doesn't go over?

roryk commented 6 years ago

Poking around, it looks like if we just set h_rss in addition to mem_free, this should cover your setup and do more of what we want SGE to be doing: respecting the memory limits set for the job.

WimSpee commented 6 years ago

Yes, I also think that overriding just h_rss to the same value as mem_free (plus maybe 10%) should fix the issue.

If that is not enough, you could also try overriding virtual_free to the per-CPU limit defined in bcbio_system.yaml.

I don't know what h_vmem does.

roryk commented 5 years ago

Thanks, closing this for now. Let me know if you still have to apply this workaround manually and I can add it to ipython-cluster-helper.

WimSpee commented 5 years ago

We currently still manually set the h_rss and mem_free values for the bcbio_nextgen.py command.

# Override the default SGE job memory hard limits. bcbio oversubscribes the CPUs it requests, therefore ask for extra memory so that fewer CPUs fit per machine.
-r vf=5G,h_rss=50G
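
For context, the full invocation then looks like this (same project configuration as in my first comment):

bcbio_nextgen.py ../config/DA_1052_01_15_samples-merged.yaml -t ipython -n 101 -s sge -q main -r vf=5G,h_rss=50G
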
roryk commented 5 years ago

Thanks @WimSpee, glad the workaround is working for you for the time being. I'll work on having this updated in ipython-cluster-helper on a separate branch, which I'll need you to test since I don't have access to a cluster running SGE.

roryk commented 5 years ago

Do you need to set virtual_free too to get it to work, or does just setting h_rss work?

WimSpee commented 5 years ago

We experimentally found that we need to set both vf=5G and h_rss=50G. Notice that there is a 10x difference between the two values. This is because we changed the default bcbio job size to 10 CPUs in the bcbio_system.yaml file.

vf seems to be a per-CPU (slot) value and h_rss a per-job value.
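As a rough sanity check of that scaling (assuming vf is counted per slot and h_rss per job, which is what we observe on our cluster):

# per-slot request:   vf    = 5G
# slots per job:      10    (cores in bcbio_system.yaml)
# per-job hard limit: h_rss = 10 x 5G = 50G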

I am not an expert on these settings; this is just what we found works for us on our SGE cluster and keeps the sysadmin and other users happy when we run large alignment (bwa) and variant calling (gatk4) runs (i.e. the variant2 analysis).

Without setting these values the jobs either get killed or the load on the cluster machines is too high. I am not sure how specific these settings are to our cluster, or whether a general rule can be made that applies to all (SGE) clusters.

roryk commented 5 years ago

Thanks! I think setting h_rss will make SGE do what we were actually intending, which is to not exceed the memory limit we are giving it, so I'll definitely set that and we can test; it should also help you all waste less memory. I'll read a little more about what virtual_free is doing, but you might need to keep setting that if it is something specific to your setup. One of the good things about SGE is that it is highly configurable, but that makes it hard to generalize some of these settings.
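For the record, a minimal sketch of what I have in mind for the generated SGE engine script, using the 10-core, ~3G-per-core defaults from this issue (the exact lines depend on the ipython-cluster-helper template and on the cluster's parallel environment name):

#$ -pe smp 10
#$ -l mem_free=30G,h_rss=30G
# and possibly, on clusters with a low site-wide virtual_free default:
#$ -l virtual_free=3G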