CoBrALab / optimized_antsMultivariateTemplateConstruction

A re-implementation of antsMultivariateTemplateConstruction2.sh using optimized image pyramid scale-space and qbatch support

Jobs failing due to not requesting enough resources in SLURM #107

Closed · j97kang closed 3 months ago

j97kang commented 3 months ago

Hello, I hope you guys are doing well! I'm currently using the qbatch script and I'm running into some issues when I configure it as below:

```sh
export QBATCH_PPJ=32                 # requested processors per job
export QBATCH_CHUNKSIZE=$QBATCH_PPJ  # commands to run per job
export QBATCH_CORES=$QBATCH_PPJ      # commands to run in parallel per job
export QBATCH_MEM="32G"              # requested memory per job
export QBATCH_MEMVARS="mem"          # memory request variable to set
export QBATCH_SYSTEM="slurm"         # queuing system to use ("pbs", "sge", "slurm", or "local")
export QBATCH_SHELL="/bin/sh"
```

When these lines aren't commented out, it seems that the PATH isn't exported, so the modules don't get through to the batch calls. I'm currently using the ANTs multivariate template construction code. Would love to get some insight on getting information passed over to the batch calls. What would the fix be? Thanks! Appreciate it a lot!
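To illustrate, a quick diagnostic along these lines (just a sketch, assuming standard SLURM tooling) shows what I mean:

```sh
# Sketch: submit a trivial job and compare its environment to the login shell.
sbatch --time=00:05:00 --wrap 'echo "PATH=$PATH"; which antsRegistration; which qbatch'
# Inspect the resulting slurm-<jobid>.out: if antsRegistration or qbatch
# resolves in the login shell but is "not found" inside the job, the
# environment is not reaching the batch calls.
```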

gdevenyi commented 3 months ago

Is this a support request for qbatch? Please use the qbatch GitHub repo and provide a complete example of your setup, the command you are running and the output you get.

Similarly, if this is for this repo, provide the same.

j97kang commented 3 months ago

I'm calling modelbuild.sh! Currently the outputs are the batch job submissions, and under logs I see calls to ants_registration_Syn.sh, but none of them actually run. It seems like the jobs are dispatched very quickly, the walltime limits get hit, and the dependent jobs then fail.

gdevenyi commented 3 months ago

Please provide the exact commands you are running, and the contents of ${outputdir}/logs

j97kang commented 3 months ago

The exact command that I'm running is this:

```sh
./optimized_antsMultivariateTemplateConstruction-master/modelbuild.sh --output-dir ./output_dir files.txt
```

with these exports:

```sh
cd /path/to/dir
export PATH=/path/to/ants-2.5.2/bin:$PATH
source ./venv/bin/activate
module load python/3.9.2_torch_gpu
module load parallel/20210322
```

The "venv" is where the latest version of qbatch is installed.
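For reference, a quick sanity check like this (a sketch; the expected paths follow from the exports above) confirms everything resolves in the login shell:

```sh
# Sketch: verify the tools resolve after the exports above.
which antsRegistration   # expected: /path/to/ants-2.5.2/bin/antsRegistration
which qbatch             # expected: the venv's bin/qbatch
qbatch --help > /dev/null && echo "qbatch runs OK"
```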

The output contents are the job submissions under logs (screenshot attached).

As you can see, the images are all created at the same time, which shouldn't be happening since every subject needs to generate its own images. The commands for multiple patients are called within seconds. The stages shouldn't proceed without the previous ones finishing, and it's happening too quickly, as shown in the attached screenshot. I've also attached the qbatch submission file for the env copy exports: modelbuild_2024-06-25_22-20-40-UTC_rigid_0_reg.array.txt

gdevenyi commented 3 months ago

Please provide the entirety of the contents of ${outputdir}/logs. A screenshot is not appropriate.

Please see https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/attaching-files for details on how to attach files to an issue.

j97kang commented 3 months ago

logs.zip

I've attached the logs!

gdevenyi commented 3 months ago

I have renamed the issue to accurately reflect what is happening.

At the end of your logs, you can find details on why things are failing:

```
slurmstepd: error: *** JOB 10938887 ON cn109 CANCELLED AT 2024-06-25T19:06:42 DUE TO TIME LIMIT ***
slurmstepd: error: Detected 25 oom_kill events in StepId=10938887.batch. Some of the step tasks have been OOM Killed.
```

So, the jobs as submitted are overloading the machines they are running on. Commands are being killed due to out-of-memory, and insufficient walltime is allotted for the commands to complete.
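You can quantify this with SLURM's accounting tools, e.g. (a sketch; available fields and state names depend on your SLURM version):

```sh
# Sketch: compare what the failed job used against what it requested.
# The job ID is taken from the slurmstepd errors above.
sacct -j 10938887 --format=JobID,State,Elapsed,Timelimit,ReqMem,MaxRSS,AllocCPUS
# TIMEOUT/OUT_OF_MEMORY states, Elapsed hitting the Timelimit, and MaxRSS
# near ReqMem all indicate the requests were too small for the workload.
```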

Your current QBATCH config:

```sh
export QBATCH_PPJ=32                 # requested processors per job
export QBATCH_CHUNKSIZE=$QBATCH_PPJ  # commands to run per job
export QBATCH_CORES=$QBATCH_PPJ      # commands to run in parallel per job
export QBATCH_MEM="32G"              # requested memory per job
```

This runs 32 registrations in parallel in a single job with only 32 GB of RAM available, i.e. roughly 1 GB per registration.

You are not specifying any job submission details, so you are getting the default walltimes:

```
--walltime-short: Walltime for short running stages (averaging, resampling) (default: '00:30:00')
--walltime-linear: Walltime for linear registration stages (default: '0:45:00')
--walltime-nonlinear: Walltime for nonlinear registration stages (default: '4:30:00')
```
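If you keep your current job packing, you would at minimum need to raise these, e.g. (a sketch; the values are illustrative, not tuned for your data or cluster):

```sh
# Sketch: rerun with longer walltimes for the registration stages.
./optimized_antsMultivariateTemplateConstruction-master/modelbuild.sh \
  --walltime-linear 2:00:00 \
  --walltime-nonlinear 12:00:00 \
  --output-dir ./output_dir files.txt
```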

So, you need to break up the job requests so that more resources are available per command.

Options

  1. Reduce QBATCH_CHUNKSIZE and QBATCH_CORES: fewer commands will be allocated per job request, allowing more CPUs and memory per request (see the sketch after this list)
  2. Increase QBATCH_MEM: more memory will be made available per command. You will probably need to increase --walltime-linear as well in this case
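As a concrete sketch of option 1, keeping your 32-CPU/32G job shape but packing fewer commands into it (the values are illustrative, not a recommendation for your cluster):

```sh
# Sketch of option 1: fewer commands per job, so each command gets more resources.
export QBATCH_PPJ=32        # still request 32 processors per job
export QBATCH_CHUNKSIZE=4   # pack only 4 commands into each job
export QBATCH_CORES=4       # run those 4 in parallel: 32/4 = 8 CPUs per command
export QBATCH_MEM="32G"     # 32G shared by 4 commands = ~8G per command
```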

I do not know your SLURM or cluster configuration, or the resolution of your input files, all of which affect these variables. You may find it useful to produce a dummy dataset (~2 representative images), set QBATCH to the maximal job allocation (QBATCH_PPJ at the maximum your cluster supports, QBATCH_MEM at the maximum your cluster supports, QBATCH_CHUNKSIZE=1), and examine the resource usage of the jobs afterwards in order to choose a memory and CPU request and decide how to chunk your jobs; see the sketch below.

qbatch offers lots of flexibility in terms of whether you wish to be memory-packed or CPU-packed while staying within the QoS of your cluster. Remember that ANTs will use the maximum number of CPUs it sees by default; deep inside the script, the QBATCH_PPJ/QBATCH_CORES ratio is honoured to choose the number of CPUs each command uses. Runtime scales approximately linearly with the number of CPUs assigned to ANTs.
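A sketch of that probing run (dummy_files.txt, the output directory, and the resource maxima are placeholders for your setup):

```sh
# Sketch: maximal per-job allocation, one command per job, then inspect usage.
export QBATCH_PPJ=64        # placeholder: your cluster's per-node CPU maximum
export QBATCH_CORES=$QBATCH_PPJ
export QBATCH_CHUNKSIZE=1   # one command per job
export QBATCH_MEM="128G"    # placeholder: your cluster's per-node memory maximum

./modelbuild.sh --output-dir ./probe_output dummy_files.txt  # ~2 representative images

# Afterwards, see what the jobs actually consumed:
sacct --starttime today --format=JobID,Elapsed,MaxRSS,AllocCPUS,State
```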

j97kang commented 3 months ago

Thank you so much! This helped a lot and fixed our issue! Hope you have a wonderful rest of the week.