SarderLab / girder_worker

Distributed task execution engine with Girder integration, developed by Kitware
http://girder-worker.readthedocs.io/
Apache License 2.0

Batch process Multi Compartment Segmentation #3

Open anindya-paul opened 4 months ago

anindya-paul commented 4 months ago

[Attached images: WSI directory; Batch process; Failed job: CUDA OOM (two images); Apptainer Image: sayatmimar_compreps_multic-hpg_1.sif; Model]

This is a batch process that performs MC segmentation on the 40 virtual slides in the directory. After starting the batch process, I noticed that 7 jobs started running, even though this is a GPU job submitted with the arguments "--partition=gpu --gres=gres:gpu:a100:1 --cpus-per-task=8". Two jobs failed immediately because of CUDA OOM, and the rest appear to be running. Although 7 jobs are shown as running, on closer inspection only 3 are actually performing segmentation (that is, using GPUs), and the other 4 are waiting for a GPU to become available. This makes sense, as we have 3 GPU resources available in the pinaki.sarder-dsa group. Also, single-GPU jobs seem faster (~5 mins per slide) than what I used to observe in Pubcontainers, though this needs better quantification.
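As a rough back-of-the-envelope check on these numbers, the sketch below estimates the wall-clock time for the whole batch when only 3 slides can be segmented at once. It assumes every slide takes the approximate ~5 minutes observed above, and it ignores queue wait and the failed OOM jobs. The function name `estimate_batch_minutes` is hypothetical and is not part of girder_worker or the plugin.

```python
import math


def estimate_batch_minutes(num_slides, num_gpus, minutes_per_slide):
    """Rough wall-clock estimate when at most num_gpus slides run at once.

    Assumes every slide takes the same time and a GPU is reused as soon
    as the previous slide finishes (no queue wait, no failed jobs).
    """
    # Number of "waves" of concurrent jobs needed to cover all slides.
    waves = math.ceil(num_slides / num_gpus)
    return waves * minutes_per_slide


# 40 slides, 3 GPUs, ~5 minutes per slide (approximate figure from above).
print(estimate_batch_minutes(40, 3, 5))  # 70
```

Under these assumptions the 40 slides need 14 waves of up to 3 concurrent jobs, so about 70 minutes in total. Actual run time will vary with slide size and GPU availability.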

suhasthegame commented 4 months ago

This issue pertains to HPCC allocation and to how the HPC cluster manages resources. It is not related to the application or the plugin.