This is a batch process performing MC segmentation on 40 virtual slides in the directory. After starting the batch, I noticed 7 jobs running, even though these are GPU jobs submitted with the arguments "--partition=gpu --gres=gpu:a100:1 --cpus-per-task=8". Two jobs failed immediately with CUDA OOM; the rest appear to be running. However, on closer inspection, only 3 of the 7 "running" jobs are actually performing segmentation (i.e., using a GPU); the other 4 are waiting for GPU availability. This makes sense, since we have 3 GPUs available in the pinaki.sarder-dsa group. Also, single-GPU jobs seem faster (~5 min per slide) than what I used to observe in Pubcontainers, though this needs better quantification.
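A rough sketch of what this kind of per-slide batch submission might look like, assuming hypothetical names (SLIDE_DIR, segment.sh, the .svs extension); only the sbatch flags are taken from the notes above, with --gres written in the standard SLURM form:

```shell
# submit_mc_batch: submit one GPU segmentation job per virtual slide in a
# directory. This is a sketch, not the actual pipeline script; segment.sh
# is a placeholder for whatever wraps the Apptainer/segmentation call.
submit_mc_batch() {
  local slide_dir="$1"
  local submit="${SUBMIT:-sbatch}"   # set SUBMIT=echo for a dry run
  local slide
  for slide in "$slide_dir"/*.svs; do
    [ -e "$slide" ] || continue      # skip if the glob matched nothing
    "$submit" --partition=gpu \
              --gres=gpu:a100:1 \
              --cpus-per-task=8 \
              --job-name="mcseg-$(basename "$slide" .svs)" \
              segment.sh "$slide"
  done
}
```

Running it with SUBMIT=echo just prints the sbatch command lines, which is a handy way to sanity-check the arguments before actually queueing 40 jobs.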
[Figure: batch workflow — WSI directory → Batch process → Model; two failed jobs (CUDA OOM). Apptainer image: sayatmimar_compreps_multic-hpg_1.sif]