johrstrom opened 5 months ago
I will note that the ability to tie a given GPU to a job is something very specific to OSC and does not come from exporters like the one NVIDIA provides. We have a job prolog that writes out a static metric so that we can use PromQL to tie a job to the GPUs assigned to it. The tools from the community and NVIDIA can only monitor the whole node's GPUs.
Thanks. I've commented on the Discourse thread asking if they can share how they do it. Maybe there's some documentation we could add for this.
Don't know if you're following the Discourse post, but it seems they built their own exporter, https://github.com/plazonic/nvidia_gpu_prometheus_exporter/tree/master, along with some additional prolog and epilog scripts.
Ah, a custom exporter; we just use the one from NVIDIA: https://github.com/NVIDIA/dcgm-exporter.
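For anyone trying to reproduce this, here is a minimal sketch of running dcgm-exporter on a GPU node. It assumes Docker with the NVIDIA Container Toolkit, and the image tag is a placeholder for a current release:

# Run NVIDIA's dcgm-exporter as a container; replace <tag> with a current release tag.
docker run -d --rm --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>

# Per-GPU metrics such as DCGM_FI_DEV_GPU_UTIL are then available for Prometheus to scrape:
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL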
Our prolog script:
if [ "x${CUDA_VISIBLE_DEVICES}" != "x" ]; then
GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
cat > $GPU_INFO_PROM.$$ <<EOF
# HELP slurm_job_gpu_info GPU Assigned to a SLURM job
# TYPE slurm_job_gpu_info gauge
EOF
OIFS=$IFS
IFS=','
for gpu in $CUDA_VISIBLE_DEVICES ; do
echo "slurm_job_gpu_info{jobid=\"${SLURM_JOB_ID}\",gpu=\"${gpu}\"} 1" >> $GPU_INFO_PROM.$$
done
IFS=$OIFS
/bin/mv -f $GPU_INFO_PROM.$$ $GPU_INFO_PROM
fi
exit 0
The metrics are written to a directory that node_exporter's textfile collector picks up.
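For context, a minimal sketch of that node_exporter setup, assuming the directory below matches $METRICS_DIR from the prolog (the path is just an example):

# Point node_exporter's textfile collector at the directory the prolog writes to.
# The path here is an example; it must match $METRICS_DIR on the compute nodes.
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile

# A job assigned GPUs 0 and 1 would then expose series like:
#   slurm_job_gpu_info{jobid="123456",gpu="0"} 1
#   slurm_job_gpu_info{jobid="123456",gpu="1"} 1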
Epilog:
# Remove the job's metric file so the slurm_job_gpu_info series goes away when the job ends.
GPU_INFO_PROM=${METRICS_DIR}/slurm_job_gpu_info-${SLURM_JOB_ID}.prom
rm -f $GPU_INFO_PROM
exit 0
The PromQL from our dashboards that ties a job to its assigned GPUs:
DCGM_FI_DEV_GPU_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_MEM_COPY_UTIL{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"} * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}
max((DCGM_FI_DEV_FB_FREE{cluster="$cluster",host=~"$host"} + DCGM_FI_DEV_FB_USED{cluster="$cluster",host=~"$host"}) * ON(host,gpu) slurm_job_gpu_info{jobid="$jobid"}) by (cluster)
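As a quick sanity check outside Grafana, one of these joins can be issued against the Prometheus HTTP API; the Prometheus host and label values below are placeholders:

# Query one job's GPU utilization via the Prometheus HTTP API.
curl -sG 'http://prometheus.example.org:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_DEV_GPU_UTIL{cluster="mycluster"} * ON(host,gpu) slurm_job_gpu_info{jobid="123456"}'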
Given how our Grafana integration works, I think we could integrate this now. The GPU panels are already part of the "OnDemand Clusters" dashboard we use for CPU and memory.
I think we'd just need some mechanism to show the GPU panels only when the job is a GPU job. The cluster YAML schema in OnDemand would just need to handle one or two more keys, maybe like:
cpu: 20
memory: 24
gpu-util: <num for panel>
gpu-mem: <num for panel>
Someone on Discourse is asking for GPU panel support in the ActiveJobs Grafana integration: https://discourse.openondemand.org/t/grafana-in-ood-ability-to-embed-other-panels/3575. Given the rise of AI and GPU demand, we should likely support this case, even if many or most jobs don't use GPUs.