NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

H100 GPU docker container exit 137 #125

Open nusaputra137 opened 10 months ago

nusaputra137 commented 10 months ago

Hello,

I am trying to run dcgm-exporter on my instance, which has 8 NVIDIA H100 GPUs, but my Docker container keeps exiting with code 137. I tried setting the memory limit to 1 GB, and it still exits with the same code. Is anyone else having this issue?

NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2

Docker version 20.10.21, build baeda1f

docker run -d --gpus all -m 1g -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.2.5-3.1.8-ubuntu20.04
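
For anyone triaging this: by shell convention, an exit status above 128 means the process was killed by a signal, so 137 corresponds to 128 + 9 (SIGKILL). With a memory limit set, that is often the kernel OOM killer, though it is not the only possible cause. Below is a minimal sketch for decoding the status and asking Docker whether it recorded an OOM kill. The function name `decode_exit_status` and the `<container-id>` placeholder are hypothetical and not part of the original report.

```shell
# Decode an exit status: values above 128 mean "killed by signal (status - 128)".
# decode_exit_status is a hypothetical helper name used only for illustration.
decode_exit_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
  else
    echo "exited with status $status"
  fi
}

decode_exit_status 137   # prints "killed by signal 9" (SIGKILL)

# To check whether Docker recorded an out-of-memory kill, inspect the stopped
# container. <container-id> is a placeholder for the ID printed by `docker run -d`.
# docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' <container-id>
```

If `OOMKilled` reports `true`, the 1 GB limit is probably too low for this workload; if it reports `false`, the SIGKILL came from somewhere else and the container logs (`docker logs <container-id>`) are the next place to look.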