NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Limitations in Using Multiple MIG Instances in a Container #446

Open tunahanertekin opened 10 months ago

tunahanertekin commented 10 months ago

Hi,

I am curious how using multiple MIG instances differs from using multiple non-MIG GPUs (such as V100) in terms of parallelism, memory sharing, etc. I didn't get the same outputs in processes such as:

Any kind of help is appreciated.

github-actions[bot] commented 6 months ago

This issue has become stale and will be closed automatically within 30 days if no activity is recorded.

jiusi9 commented 1 month ago

Hi, I have the same issue. Was there a solution?

deepakdeore2004 commented 3 weeks ago

I am seeing the same problem, where Docker shows only the GPU. Has anyone found the cause?

# docker run --rm -it --gpus '"device=MIG-eb5ec582-a562-505c-8bf6-72a255d3360f,MIG-2e532e9b-c8ac-5a44-ab5f-368b2fc09522"' ubuntu nvidia-smi --query-gpu=uuid --format=csv
uuid
GPU-43ff88d4-81a0-5fd6-e5d9-7c0d43abf38e

klueska commented 3 weeks ago

That is expected if both of your MIG devices are on the same underlying GPU. What is the output of:

docker run --rm -it \
  --gpus '"device=MIG-eb5ec582-a562-505c-8bf6-72a255d3360f,MIG-2e532e9b-c8ac-5a44-ab5f-368b2fc09522"' \
  ubuntu nvidia-smi -L

deepakdeore2004 commented 3 weeks ago

Thanks for the quick reply, here is the output:

# docker run --rm -it \
  --gpus '"device=MIG-eb5ec582-a562-505c-8bf6-72a255d3360f,MIG-2e532e9b-c8ac-5a44-ab5f-368b2fc09522"' \
  ubuntu nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-43ff88d4-81a0-5fd6-e5d9-7c0d43abf38e)
  MIG 1g.20gb     Device  0: (UUID: MIG-eb5ec582-a562-505c-8bf6-72a255d3360f)
  MIG 1g.20gb     Device  1: (UUID: MIG-2e532e9b-c8ac-5a44-ab5f-368b2fc09522)

klueska commented 3 weeks ago

So it seems to be working as expected then.
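
For anyone who wants to check this programmatically: `--query-gpu=uuid` reports only the parent GPU, while `nvidia-smi -L` also lists the MIG devices under it. Below is a minimal sketch that groups MIG UUIDs by parent GPU from `nvidia-smi -L` text. The helper name `group_mig_by_gpu` and the regular expressions are illustrative assumptions based only on the output format shown in this thread; they are not part of nvidia-smi or the device plugin, and other driver versions may print a different layout.

```python
import re
from typing import Dict, List

# Patterns written against the `nvidia-smi -L` output posted in this thread.
# Assumption: other driver versions may format these lines differently.
GPU_LINE = re.compile(r"^GPU \d+: .*\(UUID: (GPU-[0-9a-f-]+)\)")
MIG_LINE = re.compile(r"^\s+MIG \S+\s+Device\s+\d+: \(UUID: (MIG-[0-9a-f-]+)\)")


def group_mig_by_gpu(listing: str) -> Dict[str, List[str]]:
    """Map each parent GPU UUID to the MIG device UUIDs listed beneath it."""
    groups: Dict[str, List[str]] = {}
    current = None  # UUID of the GPU whose MIG lines we are currently reading
    for line in listing.splitlines():
        gpu = GPU_LINE.match(line)
        if gpu:
            current = gpu.group(1)
            groups.setdefault(current, [])
            continue
        mig = MIG_LINE.match(line)
        if mig and current is not None:
            groups[current].append(mig.group(1))
    return groups


# Output copied from the comment above.
SAMPLE = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-43ff88d4-81a0-5fd6-e5d9-7c0d43abf38e)
  MIG 1g.20gb     Device  0: (UUID: MIG-eb5ec582-a562-505c-8bf6-72a255d3360f)
  MIG 1g.20gb     Device  1: (UUID: MIG-2e532e9b-c8ac-5a44-ab5f-368b2fc09522)
"""

groups = group_mig_by_gpu(SAMPLE)
for gpu_uuid, mig_uuids in groups.items():
    print(gpu_uuid, "has", len(mig_uuids), "MIG device(s) visible")
```

With the sample above, both MIG devices end up under the single H100, which matches the single UUID returned by `--query-gpu=uuid`.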

klueska commented 3 weeks ago

Note, however, that even if you have multiple MIG instances available to your container, CUDA will only use the first one it discovers. There is no support (yet) for running a single CUDA context with multiple MIG devices.
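
One common way to work within this limitation is to run one process per MIG device and pin each process to its device. The sketch below assumes the installed driver accepts MIG UUIDs in `CUDA_VISIBLE_DEVICES`; the helper names `env_for_mig` and `launch_per_mig` are hypothetical, and the worker command is a stand-in that only prints the variable it receives rather than a real CUDA program.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional


def env_for_mig(mig_uuid: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and restrict CUDA to a single MIG device.

    Assumption: the driver accepts a MIG UUID as a CUDA_VISIBLE_DEVICES value.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = mig_uuid
    return env


def launch_per_mig(mig_uuids: List[str], command: List[str]) -> List[subprocess.Popen]:
    """Start one copy of `command` per MIG UUID, each seeing only its own device."""
    return [subprocess.Popen(command, env=env_for_mig(uuid)) for uuid in mig_uuids]


if __name__ == "__main__":
    # UUIDs taken from the nvidia-smi -L output earlier in this thread.
    uuids = [
        "MIG-eb5ec582-a562-505c-8bf6-72a255d3360f",
        "MIG-2e532e9b-c8ac-5a44-ab5f-368b2fc09522",
    ]
    # Stand-in worker: replace with the actual CUDA workload.
    worker = [sys.executable, "-c",
              "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    for proc in launch_per_mig(uuids, worker):
        proc.wait()
```

Each worker then behaves like a single-device CUDA program, so any coordination between the MIG instances has to happen between processes rather than inside one CUDA context.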