Open southquist opened 2 years ago
I've done some further testing with a second machine that also has 8 GPUs, but of a different model, and I can use all 8 cards there without any issue.
These are the cards on the machine that works:
$ lspci | grep NVIDIA
2d:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
32:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
5b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
5f:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
b5:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
be:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
df:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
e7:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
And these are the cards on the machine where I can only use the first 4 GPUs:
$ lspci | grep NVIDIA
1d:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
23:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
43:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
49:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
b4:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
ba:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
e0:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
e6:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
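To make the two listings easier to compare against GPU indexes 0-7, here is a minimal sketch that numbers the NVIDIA entries in lspci-style output. The helper name `number_gpus` is hypothetical, and it assumes the GPU index follows PCI bus order, which may not match how the device plugin numbers the cards.

```shell
# Hypothetical helper: number NVIDIA devices found in lspci-style input.
# Assumption: the GPU index follows PCI bus order (the order lspci prints),
# which may differ from the numbering the device plugin uses.
number_gpus() {
  # Keep only NVIDIA lines, then print a zero-based counter and the bus address.
  grep NVIDIA | awk '{ printf "%d %s\n", NR - 1, $1 }'
}

# Usage on a real machine:
#   lspci | number_gpus
```

On the machine that fails, this would pair the entries from b4:00.0 onward with indexes 4-7, if the ordering assumption holds.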
Hello everyone,
I'm using the device plugin on a machine that has 8 GPUs. I can use GPUs 0-3 without any issue, but pods scheduled to GPU indexes 4-7 fail to start.
Some more info on my setup.
Error from kubectl describe pod:
And this is the error from the gpushare-device-plugin logs.
But I do see all the GPUs just fine with kubectl-inspect-gpushare.
Has anyone else seen this issue, or any idea what might be causing it?