Device plugin matching fails for a single pod with multiple GPU containers

baozhiming commented 3 years ago

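When using the GPU share extender, I deployed two containers in one pod. In the first case both containers need a GPU: the first requests 1G and the second requests 2G. The device plugin log shows that the pod requests 3G of GPU memory while the container requests only 1G, so the allocation is treated as invalid, the match fails, and pod creation fails. In the second case one container needs a GPU and the other does not, and that pod deploys successfully.

So is my understanding correct that this project does not implement the case of multiple GPU containers in a single pod, and only performs a simple check of whether the allocation is equal? When we need multiple GPU containers in one pod, is it enough to change the allocation condition from "equal" to "less than or equal", or is there another approach?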

I0111 11:25:33.570389       1 podmanager.go:123] list pod binpack-4-656c8d5cbc-dsvpp in ns mb-test in node editmax-pc and status is Pending
I0111 11:25:33.570423       1 podutils.go:91] Found GPUSharedAssumed assumed pod binpack-4-656c8d5cbc-dsvpp in namespace mb-test.
I0111 11:25:33.570436       1 podmanager.go:157] candidate pod binpack-4-656c8d5cbc-dsvpp in ns mb-test with timestamp 1610364333371132383 is found.
I0111 11:25:33.570451       1 allocate.go:70] Pod binpack-4-656c8d5cbc-dsvpp in ns mb-test request GPU Memory 3 with timestamp 1610364333371132383
W0111 11:25:33.570469       1 allocate.go:152] invalid allocation requst: request GPU memory 1 can't be satisfied.

I0111 11:58:18.735043       1 podmanager.go:123] list pod binpack-3-5654c7ccbb-7rjqd in ns mb-test in node editmax-pc and status is Pending
I0111 11:58:18.735064       1 podutils.go:91] Found GPUSharedAssumed assumed pod binpack-3-5654c7ccbb-7rjqd in namespace mb-test.
I0111 11:58:18.735075       1 podmanager.go:157] candidate pod binpack-3-5654c7ccbb-7rjqd in ns mb-test with timestamp 1610366298620836846 is found.
I0111 11:58:18.735088       1 allocate.go:70] Pod binpack-3-5654c7ccbb-7rjqd in ns mb-test request GPU Memory 3 with timestamp 1610366298620836846
I0111 11:58:18.735099       1 allocate.go:80] Found Assumed GPU shared Pod binpack-3-5654c7ccbb-7rjqd in ns mb-test with GPU Memory 3
I0111 11:58:18.735113       1 server.go:70] Get devIndexMap: map[1:GPU-516466d5-c914-2b55-685d-a1857e54d0c0 0:GPU-92c4b30a-7269-34e2-0673-1eeb6ebbfc35]

baozhiming commented 3 years ago

someone?

nicozhang commented 3 years ago

> someone?

The problem I ran into is that after finishing the configuration, docker inspect gpushare reports a segmentation fault. Which kubectl version are you using?

baozhiming commented 3 years ago

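> The problem I ran into is that after finishing the configuration, docker inspect gpushare reports a segmentation fault. Which kubectl version are you using?

My version is 1.17.3. The error you are reporting means the allocated memory is larger than the remaining memory.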

nicozhang commented 3 years ago

> My version is 1.17.3. The error you are reporting means the allocated memory is larger than the remaining memory.

My problem is solved. The kubectl-inspect-gpushare executable had not finished downloading, so running it directly produced the error.

nicozhang commented 3 years ago

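I ran into a new problem. I changed memory-unit to MiB in device-plugin-ds.yaml, then deleted the pod and re-created it, but I can still only create pods with GB as the unit. Even when using GB, nvidia-smi shows usage exceeding 1G. Has anyone run into this?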

baozhiming commented 2 years ago

I'll see whether I can reproduce it on my side.

sloth2012 commented 1 year ago

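> I ran into a new problem. I changed memory-unit to MiB in device-plugin-ds.yaml, then deleted the pod and re-created it, but I can still only create pods with GB as the unit. Even when using GB, nvidia-smi shows usage exceeding 1G. Has anyone run into this?

Has this been solved?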

AliyunContainerService / gpushare-scheduler-extender

Device plugin matching fails for a single pod with multiple GPU containers #143