AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Allocated 1 GiB of GPU memory to a pod, but it used 5 GiB #98

Closed. onlytiancai closed this issue 4 years ago.

onlytiancai commented 4 years ago

[screenshot attached]

I allocated 1 GiB of GPU memory to a pod, but when it ran a TensorFlow job it consumed nearly all 6 GiB of the card's memory. New pods started afterwards failed immediately with GPU out-of-memory errors. How can I solve this? Thanks.

If you need any other information, let me know and I will add it.
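For reference, the pod requests its share through the extended resource aliyun.com/gpu-mem. A rough sketch of an equivalent pod created with the official Kubernetes Python client (the image tag and container name are placeholders, not my exact spec; only the aliyun.com/gpu-mem limit matters here):

```python
from kubernetes import client, config

config.load_kube_config()

# Placeholder pod: the only detail that matters is the aliyun.com/gpu-mem limit,
# which the gpushare scheduler extender uses for its allocation bookkeeping.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="notebook-0001"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="notebook",
                image="tensorflow/tensorflow:2.1.0-gpu-py3",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"aliyun.com/gpu-mem": "1"}  # request a 1 GiB share
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The node reports aliyun.com/gpu-mem: 5 as its capacity, and the describe output below shows 1 of those 5 recorded as allocated to this pod.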

```
~# kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

~# docker info
Client:
 Debug Mode: false

Server:
 Containers: 54
  Running: 23
  Paused: 0
  Stopped: 31
 Images: 38
 Server Version: 19.03.6
 Runtimes: nvidia runc
 Default Runtime: nvidia

~# kubectl describe node ubuntu-server-1 | grep /gpu
  aliyun.com/gpu-count:  1
  aliyun.com/gpu-mem:    5
  nvidia.com/gpu:        0
  aliyun.com/gpu-count:  1
  aliyun.com/gpu-mem:    5
  nvidia.com/gpu:        0
  aliyun.com/gpu-count   0    0
  aliyun.com/gpu-mem     1    1
  nvidia.com/gpu         0    0

~# kubectl describe pod notebook-0001 | grep /gpu
      aliyun.com/gpu-mem:  1
      aliyun.com/gpu-mem:  1
```

onlytiancai commented 4 years ago

I figured it out: this plugin provides no GPU isolation at all. If you want to cap a pod's GPU memory usage, the plugin does not help with that. All it does is, when a new pod is scheduled, check how much GPU memory has already been handed out and whether the node still has enough left to satisfy the new request.
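Since the extender only does this scheduling-time bookkeeping, the memory cap has to be enforced inside the workload itself. A minimal sketch for TensorFlow 2.x, capping the process at roughly the 1 GiB that was requested (the 1024 MiB figure is my reading of the share size; newer TF versions spell the same thing tf.config.set_logical_device_configuration):

```python
import tensorflow as tf

# By default TensorFlow maps nearly all free GPU memory, which is how a pod
# with a 1 GiB aliyun.com/gpu-mem share ends up holding ~6 GiB. Limit this
# process to a 1 GiB virtual device instead (TF 2.0/2.1 experimental API).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)],
    )
```

This is only a cooperative limit, not real isolation: a process that ignores it can still exceed its share, but it keeps a well-behaved TensorFlow job inside the amount it declared to the scheduler.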

denverdino commented 4 years ago

Alibaba Cloud offers a GPU resource isolation solution. If you are deploying on Alibaba Cloud, you can leave your contact information.

onlytiancai commented 4 years ago

Hello, in the end we plan to use Alibaba Cloud (aliyun). How can I reach you? My email is wawasoft@qq.com and my WeChat is onlytiancai.