Closed: onlytiancai closed this issue 4 years ago
I allocated 1 GB of GPU memory to a pod, but the TensorFlow job running in it grabbed nearly all 6 GB of the card's memory. When I then started a new pod to run another job, it immediately failed with a GPU out-of-memory error. How can I solve this? Thanks. (See also the note after the command output below.)
If any other information is needed, I will add it.
```
~# kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
```
```
~# docker info
Client:
 Debug Mode: false

Server:
 Containers: 54
  Running: 23
  Paused: 0
  Stopped: 31
 Images: 38
 Server Version: 19.03.6
 Runtimes: nvidia runc
 Default Runtime: nvidia
```
```
~# kubectl describe node ubuntu-server-1 | grep /gpu
  aliyun.com/gpu-count:  1
  aliyun.com/gpu-mem:    5
  nvidia.com/gpu:        0
  aliyun.com/gpu-count:  1
  aliyun.com/gpu-mem:    5
  nvidia.com/gpu:        0
  aliyun.com/gpu-count   0    0
  aliyun.com/gpu-mem     1    1
  nvidia.com/gpu         0    0

~# kubectl describe pod notebook-0001 | grep /gpu
      aliyun.com/gpu-mem:  1
      aliyun.com/gpu-mem:  1
```
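For context: by default TensorFlow maps nearly all free memory on the visible GPU as soon as the first session is created, regardless of the pod's aliyun.com/gpu-mem request, which is why a 1 GB allocation can still end up consuming ~6 GB. A minimal sketch of the kind of cap the training code itself would need to set (TF 1.x-style API; the fraction below is only an illustration for a 1 GB share of a ~6 GB card):

```python
import tensorflow as tf

# Without explicit GPU options, the first session maps (almost) the whole GPU,
# no matter what the Kubernetes resource request says.
gpu_options = tf.compat.v1.GPUOptions(
    per_process_gpu_memory_fraction=1.0 / 6.0,  # illustrative: ~1 GB of a ~6 GB card
    allow_growth=True,                          # allocate on demand instead of up front
)
config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)
sess = tf.compat.v1.Session(config=config)
```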
I figured it out: this plugin provides no GPU isolation at all. If the goal is to cap how much GPU memory a given pod can use, the plugin does not help. All it does is, when a pod is scheduled, check how much GPU memory has already been handed out and whether there is enough left to hand out more.
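In other words, since the plugin only does scheduling-time bookkeeping, any hard limit has to be enforced by the application inside the pod. A minimal TF 2.x sketch of that; it assumes the gpushare device plugin exposes the per-container quota (in GiB) through the ALIYUN_COM_GPU_MEM_CONTAINER environment variable, which you should verify inside your own container, and it falls back to the 1 GB from this issue:

```python
import os
import tensorflow as tf

# Assumption: the gpushare device plugin injects the container's quota (GiB)
# as ALIYUN_COM_GPU_MEM_CONTAINER; verify this in your pod, fall back to 1 GiB.
quota_gib = int(os.environ.get("ALIYUN_COM_GPU_MEM_CONTAINER", "1"))

gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    # Hard-cap this process at the requested amount instead of letting
    # TensorFlow map the entire card.
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=quota_gib * 1024)],
    )
```

Note that the virtual device configuration has to be applied before the GPU is first used, otherwise TensorFlow raises a RuntimeError.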
Alibaba Cloud offers a GPU resource isolation solution. If you are deploying on Alibaba Cloud, you can leave your contact information.
Hi, we do plan to go with Aliyun in the end. How can I reach you? My email is wawasoft@qq.com and my WeChat is onlytiancai.