AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Problem with automatic GPU card allocation #101

Open guobingithub opened 4 years ago

guobingithub commented 4 years ago

Hello, my GPU server has 4 GPU cards (7611MiB each). Three containers are currently running on gpu0 and together use 7601MiB. When I run a new container, I expect it to land on gpu1, gpu2, or gpu3, but it is not scheduled on any of them at all! It actually fails to run (CrashLoopBackOff)!

```
root@server:~# kubectl get po
NAME                         READY   STATUS             RESTARTS   AGE
binpack-1-5cb847f945-7dp5g   1/1     Running            0          3h33m
binpack-2-7fb6b969f-s2fmh    1/1     Running            0          64m
binpack-3-84d8979f89-d6929   1/1     Running            0          59m
binpack-4-669844dd5f-q9wvm   0/1     CrashLoopBackOff   15         56m
ngx-dep1-69c964c4b5-9d7cp    1/1     Running            0          102m
```
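One way to dig into why binpack-4 keeps crashing is to look at the failing pod's events and its previous logs with standard kubectl commands; the pod name below is taken from the listing above:

```sh
# Scheduling events and the container's termination reason
kubectl describe pod binpack-4-669844dd5f-q9wvm

# stderr/stdout from the last crashed attempt
kubectl logs binpack-4-669844dd5f-q9wvm --previous
```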

My GPU server info:

```
root@server:~# nvidia-smi
Wed May 20 18:18:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   65C    P0    25W /  75W |   7601MiB /  7611MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   32C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24689      C   python                                      7227MiB |
|    0     45236      C   python                                       151MiB |
|    0     47646      C   python                                       213MiB |
+-----------------------------------------------------------------------------+
```

And my binpack-4.yaml is below:

```
root@server:/home/guobin/gpu-repo# cat binpack-4.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
  labels:
    app: binpack-4
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB
            aliyun.com/gpu-mem: 200
```

As you can see, the aliyun.com/gpu-mem limit is 200MiB.

OK, that is all the relevant info. Why can't this plugin allocate a GPU card automatically? Or is there something I need to modify?

Thanks for your help!

guobingithub commented 4 years ago

@cheyang can you help me? Thanks very much.

cheyang commented 4 years ago

I think 200MiB is not enough to run the TensorFlow application.
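For illustration only, one minimal way to try a larger limit (the reporter tries 7200MiB in the next comment; 7200 is assumed here to still fit on a single 7611MiB card) would be to edit the Deployment shown above and re-apply it:

```sh
# Raise the gpushare memory limit (unit assumed to be MiB, as in the YAML comment above),
# then re-apply the Deployment
sed -i 's|aliyun.com/gpu-mem: 200|aliyun.com/gpu-mem: 7200|' binpack-4.yaml
kubectl apply -f binpack-4.yaml
```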

guobingithub commented 4 years ago

@cheyang OK, thank you! As you suggested, I set it to 7200MiB, but that did not work either; binpack-4 still fails to start.

The problem is that I have 4 GPU cards, each with 7611MiB, and when I run binpack-1/binpack-2/binpack-3/binpack-4, all 4 containers end up on gpu0, and binpack-4 fails to run.

Why can't these 4 containers be placed on the other GPU cards automatically?

cheyang commented 4 years ago

Did you install kubectl-inspect-gpushare? You can check the allocation with that CLI.
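For reference, the check is the plugin subcommand below; it prints, per node, how much aliyun.com/gpu-mem is allocated on each physical GPU (an example of its output appears in a later comment in this thread):

```sh
# Requires the kubectl-inspect-gpushare plugin binary to be installed on the PATH
kubectl inspect gpushare
```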

lizongnan commented 4 years ago

@guobingithub Did you solve the problem? I am running into the same one and need help.

lizongnan commented 4 years ago

Hello @cheyang, I have installed kubectl-inspect-gpushare. Below is the output of `kubectl inspect gpushare` and `nvidia-smi`. As you can see, the pending pods request 18960MiB of GPU memory in total, which is significantly larger than the memory of a single GPU. Even so, these pods are not deployed to the other GPUs (GPUs 1-3 on master and GPUs 0-3 on node6). What is the reason? Looking forward to your help!

```
[root@master k8s] kubectl inspect gpushare
NAME    IPADDRESS     GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  PENDING(Allocated)  GPU Memory(MiB)
master  192.168.4.15  0/11178                0/11178                0/11178                0/11178                18960               18960/44712
node6   192.168.4.16   0/11178                0/11178                0/11178                0/11178                                    0/44712

Allocated/Total GPU Memory In Cluster: 18960/89424 (21%)
```

```
[root@master k8s] nvidia-smi
Wed Aug 12 05:26:03 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   32C    P8     8W / 250W |  11114MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   32C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   36C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   35C    P8    10W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```