AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Problem with automatic GPU card allocation #101

Open guobingithub opened 4 years ago

guobingithub commented 4 years ago

Hello, my GPU server has 4 GPU cards (7611MiB each). Three containers are currently running on gpu0 and together use 7601MiB. When I run a new container, I expect it to land on gpu1, gpu2, or gpu3, but it is not scheduled on any of them at all! It actually fails to run (CrashLoopBackOff)!

```
root@server:~# kubectl get po
NAME                         READY   STATUS             RESTARTS   AGE
binpack-1-5cb847f945-7dp5g   1/1     Running            0          3h33m
binpack-2-7fb6b969f-s2fmh    1/1     Running            0          64m
binpack-3-84d8979f89-d6929   1/1     Running            0          59m
binpack-4-669844dd5f-q9wvm   0/1     CrashLoopBackOff   15         56m
ngx-dep1-69c964c4b5-9d7cp    1/1     Running            0          102m
```
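One way to dig into why binpack-4 keeps crashing is to look at the failing pod's events and its previous logs with standard kubectl commands; the pod name below is taken from the listing above:

```sh
# Scheduling events and the container's termination reason
kubectl describe pod binpack-4-669844dd5f-q9wvm

# stderr/stdout from the last crashed attempt
kubectl logs binpack-4-669844dd5f-q9wvm --previous
```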

My GPU server info:

```
root@server:~# nvidia-smi
Wed May 20 18:18:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   65C    P0    25W /  75W |   7601MiB /  7611MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   35C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P4            Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   32C    P8     6W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   38C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24689      C   python                                      7227MiB |
|    0     45236      C   python                                       151MiB |
|    0     47646      C   python                                       213MiB |
+-----------------------------------------------------------------------------+
```

And my binpack-4.yaml is below:

```
root@server:/home/guobin/gpu-repo# cat binpack-4.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
  labels:
    app: binpack-4
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB
            aliyun.com/gpu-mem: 200
```

As you can see, the aliyun.com/gpu-mem limit is 200MiB.

OK, that is all the relevant info. Why can't this plugin allocate a GPU card automatically? Or is there something I need to modify?

Thanks for your help!

guobingithub commented 4 years ago

@cheyang can you help me? Thanks very much.

cheyang commented 4 years ago

I think 200MiB is not enough to run the TensorFlow application.
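For illustration only, one minimal way to try a larger limit (the reporter tries 7200MiB in the next comment; 7200 is assumed here to still fit on a single 7611MiB card) would be to edit the Deployment shown above and re-apply it:

```sh
# Raise the gpushare memory limit (unit assumed to be MiB, as in the YAML comment above),
# then re-apply the Deployment
sed -i 's|aliyun.com/gpu-mem: 200|aliyun.com/gpu-mem: 7200|' binpack-4.yaml
kubectl apply -f binpack-4.yaml
```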

guobingithub commented 4 years ago

@cheyang OK, thank you! As you suggested, I set it to 7200MiB, but that did not work either; binpack-4 still fails to start.

The problem is that I have 4 GPU cards, each with 7611MiB, and when I run binpack-1/binpack-2/binpack-3/binpack-4, all 4 containers end up on gpu0, and binpack-4 fails to run.

Why can't these 4 containers be placed on the other GPU cards automatically?

cheyang commented 4 years ago

Did you install kubectl-inspect-gpushare? You can check the allocation with that CLI.
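For reference, the check is the plugin subcommand below; it prints, per node, how much aliyun.com/gpu-mem is allocated on each physical GPU (an example of its output appears in a later comment in this thread):

```sh
# Requires the kubectl-inspect-gpushare plugin binary to be installed on the PATH
kubectl inspect gpushare
```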

lizongnan commented 4 years ago

@guobingithub Did you solve the problem? I am running into the same one and need help.

lizongnan commented 4 years ago

Hello @cheyang, I have installed kubectl-inspect-gpushare. Below is the output of `kubectl inspect gpushare` and `nvidia-smi`. As you can see, the pending pods request 18960MiB of GPU memory in total, which is significantly larger than the memory of a single GPU. Even so, these pods are not deployed to the other GPUs (GPUs 1-3 on master and GPUs 0-3 on node6). What is the reason? Looking forward to your help!

```
[root@master k8s] kubectl inspect gpushare
NAME    IPADDRESS     GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  PENDING(Allocated)  GPU Memory(MiB)
master  192.168.4.15  0/11178                0/11178                0/11178                0/11178                18960               18960/44712
node6   192.168.4.16   0/11178                0/11178                0/11178                0/11178                                    0/44712

Allocated/Total GPU Memory In Cluster: 18960/89424 (21%)
```

```
[root@master k8s] nvidia-smi
Wed Aug 12 05:26:03 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   32C    P8     8W / 250W |  11114MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   32C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 23%   36C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   35C    P8    10W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```