Xilinx / FPGA_as_a_Service

https://docs.xilinx.com/r/en-US/Xilinx_Kubernetes_Device_Plugin/Xilinx_Kubernetes_Device_Plugin
Apache License 2.0

The fpga daemonset plugin does not come up as expected. #22

Closed lmaxeniro closed 3 years ago

lmaxeniro commented 3 years ago

I don't know what is happening here. The fpga daemonset plugin previously worked, but now it is completely broken. By "broken" I mean no daemonset pod gets created, and if I check the daemonset status specifically, I get the result below:

$ kubectl get daemonset -n kube-system
NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
fpga-device-plugin-daemonset   0         0         0       0            0           <none>                   115s
kube-flannel-ds                1         1         1       1            1           <none>                   96m
kube-proxy                     1         1         1       1            1           kubernetes.io/os=linux   98m
------
$ kubectl describe ds fpga-device-plugin-daemonset -n kube-system
Name:           fpga-device-plugin-daemonset
Selector:       name=xilinx-fpga-device-plugin
Node-Selector:  <none>
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       name=xilinx-fpga-device-plugin
  Annotations:  scheduler.alpha.kubernetes.io/critical-pod: 
  Containers:
   xilinx-fpga-device-plugin:
    Image:        xilinx_k8s_fpga_plugin_lma:0.1
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
Events:            <none>

The other components on the node work perfectly:

$ kubectl get pod -n kube-system
NAME                                     READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-6n4pj                  1/1     Running   0          74m
coredns-f9fd979d6-9w5wh                  1/1     Running   0          74m
etcd-xeniro-fpga-pc                      1/1     Running   0          74m
kube-apiserver-xeniro-fpga-pc            1/1     Running   0          74m
kube-controller-manager-xeniro-fpga-pc   1/1     Running   0          74m
kube-flannel-ds-gc9mp                    1/1     Running   0          22m
kube-proxy-jnssx                         1/1     Running   0          74m
kube-scheduler-xeniro-fpga-pc            1/1     Running   0          74m

All of this started after a failure in my own container (an FPGA test whose container ended up evicted), but I have already rebooted the machine and all the K8s components restarted from scratch afterwards. For the fpga plugin image I have been using the Docker image built directly from the current repo (previously I used the older xilinxatg/xilinx_k8s_fpga_plugin image, whose tag is about a year old); both show the same issue. Please let me know what you suggest.

yuzhang66 commented 3 years ago

@lmaxeniro Hi Liang,

Does the fpga device plugin pod in kube-system show an Evicted status? If so, can you post the describe output for the evicted pod?
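For example, something like the following would show it (the pod name below is just a placeholder):

$ kubectl get pods -n kube-system | grep fpga
$ kubectl describe pod <fpga-device-plugin-pod-name> -n kube-system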

We just finished the legal scan for a new fpga plugin image. We will update it very soon.

Thanks, Yu

lmaxeniro commented 3 years ago

@yuzhang66 "Does the fpga device plugin pod in kube-system show an Evicted status?" Yes, it did, and after that it never got back to normal, even after I rebooted the machine. That suggests something outside the Docker container is being affected?

However, the eviction cannot be reproduced now, so I cannot provide the log.

yuzhang66 commented 3 years ago

@lmaxeniro It looks like you also tried to reinstall the plugin daemonset. This seems like a k8s or Docker issue. I once ran into a similar situation where the newly created device plugin pod kept being evicted, and the cause was the Docker Root Dir being full. I will do some research and let you know if I have any progress.
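For reference, a quick way to check that (assuming the default Docker Root Dir location) is something like:

$ docker info --format '{{.DockerRootDir}}'
$ df -h /var/lib/docker
$ docker system df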

Thanks. Yu

lmaxeniro commented 3 years ago

@yuzhang66 Yes, I already tried deleting the plugin daemonset and re-creating it; that didn't help. Thanks for the reminder. I just checked with the docker commands: no container (xilinx-fpga-device-plugin) gets created for the daemonset at all. The volume holding the Docker Root Dir has enough free space, and I also ran docker image prune to clean up (see the commands below). I am still stuck here. Please share any other suggestions you have.
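Roughly the checks I mean (the grep pattern is just illustrative):

$ docker ps -a | grep xilinx      # no container created for the daemonset
$ docker image prune              # clean up dangling images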

lmaxeniro commented 3 years ago

@yuzhang66 After spending some time debugging, I finally found the root cause: I had forgotten to run kubectl taint to change the master node's taint config (which defaults to NoSchedule). After untainting the master and re-creating the daemonset, the DS comes up (see the commands below). But looking at the fpga yaml file, it already has a toleration with key CriticalAddonsOnly and operator: Exists; from what I found while searching around, that setting is supposed to make scheduling around taints easier, yet it does not seem to take effect at all. What is the logic behind these settings? Can you please give an explanation? Thanks a lot.
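For reference, untainting the master looks roughly like this (the node name is a placeholder):

$ kubectl taint nodes <master-node-name> node-role.kubernetes.io/master:NoSchedule-

And this is the toleration I mean from the plugin yaml. As far as I can tell, a toleration only matches taints with the same key, so this one only covers a CriticalAddonsOnly taint and does not by itself cover the master NoSchedule taint:

tolerations:
- key: CriticalAddonsOnly
  operator: Exists

An alternative (my assumption, not something already in the repo yaml) would be to add a toleration that explicitly matches the master taint instead of removing the taint from the node:

tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule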

lmaxeniro commented 3 years ago

@yuzhang66 One more question: is there a base docker image for Ubuntu 18.04 available? I think what is suggested in the document, "FROM xilinxatg/aws-fpga-verify:20200131", is CentOS based... Or would it be possible to include in the guide what needs to be built into the Dockerfile if I start from an Ubuntu base image? I think XRT is necessary; anything else?

yuzhang66 commented 3 years ago

@lmaxeniro For the daemonset issue, I'm building a new k8s cluster to check if I can reproduce it. About the base docker image: yes, you can use Ubuntu 18.04 as the base; just make sure the XRT version running inside the docker image is the same as the one installed on the worker node (see the rough sketch below). I will add a more detailed tutorial in the coming update. Thanks.
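As a rough sketch (the XRT .deb filename below is just a placeholder; use the same XRT release that is installed on the worker node):

# Minimal Ubuntu 18.04 based image with XRT installed
FROM ubuntu:18.04

# Copy in the XRT package matching the worker node's XRT version
COPY xrt_<version>_18.04-amd64-xrt.deb /tmp/xrt.deb

# Install XRT and its dependencies, then clean up
RUN apt-get update && \
    apt-get install -y /tmp/xrt.deb && \
    rm -rf /var/lib/apt/lists/* /tmp/xrt.deb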