Closed lmaxeniro closed 3 years ago
@lmaxeniro Hi Liang,
Does the FPGA device plugin pod in kube-system show a status of Evicted? If so, can you post the kubectl describe output for the evicted pod?
We just finished the legal scan for a new fpga plugin image. We will update it very soon.
Thanks, Yu
@yuzhang66 Yes, the FPGA device plugin pod in kube-system did show Evicted. After that it never returned to normal, even after I rebooted the machine. That suggests something outside the Docker container is affected?
However, I cannot reproduce the eviction now, so I cannot provide the log.
@lmaxeniro It looks like you also tried to reinstall the plugin daemonset. This seems like a k8s or Docker issue. I once hit a similar situation where the newly created device plugin pod was always evicted, and the cause was that the Docker Root Dir was full. I will do some research and let you know if I make any progress.
Thanks. Yu
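A quick way to rule out the full Docker Root Dir case is to measure free space on the filesystem that holds it. The following is only a minimal sketch: the function names are hypothetical, the 15% default mirrors what I understand to be the kubelet's default imagefs.available eviction threshold (your kubelet may be configured differently), and the real Docker Root Dir should be taken from the output of docker info rather than assumed.

```python
import shutil


def available_percent(path):
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def under_disk_pressure(path, min_available_percent=15.0):
    """Return True if free space is below the given percentage.

    The 15% default is an assumption based on the kubelet's default
    imagefs.available threshold; it is not read from the cluster.
    """
    return available_percent(path) < min_available_percent
```

For example, calling under_disk_pressure("/var/lib/docker") on the node would check the usual default Docker Root Dir, assuming that path is the one docker info reports.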
@yuzhang66 Yes, I tried deleting the plugin daemonset and re-creating it, but that didn't help. Thanks for the reminder. I just checked with the docker command: no container (xilinx-fpga-device-plugin) has been created for the daemonset. The volume holding my Docker Root Dir has enough space, and I also ran docker image prune to clean up. So far I am still stuck here. Please share any suggestions you have.
@yuzhang66 After some debugging I finally found the root cause: I had forgotten to run kubectl taint to change the master node's taint (which defaults to NoSchedule). After changing the taint on the master and re-creating the daemonset, the DS comes up. But looking at the fpga yaml file, it has a toleration with key CriticalAddonsOnly and operator Exists. From what I found, that setting seems meant to make taint handling easier, but it does not take effect at all? What is the logic behind these settings? Can you please give some explanation? Thanks a lot.
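On the toleration question, the general Kubernetes rule is that a toleration only covers a taint whose key and effect it matches, so a toleration for key CriticalAddonsOnly with operator Exists does not cover a differently keyed NoSchedule taint on the master. Below is a small sketch of that matching rule, not the scheduler's actual code. The class and function names are hypothetical, and the master taint key node-role.kubernetes.io/master is an assumption based on the kubeadm default of that period; check your node with kubectl describe node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with Exists matches every taint key
    operator: str = "Equal"  # Kubernetes defaults the operator to Equal
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(toleration, taint):
    """Return True if this single toleration covers this single taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.value == taint.value


def schedulable(tolerations, taints):
    """A pod can land on a node only if every blocking taint is tolerated."""
    blocking = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


# Assumed master taint (kubeadm default key at the time of this issue).
master_taint = Taint(key="node-role.kubernetes.io/master", effect="NoSchedule")
critical_only = Toleration(key="CriticalAddonsOnly", operator="Exists")
master_ok = Toleration(key="node-role.kubernetes.io/master",
                       operator="Exists", effect="NoSchedule")

print(schedulable([critical_only], [master_taint]))
print(schedulable([critical_only, master_ok], [master_taint]))
```

Under these assumptions, the daemonset would need either an additional toleration for the master taint or the taint removed from the node, which is consistent with what the kubectl taint change achieved.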
@yuzhang66 One more question: is there a base Docker image for Ubuntu 18.04 available? I think the one suggested in the documentation, "FROM xilinxatg/aws-fpga-verify:20200131", is CentOS-based. Or would it be possible for the guide to describe what needs to be built into the Dockerfile if I start from an Ubuntu Docker image? I assume XRT is necessary; anything else?
@lmaxeniro For the daemonset issue, I'm building a new k8s cluster to check whether I can reproduce it. About the base Docker image: yes, you can use Ubuntu 18.04 as the base; just make sure the XRT version running in the Docker image is the same as the one installed on the worker node. I will add a more detailed tutorial in the coming update. Thanks.
I don't know what happened, but the FPGA plugin daemonset, which previously worked, is now completely "out of work". By "out of work" I mean that I cannot get the daemonset pod, and if I check the DS status specifically, I get the result below:
The other components of the Node work perfectly:
All of this started after a failure in my own container (an FPGA test in which the container was evicted), but I have already rebooted the computer, and all the K8s components restarted from scratch afterwards. For the FPGA plugin image I have been using the Docker image built directly from the current repo (previously I used the older xilinxatg/xilinx_k8s_fpga_plugin image, tagged about a year ago); both have the same issue. Please let me know what you suggest.