Xilinx / FPGA_as_a_Service

https://docs.xilinx.com/r/en-US/Xilinx_Kubernetes_Device_Plugin/Xilinx_Kubernetes_Device_Plugin
Apache License 2.0

Add XRT Edge support #15

Closed: paroque28 closed this issue 2 years ago

paroque28 commented 4 years ago

Tested with: Xilinx ZynqMP ZCU102, XRT 2019.2.

# kubectl apply -f fpga-device-plugin.yml 
daemonset.apps/fpga-device-plugin-daemonset configured
# kubectl get pods --namespace=kube-system
NAME                                      READY   STATUS             RESTARTS   AGE
fpga-device-plugin-daemonset-cpdwd        1/1     Running            0          13s

# kubectl logs fpga-device-plugin-daemonset-cpdwd --namespace=kube-system
time="2020-04-30T01:25:56Z" level=info msg="Starting FS watcher."
time="2020-04-30T01:25:56Z" level=info msg="Starting OS watcher."
time="2020-04-30T01:25:56Z" level=info msg="Starting to serve on /var/lib/kubelet/device-plugins/ZynqMP-ZCU102-fpga.sock"
2020/04/30 01:25:56 grpc: Server.Serve failed to create ServerTransport:  connection error: desc = "transport: write unix /var/lib/kubelet/device-plugins/ZynqMP-ZCU102-fpga.sock->@: write: broken pipe"
time="2020-04-30T01:25:56Z" level=info msg="Registered device plugin with Kubelet xilinx.com/fpga-ZynqMP-ZCU102"
time="2020-04-30T01:25:56Z" level=info msg="Sending 1 device(s) [&Device{ID:1,Health:Healthy,}] to kubelet"
time="2020-04-30T01:26:05Z" level=info msg="Receiving request 1"
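The log above shows the standard device-plugin flow: serve on the kubelet socket, register, then stream the device list. A simplified, self-contained mirror of the types involved, just to illustrate what is being advertised (the real plugin uses the generated gRPC types from the kubelet device-plugin v1beta1 API; everything here is a sketch):

```go
package main

import "fmt"

// Simplified local mirror of the kubelet device-plugin v1beta1 Device type;
// the real plugin uses the generated gRPC types instead.
type Device struct {
	ID     string
	Health string
}

const Healthy = "Healthy"

// listDevices sketches what ListAndWatch advertises in this test: a single
// healthy FPGA, matching the "Sending 1 device(s) [&Device{ID:1,Health:Healthy,}]"
// log line above.
func listDevices() []*Device {
	return []*Device{{ID: "1", Health: Healthy}}
}

func main() {
	devs := listDevices()
	fmt.Printf("Sending %d device(s) to kubelet\n", len(devs))
}
```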

# kubectl describe node zcu102-zynqmp

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests   Limits
  --------                       --------   ------
  cpu                            100m (2%)  0 (0%)
  memory                         70Mi (1%)  170Mi (4%)
  ephemeral-storage              0 (0%)     0 (0%)
  xilinx.com/fpga-ZynqMP-ZCU102  1          1
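For reference, the extended resource shown above can be requested from a pod spec along these lines (a minimal sketch; the pod name and image are placeholders, and the actual edge-vitis-test-pod.yaml used below may differ):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fpga-request-demo                    # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: example.com/vitis-test:latest   # hypothetical image
      resources:
        limits:
          xilinx.com/fpga-ZynqMP-ZCU102: 1   # resource advertised by the plugin
```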

Test:

#  kubectl apply -f edge-vitis-test-pod.yaml
pod/vitis-test-pod created
# kubectl get pods

NAME             READY   STATUS      RESTARTS   AGE
vitis-test-pod   0/1     Completed   0          8s

# kubectl logs vitis-test-pod
Input Image size: 640 x 480 x 3
Input featuremap: 224 x 224
[TimeTest]dpuSetInputImage2             6914      us
[TimeTest]dpuRunTask                    52837     us
[TimeTest]dpuGetOutputTensorInHWCFP32   41        us
[TimeTest]CPUCalcSoftmax                86        us
TimeTest[0]  = 0.512263  name = grey fox, gray fox, Urocyon cinereoargenteus
[TimeTest]TopK                          108       us

This output confirms that all of the components are working.

Read all notes in the commit messages.

Build made with: https://github.com/Xilinx/FPGA_as_a_Service/pull/13

paroque28 commented 4 years ago

My only concern is the naming of the resource:

If we look at the Device Manager Proposal from Kubernetes: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md

One can read that:

"When launching kubectl describe nodes, the devices appear in the node status as vendor-domain/vendor-device."

Here we are not following the naming convention, using resource names such as "xilinx.com/fpga-xilinx_aws-vu9p-f1_dynamic_5_0-43981", which for PCIe devices is built as something like: "xilinx.com/fpga" + "-" + device.shellVer + "-" + device.timestamp

What I want to note here is that tying devices to a specific DSA version and timestamp makes it very difficult to orchestrate workloads. Say we have 1000 FPGA-ready nodes and 500 of them are busy: how would you schedule the workload onto the best available FPGA if the resource name is not the same for every FPGA? Is there any particular reason why you are doing this?
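To make the contrast concrete, here are the two naming schemes side by side in Go (the FPGADevice type and its ShellVer/Timestamp fields are assumptions for illustration, not the plugin's actual identifiers):

```go
package main

import "fmt"

// Hypothetical device record; the real plugin's field names may differ.
type FPGADevice struct {
	ShellVer  string
	Timestamp uint64
}

// specificName is the current scheme: the DSA version and timestamp are
// baked into the resource name, so every shell becomes a distinct resource
// that the scheduler cannot treat interchangeably.
func specificName(d FPGADevice) string {
	return fmt.Sprintf("xilinx.com/fpga-%s-%d", d.ShellVer, d.Timestamp)
}

// genericName is the alternative: one resource name for all FPGAs, with
// the shell details exposed elsewhere (e.g. node labels).
func genericName(d FPGADevice) string {
	return "xilinx.com/fpga"
}

func main() {
	d := FPGADevice{ShellVer: "xilinx_aws-vu9p-f1_dynamic_5_0", Timestamp: 43981}
	fmt.Println(specificName(d)) // xilinx.com/fpga-xilinx_aws-vu9p-f1_dynamic_5_0-43981
	fmt.Println(genericName(d))  // xilinx.com/fpga
}
```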

We could set the Kubernetes yaml file to something like this if we really need to fix a shell /XRT version:

spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          xilinx.com/fpga-xilinx: 1
          dsaversion: xilinx_aws-vu9p-f1_dynamic_5_0
          dsatimestamp: 43981

Instead of:

spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          xilinx.com/fpga-xilinx_aws-vu9p-f1_dynamic_5_0-43981: 1
paroque28 commented 4 years ago

@luciferlee

paroque28 commented 4 years ago

The other thing I can look at is how to read the DTB without a privileged container.

paroque28 commented 4 years ago

https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#clusters-containing-different-types-of-gpus
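That page handles clusters with heterogeneous GPUs by combining a generic resource name with node labels and a nodeSelector; the same pattern could be mirrored here (the label key fpga.xilinx.com/dsa is a made-up example, not an existing convention):

```yaml
# Hypothetical: nodes would first be labeled with their shell, e.g.
#   kubectl label nodes <node> fpga.xilinx.com/dsa=xilinx_aws-vu9p-f1_dynamic_5_0
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: demo-container-1
      image: k8s.gcr.io/pause:2.0
      resources:
        limits:
          xilinx.com/fpga: 1                                  # generic resource name
  nodeSelector:
    fpga.xilinx.com/dsa: xilinx_aws-vu9p-f1_dynamic_5_0       # pin the shell via label
```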

yuzhang66 commented 4 years ago

> https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#clusters-containing-different-types-of-gpus

Hi paroque28, thanks so much for your contribution. I'm now working on this feature with your code as a reference, and I'll be glad to let you know when we have an update.

paroque28 commented 4 years ago

Thanks @yuzhang66 for the effort