aws-samples / aws-efa-eks

Deploying EFA in EKS utilizing GPUDirectRDMA where supported
MIT No Attribution

update EFA plugin image to v0.5.0 #22

Closed zachdorame closed 5 months ago

zachdorame commented 7 months ago

Issue #, if available: N/A

Description of changes: Update EFA plugin image to v0.5.0, update list of EFA-capable instances

Testing: Tested by applying the manifest to a cluster. The EFA plugin logs show that the plugin discovers the InfiniBand devices on a p5.48xlarge instance:

doramebz@bcd074666b9c ~ % k logs -n kube-system aws-efa-k8s-device-plugin-daemonset-5xzrf
2024/03/13 22:16:00 Fetching EFA devices.
2024/03/13 22:16:00 device: rdmap79s0,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap79s0

2024/03/13 22:16:00 device: rdmap80s0,uverbs1,/sys/class/infiniband_verbs/uverbs1,/sys/class/infiniband/rdmap80s0

2024/03/13 22:16:00 device: rdmap81s0,uverbs2,/sys/class/infiniband_verbs/uverbs2,/sys/class/infiniband/rdmap81s0

2024/03/13 22:16:00 device: rdmap82s0,uverbs3,/sys/class/infiniband_verbs/uverbs3,/sys/class/infiniband/rdmap82s0

2024/03/13 22:16:00 device: rdmap96s0,uverbs4,/sys/class/infiniband_verbs/uverbs4,/sys/class/infiniband/rdmap96s0

2024/03/13 22:16:00 device: rdmap97s0,uverbs5,/sys/class/infiniband_verbs/uverbs5,/sys/class/infiniband/rdmap97s0

2024/03/13 22:16:00 device: rdmap98s0,uverbs6,/sys/class/infiniband_verbs/uverbs6,/sys/class/infiniband/rdmap98s0

2024/03/13 22:16:00 device: rdmap99s0,uverbs7,/sys/class/infiniband_verbs/uverbs7,/sys/class/infiniband/rdmap99s0

2024/03/13 22:16:00 device: rdmap113s0,uverbs8,/sys/class/infiniband_verbs/uverbs8,/sys/class/infiniband/rdmap113s0

2024/03/13 22:16:00 device: rdmap114s0,uverbs9,/sys/class/infiniband_verbs/uverbs9,/sys/class/infiniband/rdmap114s0

2024/03/13 22:16:00 device: rdmap115s0,uverbs10,/sys/class/infiniband_verbs/uverbs10,/sys/class/infiniband/rdmap115s0

2024/03/13 22:16:00 device: rdmap116s0,uverbs11,/sys/class/infiniband_verbs/uverbs11,/sys/class/infiniband/rdmap116s0

2024/03/13 22:16:00 device: rdmap130s0,uverbs12,/sys/class/infiniband_verbs/uverbs12,/sys/class/infiniband/rdmap130s0

2024/03/13 22:16:00 device: rdmap131s0,uverbs13,/sys/class/infiniband_verbs/uverbs13,/sys/class/infiniband/rdmap131s0

2024/03/13 22:16:00 device: rdmap132s0,uverbs14,/sys/class/infiniband_verbs/uverbs14,/sys/class/infiniband/rdmap132s0

2024/03/13 22:16:00 device: rdmap133s0,uverbs15,/sys/class/infiniband_verbs/uverbs15,/sys/class/infiniband/rdmap133s0

2024/03/13 22:16:00 device: rdmap147s0,uverbs16,/sys/class/infiniband_verbs/uverbs16,/sys/class/infiniband/rdmap147s0

2024/03/13 22:16:00 device: rdmap148s0,uverbs17,/sys/class/infiniband_verbs/uverbs17,/sys/class/infiniband/rdmap148s0

2024/03/13 22:16:00 device: rdmap149s0,uverbs18,/sys/class/infiniband_verbs/uverbs18,/sys/class/infiniband/rdmap149s0

2024/03/13 22:16:00 device: rdmap150s0,uverbs19,/sys/class/infiniband_verbs/uverbs19,/sys/class/infiniband/rdmap150s0

2024/03/13 22:16:00 device: rdmap164s0,uverbs20,/sys/class/infiniband_verbs/uverbs20,/sys/class/infiniband/rdmap164s0

2024/03/13 22:16:00 device: rdmap165s0,uverbs21,/sys/class/infiniband_verbs/uverbs21,/sys/class/infiniband/rdmap165s0

2024/03/13 22:16:00 device: rdmap166s0,uverbs22,/sys/class/infiniband_verbs/uverbs22,/sys/class/infiniband/rdmap166s0

2024/03/13 22:16:00 device: rdmap167s0,uverbs23,/sys/class/infiniband_verbs/uverbs23,/sys/class/infiniband/rdmap167s0

2024/03/13 22:16:00 device: rdmap181s0,uverbs24,/sys/class/infiniband_verbs/uverbs24,/sys/class/infiniband/rdmap181s0

2024/03/13 22:16:00 device: rdmap182s0,uverbs25,/sys/class/infiniband_verbs/uverbs25,/sys/class/infiniband/rdmap182s0

2024/03/13 22:16:00 device: rdmap183s0,uverbs26,/sys/class/infiniband_verbs/uverbs26,/sys/class/infiniband/rdmap183s0

2024/03/13 22:16:00 device: rdmap184s0,uverbs27,/sys/class/infiniband_verbs/uverbs27,/sys/class/infiniband/rdmap184s0

2024/03/13 22:16:00 device: rdmap198s0,uverbs28,/sys/class/infiniband_verbs/uverbs28,/sys/class/infiniband/rdmap198s0

2024/03/13 22:16:00 device: rdmap199s0,uverbs29,/sys/class/infiniband_verbs/uverbs29,/sys/class/infiniband/rdmap199s0

2024/03/13 22:16:00 device: rdmap200s0,uverbs30,/sys/class/infiniband_verbs/uverbs30,/sys/class/infiniband/rdmap200s0

2024/03/13 22:16:00 device: rdmap201s0,uverbs31,/sys/class/infiniband_verbs/uverbs31,/sys/class/infiniband/rdmap201s0

2024/03/13 22:16:00 EFA Device list: [{rdmap79s0 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap79s0} {rdmap80s0 uverbs1 /sys/class/infiniband_verbs/uverbs1 /sys/class/infiniband/rdmap80s0} {rdmap81s0 uverbs2 /sys/class/infiniband_verbs/uverbs2 /sys/class/infiniband/rdmap81s0} {rdmap82s0 uverbs3 /sys/class/infiniband_verbs/uverbs3 /sys/class/infiniband/rdmap82s0} {rdmap96s0 uverbs4 /sys/class/infiniband_verbs/uverbs4 /sys/class/infiniband/rdmap96s0} {rdmap97s0 uverbs5 /sys/class/infiniband_verbs/uverbs5 /sys/class/infiniband/rdmap97s0} {rdmap98s0 uverbs6 /sys/class/infiniband_verbs/uverbs6 /sys/class/infiniband/rdmap98s0} {rdmap99s0 uverbs7 /sys/class/infiniband_verbs/uverbs7 /sys/class/infiniband/rdmap99s0} {rdmap113s0 uverbs8 /sys/class/infiniband_verbs/uverbs8 /sys/class/infiniband/rdmap113s0} {rdmap114s0 uverbs9 /sys/class/infiniband_verbs/uverbs9 /sys/class/infiniband/rdmap114s0} {rdmap115s0 uverbs10 /sys/class/infiniband_verbs/uverbs10 /sys/class/infiniband/rdmap115s0} {rdmap116s0 uverbs11 /sys/class/infiniband_verbs/uverbs11 /sys/class/infiniband/rdmap116s0} {rdmap130s0 uverbs12 /sys/class/infiniband_verbs/uverbs12 /sys/class/infiniband/rdmap130s0} {rdmap131s0 uverbs13 /sys/class/infiniband_verbs/uverbs13 /sys/class/infiniband/rdmap131s0} {rdmap132s0 uverbs14 /sys/class/infiniband_verbs/uverbs14 /sys/class/infiniband/rdmap132s0} {rdmap133s0 uverbs15 /sys/class/infiniband_verbs/uverbs15 /sys/class/infiniband/rdmap133s0} {rdmap147s0 uverbs16 /sys/class/infiniband_verbs/uverbs16 /sys/class/infiniband/rdmap147s0} {rdmap148s0 uverbs17 /sys/class/infiniband_verbs/uverbs17 /sys/class/infiniband/rdmap148s0} {rdmap149s0 uverbs18 /sys/class/infiniband_verbs/uverbs18 /sys/class/infiniband/rdmap149s0} {rdmap150s0 uverbs19 /sys/class/infiniband_verbs/uverbs19 /sys/class/infiniband/rdmap150s0} {rdmap164s0 uverbs20 /sys/class/infiniband_verbs/uverbs20 /sys/class/infiniband/rdmap164s0} {rdmap165s0 uverbs21 /sys/class/infiniband_verbs/uverbs21 /sys/class/infiniband/rdmap165s0} {rdmap166s0 uverbs22 /sys/class/infiniband_verbs/uverbs22 /sys/class/infiniband/rdmap166s0} {rdmap167s0 uverbs23 /sys/class/infiniband_verbs/uverbs23 /sys/class/infiniband/rdmap167s0} {rdmap181s0 uverbs24 /sys/class/infiniband_verbs/uverbs24 /sys/class/infiniband/rdmap181s0} {rdmap182s0 uverbs25 /sys/class/infiniband_verbs/uverbs25 /sys/class/infiniband/rdmap182s0} {rdmap183s0 uverbs26 /sys/class/infiniband_verbs/uverbs26 /sys/class/infiniband/rdmap183s0} {rdmap184s0 uverbs27 /sys/class/infiniband_verbs/uverbs27 /sys/class/infiniband/rdmap184s0} {rdmap198s0 uverbs28 /sys/class/infiniband_verbs/uverbs28 /sys/class/infiniband/rdmap198s0} {rdmap199s0 uverbs29 /sys/class/infiniband_verbs/uverbs29 /sys/class/infiniband/rdmap199s0} {rdmap200s0 uverbs30 /sys/class/infiniband_verbs/uverbs30 /sys/class/infiniband/rdmap200s0} {rdmap201s0 uverbs31 /sys/class/infiniband_verbs/uverbs31 /sys/class/infiniband/rdmap201s0}]
2024/03/13 22:16:00 Starting FS watcher.
2024/03/13 22:16:00 Starting OS watcher.
2024/03/13 22:16:00 device: rdmap79s0,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap79s0

2024/03/13 22:16:00 device: rdmap80s0,uverbs1,/sys/class/infiniband_verbs/uverbs1,/sys/class/infiniband/rdmap80s0

2024/03/13 22:16:00 device: rdmap81s0,uverbs2,/sys/class/infiniband_verbs/uverbs2,/sys/class/infiniband/rdmap81s0

2024/03/13 22:16:00 device: rdmap82s0,uverbs3,/sys/class/infiniband_verbs/uverbs3,/sys/class/infiniband/rdmap82s0

2024/03/13 22:16:00 device: rdmap96s0,uverbs4,/sys/class/infiniband_verbs/uverbs4,/sys/class/infiniband/rdmap96s0

2024/03/13 22:16:00 device: rdmap97s0,uverbs5,/sys/class/infiniband_verbs/uverbs5,/sys/class/infiniband/rdmap97s0

2024/03/13 22:16:00 device: rdmap98s0,uverbs6,/sys/class/infiniband_verbs/uverbs6,/sys/class/infiniband/rdmap98s0

2024/03/13 22:16:00 device: rdmap99s0,uverbs7,/sys/class/infiniband_verbs/uverbs7,/sys/class/infiniband/rdmap99s0

2024/03/13 22:16:00 device: rdmap113s0,uverbs8,/sys/class/infiniband_verbs/uverbs8,/sys/class/infiniband/rdmap113s0

2024/03/13 22:16:00 device: rdmap114s0,uverbs9,/sys/class/infiniband_verbs/uverbs9,/sys/class/infiniband/rdmap114s0

2024/03/13 22:16:00 device: rdmap115s0,uverbs10,/sys/class/infiniband_verbs/uverbs10,/sys/class/infiniband/rdmap115s0

2024/03/13 22:16:00 device: rdmap116s0,uverbs11,/sys/class/infiniband_verbs/uverbs11,/sys/class/infiniband/rdmap116s0

2024/03/13 22:16:00 device: rdmap130s0,uverbs12,/sys/class/infiniband_verbs/uverbs12,/sys/class/infiniband/rdmap130s0

2024/03/13 22:16:00 device: rdmap131s0,uverbs13,/sys/class/infiniband_verbs/uverbs13,/sys/class/infiniband/rdmap131s0

2024/03/13 22:16:00 device: rdmap132s0,uverbs14,/sys/class/infiniband_verbs/uverbs14,/sys/class/infiniband/rdmap132s0

2024/03/13 22:16:00 device: rdmap133s0,uverbs15,/sys/class/infiniband_verbs/uverbs15,/sys/class/infiniband/rdmap133s0

2024/03/13 22:16:00 device: rdmap147s0,uverbs16,/sys/class/infiniband_verbs/uverbs16,/sys/class/infiniband/rdmap147s0

2024/03/13 22:16:00 device: rdmap148s0,uverbs17,/sys/class/infiniband_verbs/uverbs17,/sys/class/infiniband/rdmap148s0

2024/03/13 22:16:00 device: rdmap149s0,uverbs18,/sys/class/infiniband_verbs/uverbs18,/sys/class/infiniband/rdmap149s0

2024/03/13 22:16:00 device: rdmap150s0,uverbs19,/sys/class/infiniband_verbs/uverbs19,/sys/class/infiniband/rdmap150s0

2024/03/13 22:16:00 device: rdmap164s0,uverbs20,/sys/class/infiniband_verbs/uverbs20,/sys/class/infiniband/rdmap164s0

2024/03/13 22:16:00 device: rdmap165s0,uverbs21,/sys/class/infiniband_verbs/uverbs21,/sys/class/infiniband/rdmap165s0

2024/03/13 22:16:00 device: rdmap166s0,uverbs22,/sys/class/infiniband_verbs/uverbs22,/sys/class/infiniband/rdmap166s0

2024/03/13 22:16:00 device: rdmap167s0,uverbs23,/sys/class/infiniband_verbs/uverbs23,/sys/class/infiniband/rdmap167s0

2024/03/13 22:16:00 device: rdmap181s0,uverbs24,/sys/class/infiniband_verbs/uverbs24,/sys/class/infiniband/rdmap181s0

2024/03/13 22:16:00 device: rdmap182s0,uverbs25,/sys/class/infiniband_verbs/uverbs25,/sys/class/infiniband/rdmap182s0

2024/03/13 22:16:00 device: rdmap183s0,uverbs26,/sys/class/infiniband_verbs/uverbs26,/sys/class/infiniband/rdmap183s0

2024/03/13 22:16:00 device: rdmap184s0,uverbs27,/sys/class/infiniband_verbs/uverbs27,/sys/class/infiniband/rdmap184s0

2024/03/13 22:16:00 device: rdmap198s0,uverbs28,/sys/class/infiniband_verbs/uverbs28,/sys/class/infiniband/rdmap198s0

2024/03/13 22:16:00 device: rdmap199s0,uverbs29,/sys/class/infiniband_verbs/uverbs29,/sys/class/infiniband/rdmap199s0

2024/03/13 22:16:00 device: rdmap200s0,uverbs30,/sys/class/infiniband_verbs/uverbs30,/sys/class/infiniband/rdmap200s0

2024/03/13 22:16:00 device: rdmap201s0,uverbs31,/sys/class/infiniband_verbs/uverbs31,/sys/class/infiniband/rdmap201s0

2024/03/13 22:16:00 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
2024/03/13 22:16:00 Registered device plugin with Kubelet
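
For anyone repeating this test, counting the devices by eye is error-prone because the plugin prints each `device:` line twice. Below is a minimal Python sketch that extracts the distinct devices from saved log text. It assumes only the `device: <rdma>,<uverbs>,...` line format visible in the log above; the function name `discovered_devices` and the `SAMPLE` excerpt are illustrative and not part of the plugin.

```python
import re

# Matches the per-device lines shown in the log above, e.g.
#   2024/03/13 22:16:00 device: rdmap79s0,uverbs0,/sys/class/...
# The "EFA Device list: [...]" summary line is deliberately not matched.
DEVICE_LINE = re.compile(r"device: (?P<rdma>[^,\s]+),(?P<uverbs>uverbs\d+),")


def discovered_devices(log_text: str) -> dict:
    """Map each uverbs name to its RDMA device name, collapsing repeats."""
    devices = {}
    for line in log_text.splitlines():
        match = DEVICE_LINE.search(line)
        if match:
            devices[match.group("uverbs")] = match.group("rdma")
    return devices


# Short excerpt of the log above (illustrative; not the full output).
SAMPLE = """\
2024/03/13 22:16:00 Fetching EFA devices.
2024/03/13 22:16:00 device: rdmap79s0,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap79s0
2024/03/13 22:16:00 device: rdmap80s0,uverbs1,/sys/class/infiniband_verbs/uverbs1,/sys/class/infiniband/rdmap80s0
2024/03/13 22:16:00 Starting FS watcher.
2024/03/13 22:16:00 device: rdmap79s0,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap79s0
"""

if __name__ == "__main__":
    found = discovered_devices(SAMPLE)
    print(len(found), sorted(found))
```

Run against the full log above, the same function should report 32 distinct entries (uverbs0 through uverbs31).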

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

willgleich commented 5 months ago

Seems like this is an easy approve and merge? @zachdorame @bryantbiggs @amrragab8080

bryantbiggs commented 5 months ago

hey, apologies - I don't have merge rights currently but I can track down someone who does

also, we have a Helm chart for this now; if we could update that as well, it would be quite helpful: https://github.com/aws/eks-charts/tree/master/stable/aws-efa-k8s-device-plugin

@zachdorame is that something you are up for doing, or should we catch it in a follow-up PR?

willgleich commented 5 months ago

Agreed, the better option is to update the documentation and point users to the Helm chart for installation. We could then deprecate the YAML manifest here.

Edit: You may break quite a few samples and repos by removing it though - https://github.com/search?q=efa-k8s-device-plugin.yml&type=code

zachdorame commented 5 months ago

hey, I'd like to deprecate this yaml but as @willgleich pointed out it would break quite a few samples, so I'm not sure of the right approach. @bryantbiggs I do have a PR open to update the helm chart: https://github.com/aws/eks-charts/pull/1069

Also, apologies, this PR is out of date since the EFA plugin's latest version is now 0.5.0 rather than 0.4.4, so I'll put out a revision here.
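
The revision discussed in this thread comes down to changing the image tag in the manifest. A minimal Python sketch of that kind of bump follows. The `MANIFEST` fragment and its `example.registry` prefix are hypothetical placeholders, not the contents of the real manifest, and `bump_image_tag` is an illustrative helper rather than anything shipped in this repository.

```python
import re

# Hypothetical manifest fragment for illustration only; the registry and
# repository path in the real manifest may differ.
MANIFEST = """\
containers:
  - name: aws-efa-k8s-device-plugin
    image: example.registry/eks/aws-efa-k8s-device-plugin:v0.4.4
"""


def bump_image_tag(manifest: str, image_name: str, new_tag: str) -> str:
    """Replace the tag on every `image:` line that references image_name."""
    pattern = re.compile(r"(image:\s*\S*" + re.escape(image_name) + r"):\S+")
    # Keep everything up to the image name and swap only the tag.
    return pattern.sub(lambda m: m.group(1) + ":" + new_tag, manifest)


if __name__ == "__main__":
    print(bump_image_tag(MANIFEST, "aws-efa-k8s-device-plugin", "v0.5.0"))
```

Matching on the `image:` key leaves the container's `name:` line untouched even though it contains the same string.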