aws-samples / aws-efa-eks

Deploying EFA in EKS utilizing GPUDirectRDMA where supported
MIT No Attribution

Unable to run EFA jobs on EKS ubuntu nodes #9

Closed: ss12930 closed 1 year ago

ss12930 commented 1 year ago

We have an EKS cluster where we are trying to run EFA-enabled jobs on Ubuntu nodes, but we cannot get them running because of networking issues in pods on EFA-enabled nodes. The errors we see are usually related to pod communication: pods scheduled on EFA nodes are unable to communicate with each other.

We have observed these issues only on EFA-enabled Ubuntu nodes; the setup works fine on Amazon Linux nodes. We have been able to use EFA on non-EKS EC2 Ubuntu machines in the past, but for some reason it does not work on EKS Ubuntu nodes.

Our use case requires Ubuntu nodes, so we would like to understand whether there are any differences in how VPC CNI / EFA networking works on Ubuntu and Amazon Linux. Also, are there any similar use cases with a successful implementation of EFA on Ubuntu nodes in EKS? This is a major blocker for us; any recommendations or references would be highly appreciated.

ss12930 commented 1 year ago

Duplicate