aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

How to add EFA to instances (or custom resources requests) #6296

Open j-vizcaino opened 1 month ago

j-vizcaino commented 1 month ago

Description

How can the docs be improved?

Despite seeing support for EFA in the code, it is unclear how to actually enable it.

AFAICT, the EFA interfaces are enabled when the NodeClaim's resource requests include any number of vpc.amazonaws.com/efa devices, but since NodeClaim objects are created by Karpenter itself, it's unclear how to express that in either the NodePool or the EC2NodeClass definition.

NodePool used to have a spec.template.spec.resources field that looked promising, but it was removed in 0.33 (maybe that was a red herring?).

Any guidance on this would be appreciated, and the docs would benefit from an example of how to achieve it.

Thanks!

jmdeal commented 1 month ago

NodeClaim requests are based on the resource requests made by pods the NodeClaim is created for. Do you have a use case where you want to configure EFA when no pods are requesting it?

j-vizcaino commented 1 month ago

Thank you @jmdeal, I was able to figure this out later yesterday but forgot to update the issue. Indeed, creating a pod with a resources.requests containing vpc.amazonaws.com/efa: x will pick the right instance and create it 🎉
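
For readers landing here, a minimal sketch of such a pod follows. The pod name, container name, and image are placeholders, and the device count of 1 is an assumption; pick a value the target instance type supports. Kubernetes requires requests and limits to match for extended resources such as this one, so both are set.

```yaml
# Sketch only: a pod requesting one EFA device so that Karpenter
# provisions an EFA-capable instance for it.
apiVersion: v1
kind: Pod
metadata:
  name: efa-example                      # placeholder name
spec:
  containers:
    - name: app                          # placeholder container name
      image: example.com/my-app:latest   # placeholder image
      resources:
        requests:
          vpc.amazonaws.com/efa: 1       # assumed count
        limits:
          vpc.amazonaws.com/efa: 1       # must equal the request
```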

> Do you have a use case where you want to configure EFA when no pods are requesting it?

Actually, yes, but because of a combination of issues, mainly Karpenter being unaware of hugepages on target nodes. I'm aware that this is currently being discussed, but I would appreciate some guidance on this.

For context: we're running GPU-based nodes where we need support for EFA (as many interfaces as the node supports), but also hugepages, for GPU RDMA. To create the node, I cannot rely on "business" pods because Karpenter won't spin up nodes, as it is unaware of hugepages capacity. Therefore, I simply create a "fake/balloon" pod that requests all EFA interfaces that the underlying instance type supports and expect to see the instance come up properly configured.

The issue is that once nodes are created, I need to stop my fake/balloon pod to release the EFA devices so the real business pod can be scheduled. This makes orchestration harder to achieve, especially in the context of automatic replacement of nodes (think immutable infra, security compliance, ...).

In other words: having nodes come up with all EFA interfaces attached would greatly help. The fake balloon pods would only be needed to create the nodes (no hugepages request, no EFA request), and as soon as the nodes came up, the business pods would be scheduled.

j-vizcaino commented 1 month ago

Also, having EFA unconditionally for nodes of a nodepool would help with labelling the nodes, so that it's easier for aws-efa-k8s-device-plugin to target those nodes.

Right now, there's no way to differentiate (label-wise) two nodes coming from the same nodepool, where one has EFA and the other doesn't. This leads to the plugin failing on one node (no EFA) while working successfully on the other.

jmdeal commented 1 month ago

Got it, so it sounds like this is mainly a workaround for Karpenter not supporting hugepages. You can't use the business pods directly, so you use balloon pods which don't request hugepages and the business pods can schedule once the nodes are created and the resources are registered. Out of curiosity, what additional orchestration would you need to do if the balloon pods requested EFA resources? Are they not currently being preempted by the business pods?

Either way, I don't personally see any reason Karpenter shouldn't support this. IMO a reasonable semantic would be if vpc.amazonaws.com/efa is specified in requirements, instances would always be configured for EFA. Otherwise, the current dynamic behavior would be used. I'm not sure how much work this would entail, but I'm going to go ahead and change this to a feature request.

j-vizcaino commented 1 month ago

> Out of curiosity, what additional orchestration would you need to do if the balloon pods requested EFA resources? Are they not currently being preempted by the business pods?

The balloon pods would be preempted indeed, but they would end up in pending state, further triggering Karpenter again, creating more nodes. Obviously, that's based on the assumption that those pods would be driven by a deployment. Turning this into a job-based solution would solve that problem, but it would prevent using those same balloon pods again for automatic node replacement.
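
To make the Deployment-driven setup concrete, here is a rough sketch of what such balloon pods might look like. The names, the priority value, and the EFA count of 4 are all assumptions for illustration, not taken from the reporter's actual manifests.

```yaml
# Sketch only: a low-priority placeholder Deployment that requests EFA
# devices so Karpenter creates an EFA-enabled node.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon                # hypothetical name
value: -10                     # lower than the business pods' priority
globalDefault: false
description: "Placeholder pods that business pods may preempt."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: efa-balloon            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: efa-balloon
  template:
    metadata:
      labels:
        app: efa-balloon
    spec:
      priorityClassName: balloon
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing, only holds the resources
          resources:
            requests:
              vpc.amazonaws.com/efa: 4       # assumed: all EFA interfaces of the target instance type
            limits:
              vpc.amazonaws.com/efa: 4
```

With this shape, a preempted balloon pod is recreated by the Deployment, goes Pending, and can trigger another node, which is the drawback described above.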

> Either way, I don't personally see any reason Karpenter shouldn't support this. IMO a reasonable semantic would be if vpc.amazonaws.com/efa is specified in requirements, instances would always be configured for EFA. Otherwise, the current dynamic behavior would be used. I'm not sure how much work this would entail, but I'm going to go ahead and change this to a feature request.

Thank you! To add more thoughts on this, it seems really important for Karpenter to support and understand EFA for nodepools because EFA usually requires a dedicated security group, as well as a custom label so that aws-efa-k8s-device-plugin targets those nodes only. It makes a lot of sense for this setup to be opinionated and encoded into configuration (nodepool + ec2nodeclass) so the rest of the infra can support it properly.
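
As a rough illustration of the label and security group parts, a dedicated NodePool and EC2NodeClass might look like the sketch below. It assumes the v1beta1 APIs; every name, tag, role, the `example.com/efa` label, and the instance families are placeholders. It does not make Karpenter attach EFA interfaces unconditionally, which is the behavior requested in this issue.

```yaml
# Sketch only: a dedicated NodePool carrying a custom label, paired with
# an EC2NodeClass that selects an EFA-specific security group by tag.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-efa                          # hypothetical name
spec:
  template:
    metadata:
      labels:
        example.com/efa: "true"          # hypothetical label for the device plugin to select on
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: gpu-efa
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["p4d", "p5"]          # assumed EFA-capable GPU families
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-efa
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-my-cluster     # placeholder IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder discovery tag
  securityGroupSelectorTerms:
    - tags:
        example.com/efa: "true"          # hypothetical tag on the dedicated EFA security group
```

The device plugin DaemonSet could then use a nodeSelector on the same label, though that only helps if every node in the pool really comes up with EFA attached.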