j-vizcaino opened this issue 1 month ago
NodeClaim requests are based on the resource requests made by pods the NodeClaim is created for. Do you have a use case where you want to configure EFA when no pods are requesting it?
Thank you @jmdeal, I was able to figure this out later yesterday but forgot to update the issue.
Indeed, creating a pod with a `resources.requests` containing `vpc.amazonaws.com/efa: x` will pick the right instance and create it 🎉
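For readers landing here, a minimal sketch of such a pod (the pod name, image, and request count are illustrative, not from this thread):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: efa-test  # illustrative name
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest
      command: ["sleep", "infinity"]
      resources:
        requests:
          vpc.amazonaws.com/efa: "1"  # any non-zero count steers Karpenter toward an EFA-capable instance
        limits:
          vpc.amazonaws.com/efa: "1"  # extended resources must have requests == limits
```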
> Do you have a use case where you want to configure EFA when no pods are requesting it?
Actually, yes, but only because of a combination of issues, mainly related to Karpenter being unaware of hugepages on target nodes. I'm aware this is currently being discussed, but I would gladly appreciate some guidance here.
For context: we're running GPU-based nodes where we need support for EFA (as many interfaces as the node supports) as well as hugepages, for GPU RDMA. To create the node, I cannot rely on "business" pods because Karpenter won't spin up nodes, being unaware of the hugepages capacity. Therefore, I simply create a "fake/balloon" pod that requests all the EFA interfaces the underlying instance type supports and expect the instance to come up properly configured.
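A sketch of that balloon workaround, assuming a low-priority class exists so business pods can preempt it (the names, image, and EFA count are illustrative; the count depends on the instance type, e.g. 4 on a p4d.24xlarge):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: efa-balloon  # illustrative name
spec:
  priorityClassName: balloon-low-priority  # assumed PriorityClass; lets business pods preempt the balloon
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        limits:
          vpc.amazonaws.com/efa: "4"  # all EFA interfaces of the target instance type (instance-dependent)
```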
Now, the issue is that once nodes are created, I need to stop my fake/balloon pod to release the EFA devices so the real business pods can be scheduled. This makes orchestration harder to achieve, especially in the context of automatic node replacement (think immutable infra, security compliance, ...).
In other words: having nodes come up with all EFA interfaces attached would greatly help. The fake balloon pods would only need to trigger node creation (no hugepages request, no EFA), and as soon as the nodes came up, the business pods would be scheduled.
Also, having EFA unconditionally enabled for nodes of a nodepool would help with labelling the nodes, so that it's easier for `aws-efa-k8s-device-plugin` to target those nodes.
Right now, there's no way to differentiate (label-wise) two nodes coming from the same nodepool, where one has EFA and the other doesn't. This leads to the plugin failing on one node (no EFA) while working successfully on the other.
Got it, so it sounds like this is mainly a workaround for Karpenter not supporting hugepages. You can't use the business pods directly, so you use balloon pods which don't request hugepages and the business pods can schedule once the nodes are created and the resources are registered. Out of curiosity, what additional orchestration would you need to do if the balloon pods requested EFA resources? Are they not currently being preempted by the business pods?
Either way, I don't personally see any reason Karpenter shouldn't support this. IMO a reasonable semantic would be: if `vpc.amazonaws.com/efa` is specified in requirements, instances would always be configured for EFA; otherwise, the current dynamic behavior would be used. I'm not sure how much work this would entail, but I'm going to go ahead and change this to a feature request.
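Under that proposed semantic, a NodePool might look something like the sketch below. To be clear, this is hypothetical, not current Karpenter behavior; the names are illustrative, and the API version assumed is `v1beta1` (the era of the 0.33 release mentioned later in this thread):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-efa  # illustrative
spec:
  template:
    spec:
      requirements:
        # Hypothetical: the mere presence of this key would force EFA
        # configuration on every instance launched for this NodePool.
        - key: vpc.amazonaws.com/efa
          operator: Exists
      nodeClassRef:
        name: gpu-efa  # assumed EC2NodeClass name
```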
> Out of curiosity, what additional orchestration would you need to do if the balloon pods requested EFA resources? Are they not currently being preempted by the business pods?
The balloon pods would indeed be preempted, but they would end up in a pending state, triggering Karpenter again and creating more nodes. Obviously, that's based on the assumption that those pods are driven by a Deployment. Turning this into a Job-based solution would solve that problem, but it would prevent reusing those same balloon pods for automatic node replacement.
> Either way, I don't personally see any reason Karpenter shouldn't support this. IMO a reasonable semantic would be if `vpc.amazonaws.com/efa` is specified in requirements, instances would always be configured for EFA. Otherwise, the current dynamic behavior would be used. I'm not sure how much work this would entail, but I'm going to go ahead and change this to a feature request.
Thank you!
To add more thoughts: it seems really important for Karpenter to support and understand EFA at the nodepool level, because EFA usually requires a dedicated security group, as well as a custom label for `aws-efa-k8s-device-plugin` to target those nodes only.
It makes a lot of sense for this setup to be opinionated and encoded into configuration (nodepool + ec2nodeclass) for the rest of the infra to support this properly.
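One way this opinionated setup could be encoded, as a sketch only (the security group tag, label key, and resource names are assumptions, and other required `EC2NodeClass` fields such as `role` and `subnetSelectorTerms` are omitted; `v1beta1` API assumed):

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-efa  # illustrative
spec:
  amiFamily: AL2
  securityGroupSelectorTerms:
    - tags:
        Name: efa-dedicated-sg  # assumed tag on the EFA-dedicated security group
  # role, subnetSelectorTerms, etc. omitted for brevity
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-efa  # illustrative
spec:
  template:
    metadata:
      labels:
        efa.example.com/enabled: "true"  # assumed label; the device plugin's nodeSelector would target it
    spec:
      nodeClassRef:
        name: gpu-efa
```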
Description
How can the docs be improved?
Despite seeing support for EFA in the code, it is unclear how to actually enable it.
AFAICT, the EFA interfaces are enabled when the `NodeClaim` resources request any number of `vpc.amazonaws.com/efa` devices, but since `NodeClaim` objects are created by Karpenter itself, it's unclear how to express that in either the `NodePool` or the `EC2NodeClass` definition. `NodePool` used to have a `spec.template.spec.resources` that looked promising, but it was removed in 0.33 (maybe that was a red herring?). Any guidance on how to support this would be appreciated, and the docs would benefit from providing an example of how to achieve this.
Thanks!