aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Add support for Elastic Fabric Adapter (EFA) to Karpenter #3127

Closed: iankouls-aws closed this issue 7 months ago.

iankouls-aws commented 1 year ago

Problem Statement:

Some workloads (distributed training, simulations, HPC applications) require the high-performance networking that AWS provides on instances enabled with Elastic Fabric Adapter (EFA). The specific instance types are documented here. Currently, Karpenter does not recognize EFA resource requests or limits specified in Kubernetes manifests, such as the one below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              vpc.amazonaws.com/efa: 1
            limits:
              vpc.amazonaws.com/efa: 1

When such a manifest is applied to a Karpenter-enabled cluster, the Karpenter controller produces an error like the following:

ERROR   controller.provisioning Could not schedule pod, incompatible with provisioner "default", no instance type satisfied resources {"pods":"1","vpc.amazonaws.com/efa":"1"} and requirements karpenter.k8s.aws/instance-generation Exists >2, karpenter.sh/provisioner-name In [default], karpenter.sh/capacity-type In [on-demand spot], kubernetes.io/os In [linux], kubernetes.io/arch In [amd64], karpenter.k8s.aws/instance-category In [c m r]   {"commit": "f290d37-dirty", "pod": "default/inflate-fd9bc9f9b-xmkq2"}
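Roughly speaking, the error above means that when the pod's resource requests were compared against each candidate instance type's advertised capacity, none of them listed `vpc.amazonaws.com/efa`, so nothing could fit. The following is a minimal Python sketch of that kind of fit check, not Karpenter's actual Go code; the instance-type names and capacities are hypothetical.

```python
from typing import Dict, List

# Hypothetical instance-type capacities. None of them advertises the
# extended resource "vpc.amazonaws.com/efa", mirroring the issue.
INSTANCE_CAPACITY: Dict[str, Dict[str, int]] = {
    "example.large": {"cpu": 2, "pods": 29},
    "example.4xlarge": {"cpu": 16, "pods": 234},
}


def fits(requests: Dict[str, int], capacity: Dict[str, int]) -> bool:
    """True if every requested resource is advertised with enough capacity."""
    # A resource missing from the capacity map counts as zero available.
    return all(capacity.get(name, 0) >= qty for name, qty in requests.items())


def compatible_instance_types(requests: Dict[str, int],
                              catalog: Dict[str, Dict[str, int]]) -> List[str]:
    """Names of instance types whose capacity covers every request."""
    return sorted(t for t, cap in catalog.items() if fits(requests, cap))


if __name__ == "__main__":
    pod_requests = {"pods": 1, "vpc.amazonaws.com/efa": 1}
    if not compatible_instance_types(pod_requests, INSTANCE_CAPACITY):
        print(f"no instance type satisfied resources {pod_requests}")
```

Treating an unadvertised resource as zero capacity is what turns an unknown extended resource into a hard scheduling failure rather than something that is silently ignored.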

Feature Request:

Add the capability for Karpenter to recognize the vpc.amazonaws.com/efa resource, and to identify and provision a suitable EFA-enabled EC2 instance type.
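One way to meet this request would be for each instance type to advertise an EFA capacity derived from EC2 metadata. The sketch below assumes input shaped like the `NetworkInfo` section of the EC2 `DescribeInstanceTypes` response (`EfaSupported`, `EfaInfo.MaximumEfaInterfaces`). The instance-type names and values are hypothetical sample data, not real EC2 specifications, and the function names are illustrative, not Karpenter APIs.

```python
from typing import Any, Dict, List

EFA_RESOURCE = "vpc.amazonaws.com/efa"

# Hypothetical sample records shaped like entries of
# DescribeInstanceTypes()["InstanceTypes"]; values are illustrative only.
SAMPLE_INSTANCE_TYPES: List[Dict[str, Any]] = [
    {"InstanceType": "example-noefa.xlarge",
     "NetworkInfo": {"EfaSupported": False}},
    {"InstanceType": "example-efa.24xlarge",
     "NetworkInfo": {"EfaSupported": True,
                     "EfaInfo": {"MaximumEfaInterfaces": 4}}},
]


def efa_capacity(info: Dict[str, Any]) -> int:
    """Number of EFA devices an instance type could expose (0 if unsupported)."""
    net = info.get("NetworkInfo", {})
    if not net.get("EfaSupported", False):
        return 0
    return net.get("EfaInfo", {}).get("MaximumEfaInterfaces", 0)


def efa_enabled_types(requested: int,
                      catalog: List[Dict[str, Any]]) -> List[str]:
    """Instance types whose advertised EFA capacity covers the pod's request."""
    return [t["InstanceType"] for t in catalog if efa_capacity(t) >= requested]


if __name__ == "__main__":
    # A pod requesting vpc.amazonaws.com/efa: 1, as in the manifest above.
    print(efa_enabled_types(1, SAMPLE_INSTANCE_TYPES))  # ['example-efa.24xlarge']
```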

njtran commented 1 year ago

IIUC, this falls under the scope of custom resources, which should be handled with https://github.com/aws/karpenter/pull/2390. @jonathan-innis can you confirm this?

sftim commented 1 year ago

We could also look to add early support for Dynamic Resource Allocation. See https://kubernetes.io/blog/2022/12/15/dynamic-resource-allocation/

EFAs are an interesting bit of (virtual) hardware because the OS-bypass networking can only happen within the same subnet. That then could mean that the scheduler needs to be aware of that limitation in order to place Pods appropriately.

There are some other considerations, such as which security group to use for the EFAs. I could imagine a large cluster having more than one kind of EFA, perhaps with each kind associated with a different security group.
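Because of the subnet limitation, a scheduler would have to treat a set of communicating EFA pods as a unit. A minimal sketch of such a check, assuming a hypothetical `Node` record: it accepts a placement only if every node is in the same subnet and all nodes share at least one security group (assumed here to be the group that allows all traffic to and from itself). The subnet and security-group IDs are made up.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Node:
    # Hypothetical node record; field names are illustrative.
    name: str
    subnet_id: str
    security_groups: FrozenSet[str]


def efa_group_placement_ok(nodes: List[Node]) -> bool:
    """Check whether nodes hosting one EFA job could use OS-bypass together.

    Assumes two conditions: all nodes share a single subnet, and at least
    one security group is common to every node.
    """
    if not nodes:
        return True
    if len({n.subnet_id for n in nodes}) != 1:
        return False
    common = frozenset.intersection(*(n.security_groups for n in nodes))
    return bool(common)


if __name__ == "__main__":
    a = Node("a", "subnet-1", frozenset({"sg-efa"}))
    b = Node("b", "subnet-1", frozenset({"sg-efa", "sg-web"}))
    c = Node("c", "subnet-2", frozenset({"sg-efa"}))
    print(efa_group_placement_ok([a, b]))  # True
    print(efa_group_placement_ok([a, c]))  # False: different subnets
```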

bwagner5 commented 1 year ago


As a first step, it might be okay to leave the single-subnet setup to the user: they could configure a single provisioner with EFA support. Placement groups would also need to be set up, so it may be convenient to configure both at the Provisioner level and then target that provisioner. Security groups would be set up at the provisioner level as usual.
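Under this first-step approach, the user defines one provisioner dedicated to EFA, pinned to a single subnet, a placement group, and its security groups, and EFA pods are steered to it. The sketch below models that routing with a hypothetical `ProvisionerConfig` type. It does not reflect Karpenter's actual Provisioner schema, and every field name and ID is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

EFA_RESOURCE = "vpc.amazonaws.com/efa"


@dataclass
class ProvisionerConfig:
    # Hypothetical model of provisioner-level settings; not Karpenter's schema.
    name: str
    efa_enabled: bool = False
    subnet_ids: List[str] = field(default_factory=list)
    placement_group: Optional[str] = None
    security_group_ids: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """An EFA provisioner must pin exactly one subnet and a placement group."""
        if self.efa_enabled:
            if len(self.subnet_ids) != 1:
                raise ValueError(f"{self.name}: EFA requires exactly one subnet")
            if self.placement_group is None:
                raise ValueError(f"{self.name}: EFA requires a placement group")


def select_provisioner(pod_requests: Dict[str, int],
                       provisioners: List[ProvisionerConfig]) -> Optional[str]:
    """Route EFA pods only to EFA provisioners, and other pods elsewhere."""
    wants_efa = pod_requests.get(EFA_RESOURCE, 0) > 0
    for p in provisioners:
        p.validate()
        if p.efa_enabled == wants_efa:
            return p.name
    return None


if __name__ == "__main__":
    provisioners = [
        ProvisionerConfig("default", subnet_ids=["subnet-a", "subnet-b"]),
        ProvisionerConfig("efa", efa_enabled=True, subnet_ids=["subnet-a"],
                          placement_group="efa-cluster-pg",
                          security_group_ids=["sg-efa"]),
    ]
    print(select_provisioner({EFA_RESOURCE: 1}, provisioners))  # efa
    print(select_provisioner({"cpu": 1}, provisioners))  # default
```

Keeping non-EFA pods off the EFA provisioner is a design choice in this sketch: it reserves the single-subnet, placement-group capacity for the workloads that actually need it.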

iankouls-aws commented 1 year ago

Until this feature is implemented, a temporary workaround is documented here: https://github.com/aws-samples/aws-do-eks/tree/main/Container-Root/eks/deployment/karpenter#how-to-use-kaprenter-with-efa

This is by no means a design for EFA support in Karpenter: it uses a custom launch template and mounts the EFA device into the pod, which requires privileged mode. The goal of this feature should be for pods to be able to request vpc.amazonaws.com/efa: 1, and for Karpenter to understand that request and launch EC2 instances that both suit the pods AND have EFA enabled.

bredr commented 1 year ago

This feature is really important for making distributed training on EKS viable and is currently a bottleneck for us when training large models. It's especially hard to get the launch template working well in combination with mounting NVMe disks.