cloudposse / terraform-aws-eks-node-group

Terraform module to provision a fully managed AWS EKS Node Group

Inf1 instances created by managed node groups missing resources #115

Closed Kieran-Bacon closed 2 years ago

Kieran-Bacon commented 2 years ago

Description

(This is a copy of an issue also filed against terraform-aws-modules/terraform-aws-eks; see the resolution link at the bottom.)

When creating a managed node group of inf1 instances through the EKS module, the nodes do not expose their Neuron chips as schedulable resources, so workloads that require the chip cannot be scheduled. Even if that scheduling requirement is removed, the Neuron drivers are not installed on the node itself.
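For context, here is a minimal pod spec that should be schedulable onto these nodes (a sketch only; the pod name and image are placeholders, the toleration matches the taint configured in the Terraform below, and aws.amazon.com/neuron is the extended resource the device plugin is supposed to advertise):

apiVersion: v1
kind: Pod
metadata:
  name: neuron-test   # hypothetical name, for illustration only
spec:
  tolerations:
  - key: compute
    value: neuron
    effect: NoSchedule
  containers:
  - name: app
    image: public.ecr.aws/docker/library/busybox:latest  # placeholder image
    command: ["sleep", "3600"]
    resources:
      limits:
        # Unsatisfiable while the node reports no neuron capacity,
        # so the pod stays Pending
        aws.amazon.com/neuron: 1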

The online tutorials on how to set up inf1 nodes only show using eksctl to create the node group, which suggests eksctl performs some hidden configuration. I have been through what should be identical clusters, one created via Terraform and one via eksctl, and compared the configuration of the various parts to find where the issue arises, but I have made no progress.
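For reference, the eksctl equivalent from the tutorials looks roughly like this (a sketch; the cluster name is a placeholder). Notably, eksctl selects an EKS-optimized AMI appropriate for the instance type automatically, which may be the hidden configuration referred to above:

# eksctl ClusterConfig sketch; metadata values are placeholders
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: inf1-test
  region: eu-west-2
managedNodeGroups:
  - name: model-inference
    instanceType: inf1.xlarge
    minSize: 1
    desiredSize: 1
    maxSize: 10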

I suspect the AMI being used is the issue (the eksctl AMI comes with the Neuron drivers installed, while the one chosen here does not), but when I tried using the eksctl AMI, node creation failed with a create_failed error.

Versions

Terraform v1.1.9 on windows_386

Reproduction Code [Required]

  1. Create a project with the following EKS cluster:
module "eks" {
    source = "terraform-aws-modules/eks/aws"

    cluster_name = var.cluster_name

    vpc_id = var.vpc_id
    subnet_ids = var.vpc_subnets_all

    # Use the pre-configured role
    create_iam_role = false
    iam_role_arn = var.cluster_role_arn

    eks_managed_node_group_defaults = {
        create_iam_role = false
        iam_role_arn    = var.node_group_role_arn
    }

    node_security_group_additional_rules = {
        egress_all = {
            description      = "Node all egress"
            protocol         = "-1"
            from_port        = 0
            to_port          = 0
            type             = "egress"
            cidr_blocks      = ["0.0.0.0/0"]
            ipv6_cidr_blocks = ["::/0"]
        }
    }

    eks_managed_node_groups = {

       ...

        model-inference = {
            instance_types = ["inf1.xlarge"] # inf1.6xlarge
            min_size     = 1
            desired_size = 1
            max_size     = 10
            subnet_ids = var.vpc_subnets_private
            capacity_type = "SPOT"
            labels = {
                subnet_privacy_type = "private"
                target              = "models"
            }

            # Only allow pods that have a toleration for this in (this is how you select compute neuron)
            taints = [
                {
                    key    = "compute"
                    value  = "neuron"
                    effect = "NO_SCHEDULE"
                }
            ]
        }
    }
}

  2. Build the infrastructure
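Nothing non-standard is needed for this step; the usual Terraform workflow applies:

terraform init
terraform apply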

  3. Deploy the Neuron device plugin DaemonSet as described in https://awsdocs-neuron.readthedocs-hosted.com/en/v1.12.0/neuron-deploy/tutorial-k8s.html

I have included the ServiceAccount definition here and added a toleration for the taint I defined above.

apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: neuron-device-plugin.addons.k8s.io
    k8s-app: neuron-device-plugin
  name: neuron-device-plugin
  namespace: kube-system
---
# https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: neuron-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: neuron-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: neuron-device-plugin-ds
    spec:
      serviceAccount: neuron-device-plugin
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: compute
        operator: Exists
        effect: NoSchedule
      - key: aws.amazon.com/neuron
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "beta.kubernetes.io/instance-type"
                    operator: In
                    values:
                      - inf1.xlarge
                      - inf1.2xlarge
                      - inf1.6xlarge
                      - inf1.24xlarge
              - matchExpressions:
                  - key: "node.kubernetes.io/instance-type"
                    operator: In
                    values:
                      - inf1.xlarge
                      - inf1.2xlarge
                      - inf1.6xlarge
                      - inf1.24xlarge
      containers:
        # Device Plugin containers are available in both the us-east and
        # us-west ECR repositories
      - image: 790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:latest
        imagePullPolicy: Always
        name: neuron-device-plugin
        env:
        - name: KUBECONFIG
          value: /etc/kubernetes/kubelet.conf
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
          - name: infa-map
            mountPath: /run
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: infa-map
          hostPath:
            path: /run
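A quick way to confirm the plugin pod actually landed on the inf1 node before describing it (assuming kubectl is configured against this cluster; the label selector matches the DaemonSet's pod template above):

kubectl -n kube-system get daemonset neuron-device-plugin-daemonset
kubectl -n kube-system get pods -l name=neuron-device-plugin-ds -o wide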
  4. See that the DaemonSet pods are deployed on the instances, then describe the node:
Name:               ip-****.eu-west-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=inf1.xlarge
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=SPOT
                    eks.amazonaws.com/nodegroup=model-inference-20220504180239905900000009
                    eks.amazonaws.com/nodegroup-image=ami-0ea8b161ec7a54986
                    eks.amazonaws.com/sourceLaunchTemplateId=lt-0a488da3d2a2d5d13
                    eks.amazonaws.com/sourceLaunchTemplateVersion=3
                    failure-domain.beta.kubernetes.io/region=eu-west-2
                    failure-domain.beta.kubernetes.io/zone=eu-west-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-*****.eu-west-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=inf1.xlarge
                    subnet_privacy_type=private
                    target=models
                    topology.kubernetes.io/region=eu-west-2
                    topology.kubernetes.io/zone=eu-west-2b
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 06 May 2022 17:08:12 +0100
Taints:             compute=neuron:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-*****.eu-west-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Sat, 07 May 2022 12:02:25 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sat, 07 May 2022 11:57:56 +0100   Fri, 06 May 2022 17:08:11 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sat, 07 May 2022 11:57:56 +0100   Fri, 06 May 2022 17:08:11 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sat, 07 May 2022 11:57:56 +0100   Fri, 06 May 2022 17:08:11 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sat, 07 May 2022 11:57:56 +0100   Fri, 06 May 2022 17:08:31 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   ****
  Hostname:     ip-*****.eu-west-2.compute.internal
  InternalDNS:  ip-******.eu-west-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7845076Ki
  pods:                        38
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3920m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      7053524Ki
  pods:                        38
System Info:
  Machine ID:                 ec2e89510c5d7840fa0d56d16d72da12
  System UUID:                ec2e8951-0c5d-7840-fa0d-56d16d72da12
  Boot ID:                    1a02ba39-79bf-4d65-b48b-f71d31ba82d4
  Kernel Version:             5.4.188-104.359.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.13
  Kubelet Version:            v1.21.5-eks-9017834
  Kube-Proxy Version:         v1.21.5-eks-9017834
ProviderID:                   aws:///eu-west-2b/i-00b082ff750e4149f
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                         ------------  ----------  ---------------  -------------  ---
  default                     image-processor-7bfd945799-4c449    3500m (89%)   4 (102%)    6Gi (89%)        6Gi (89%)      18h
  kube-system                 aws-node-h8kcv                               25m (0%)      0 (0%)      0 (0%)           0 (0%)         18h
  kube-system                 kube-proxy-9sh46                             100m (2%)     0 (0%)      0 (0%)           0 (0%)         18h
  kube-system                 neuron-device-plugin-daemonset-97cmk         0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         3625m (92%)  4 (102%)
  memory                      6Gi (89%)    6Gi (89%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:                       <none>
  5. See that the aws.amazon.com/neuron resource is missing and hugepages-2Mi is not set.

Expect to see

Capacity:
  attachable-volumes-aws-ebs:  39
  aws.amazon.com/neuron:       1
  cpu:                         4
  ephemeral-storage:           52416492Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               256Mi
  memory:                      7847964Ki
  pods:                        38
Allocatable:
  attachable-volumes-aws-ebs:  39
  aws.amazon.com/neuron:       1
  cpu:                         3920m
  ephemeral-storage:           47233297124
  hugepages-1Gi:               0
  hugepages-2Mi:               256Mi
  memory:                      6794268Ki
  pods:                        38

Expected behavior

Nodes in the model-inference node group should expose the resources of their instance type (the aws.amazon.com/neuron device and the associated hugepages-2Mi) once started.

Actual behavior

The node does not have the Neuron device available for scheduling, and even if that scheduling requirement is removed, the device does not actually work on the node.

Switching the AMI to ami-037d069dbf7d0c1bb, an AMI used by eksctl, causes node group creation to fail.
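That failure mode is consistent with how the module handles custom AMIs: when ami_id is set, the module stops injecting the EKS bootstrap user data, so nodes never join the cluster and the node group times out with create_failed. A sketch of what the node group entry would need (attribute names are from terraform-aws-modules/eks v18; treat the exact values as assumptions):

        model-inference = {
            instance_types = ["inf1.xlarge"]

            # With a custom AMI the module no longer supplies bootstrap
            # user data, so it must be re-enabled explicitly or the
            # nodes never register with the cluster.
            ami_id                     = "ami-037d069dbf7d0c1bb"
            enable_bootstrap_user_data = true
        }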

Terminal Output Screenshot(s)

[screenshot of the create_failed error]
Kieran-Bacon commented 2 years ago

Problem solved here: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/2060
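For readers who do not follow the link: the resolution there appears to come down to AMI selection. The default ami_type, AL2_x86_64, does not ship the Neuron drivers, while the accelerated EKS-optimized AMI does. A minimal sketch, assuming terraform-aws-modules/eks v18 attribute names (the linked thread is authoritative):

        model-inference = {
            instance_types = ["inf1.xlarge"]

            # The accelerated EKS-optimized AMI includes the Neuron
            # drivers; the default AL2_x86_64 AMI does not.
            ami_type = "AL2_x86_64_GPU"
        }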