awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0

Inf2 Worker Node Groups has multiple taints #478

Closed lindarr915 closed 3 months ago

lindarr915 commented 3 months ago

Description

In the node group Terraform file, the Inf2 node group has two taints, aws.amazon.com/neuron and aws.amazon.com/neuroncore [1][2].

However, the device plugin (YAML) has a toleration for aws.amazon.com/neuron only.

As a result, the device plugin daemon is not scheduled on inf2.24xlarge nodes.

      taints = [
        {
          key    = "aws.amazon.com/neuron",
          value  = "true",
          effect = "NO_SCHEDULE"
        },
        {
          key    = "aws.amazon.com/neuroncore",
          value  = "true",
          effect = "NO_SCHEDULE"
        },
      ]

      tolerations:
...
      - key: aws.amazon.com/neuron
        operator: Exists
        effect: NoSchedule
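
A pod is only scheduled onto a node if it tolerates every NoSchedule taint on that node, so a single toleration cannot cover the two taints above. One possible fix, shown here only as a sketch, is to add a second toleration to the device plugin; the alternative would be to drop the aws.amazon.com/neuroncore taint from the node group. Which option the maintainers prefer is not stated in this issue, and the elided part of the manifest is left out here.

```yaml
# Sketch only: tolerations for the Neuron device plugin pod spec.
# Assumes the node group keeps both taints shown in the Terraform snippet.
tolerations:
  # Already present in the device plugin manifest.
  - key: aws.amazon.com/neuron
    operator: Exists
    effect: NoSchedule
  # Assumed addition so the pod also tolerates the second taint.
  - key: aws.amazon.com/neuroncore
    operator: Exists
    effect: NoSchedule
```

With operator: Exists, each toleration matches its key regardless of the taint value, so both "true"-valued taints are covered.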

lindarr915 commented 3 months ago

Inf2 nodes provisioned by Karpenter carry the same two taints, so that should be fixed as well:

https://github.com/awslabs/data-on-eks/blob/main/ai-ml/trainium-inferentia/addons.tf