Closed lindarr915 closed 3 months ago
Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration.
In the node group Terraform file, the node group has two taints, `aws.amazon.com/neuron` and `aws.amazon.com/neuroncore` [1][2]:

    taints = [
      {
        key    = "aws.amazon.com/neuron",
        value  = "true",
        effect = "NO_SCHEDULE"
      },
      {
        key    = "aws.amazon.com/neuroncore",
        value  = "true",
        effect = "NO_SCHEDULE"
      },

However, the device plugin YAML has a toleration for `aws.amazon.com/neuron` only:

    tolerations:
      ...
      - key: aws.amazon.com/neuron
        operator: Exists
        effect: NoSchedule

As a result, the device plugin daemon is not scheduled on `inf2.24xlarge` nodes.
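One possible fix, sketched here as an assumption rather than a change confirmed by the maintainers, is to have the device plugin tolerate both taints. A pod is only schedulable on a node if it tolerates every `NoSchedule` taint on that node, and the EKS node group effect `NO_SCHEDULE` corresponds to the Kubernetes effect `NoSchedule`. The second entry below is the assumed addition; any other tolerations the plugin already defines are omitted.

```yaml
# Sketch only: tolerations for the device plugin pod spec.
tolerations:
  # Toleration quoted in this issue.
  - key: aws.amazon.com/neuron
    operator: Exists
    effect: NoSchedule
  # Assumed addition: tolerate the second taint set on the node group.
  - key: aws.amazon.com/neuroncore
    operator: Exists
    effect: NoSchedule
```

Removing the `aws.amazon.com/neuroncore` taint from the node group would be an alternative, but then every workload that relies on that taint to keep other pods off the nodes would be affected.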
If your request is for a new feature, please use the Feature request template.
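Before you submit an issue, please perform the following for Terraform examples:
1. Remove the local `.terraform` directory (! ONLY if state is stored remotely, which hopefully you are following that best practice!): `rm -rf .terraform/`
2. Re-initialize the project: `terraform init`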
Module version [Required]:
Terraform version:
Provider version(s):
Steps to reproduce the behavior:
Inf2 nodes launched by Karpenter have both taints as well, so this should be fixed there too:
https://github.com/awslabs/data-on-eks/blob/main/ai-ml/trainium-inferentia/addons.tf