[x] ✋ I have searched the open/closed issues and my issue is not listed.
Description
copy of issue
When creating a managed node group of inf-type instances through the EKS module, the nodes do not expose their Neuron chips as resources, so workloads that request the chip cannot be scheduled. If that resource requirement is removed so the workload can be scheduled, it turns out the node itself does not have Neuron installed.
The online tutorials on setting up inf1 only suggest using eksctl to start the nodes, which suggests that eksctl applies some configuration that is not visible. I compared two clusters that should be identical, one created via Terraform and one via eksctl, and copied the configs of the various parts across to find where the issue comes from, but I've made no progress.
The AMI in use appears to be the issue (the AMI used by eksctl has the Neuron drivers installed), but when I tried to use their AMI, the node group was flagged as create_failed.
Versions
Terraform v1.1.9 on windows_386
Reproduction Code [Required]
Build the infrastructure
Deploy the Neuron daemon set as described in https://awsdocs-neuron.readthedocs-hosted.com/en/v1.12.0/neuron-deploy/tutorial-k8s.html
I have added the service account definition here, along with a toleration for the taint that I had added.
hugepages-2Mi isn't set.
Expect to see
Expected behavior
That the node group model-inference would have the resources of the instance type once the nodes were started.
Actual behavior
The node does not have the Neuron device available for scheduling, and even if that requirement is removed, the node itself does not have the device working.
Switching the AMI to ami-037d069dbf7d0c1bb, an AMI used by eksctl, causes node creation to fail.
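One way to confirm the missing resource is to inspect each node's allocatable resources. The sketch below is only an illustration and was not part of the original setup. It assumes kubectl is configured against the cluster and that the Neuron device plugin advertises the resource as aws.amazon.com/neuron (the name may differ between plugin versions). The function names are hypothetical.

```python
import json
import subprocess

# Assumed resource name advertised by the Neuron device plugin;
# adjust it if your plugin version uses a different name.
NEURON_RESOURCE = "aws.amazon.com/neuron"


def nodes_missing_resource(node_list, resource=NEURON_RESOURCE):
    """Return the names of nodes whose allocatable resources lack `resource`.

    `node_list` is the parsed JSON of `kubectl get nodes -o json`.
    A node counts as missing the resource if the key is absent or "0".
    """
    missing = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        if allocatable.get(resource, "0") == "0":
            missing.append(node["metadata"]["name"])
    return missing


def fetch_nodes():
    """Read the node list from the current kubectl context."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    for name in nodes_missing_resource(fetch_nodes()):
        print(f"{name}: no {NEURON_RESOURCE} in allocatable resources")
```

Running this against both the Terraform-created and the eksctl-created cluster would show whether only the former lacks the resource.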
Terminal Output Screenshot(s)