eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io
Other
4.83k stars 1.39k forks source link

[Bug] Inferentia and Trainium integration tests failing #6596

Closed cPu1 closed 1 year ago

cPu1 commented 1 year ago

Inferentia and Trainium integration tests are failing with this error:

[0] (Integration) Inferentia nodes cluster with inf nodes with --install-neuron-plugin=false when adding an unmanaged nodegroup by default [It] should install the neuron device plugin
[0] /__w/eksctl-ci/eksctl-ci/eksctl/integration/tests/inferentia/inferentia_test.go:165
[0] 
[0]   [FAILED] Unexpected error:
[0]       <*errors.StatusError | 0xc000351f40>: 
[0]       daemonsets.apps "neuron-device-plugin-daemonset" not found
[0]       {
[0]           ErrStatus: {
[0]               TypeMeta: {Kind: "", APIVersion: ""},
[0]               ListMeta: {
[0]                   SelfLink: "",
[0]                   ResourceVersion: "",
[0]                   Continue: "",
[0]                   RemainingItemCount: nil,
[0]               },
[0]               Status: "Failure",
[0]               Message: "daemonsets.apps \"neuron-device-plugin-daemonset\" not found",
[0]               Reason: "NotFound",
[0]               Details: {
[0]                   Name: "neuron-device-plugin-daemonset",
[0]                   Group: "apps",
[0]                   Kind: "daemonsets",
[0]                   UID: "",
[0]                   Causes: nil,
[0]                   RetryAfterSeconds: 0,
[0]               },
[0]               Code: 404,
[0]           },
[0]       }
[0]   occurred

Tranium

Trainium nodes cluster with trn nodes with --install-neuron-plugin=false when adding an unmanaged nodegroup by default [It] should install the neuron device plugin
[0] /__w/eksctl-ci/eksctl-ci/eksctl/integration/tests/trainium/trainium_test.go:181
[0] 
[0]   [FAILED] Unexpected error:
[0]       <*errors.StatusError | 0xc0000e7e00>: 
[0]       daemonsets.apps "neuron-device-plugin-daemonset" not found
[0]       {
[0]           ErrStatus: {
[0]               TypeMeta: {Kind: "", APIVersion: ""},
[0]               ListMeta: {
[0]                   SelfLink: "",
[0]                   ResourceVersion: "",
[0]                   Continue: "",
[0]                   RemainingItemCount: nil,
[0]               },
[0]               Status: "Failure",
[0]               Message: "daemonsets.apps \"neuron-device-plugin-daemonset\" not found",
[0]               Reason: "NotFound",
[0]               Details: {
[0]                   Name: "neuron-device-plugin-daemonset",
[0]                   Group: "apps",
[0]                   Kind: "daemonsets",
[0]                   UID: "",
[0]                   Causes: nil,
[0]                   RetryAfterSeconds: 0,
[0]               },
[0]               Code: 404,
[0]           },
[0]       }
[0]   occurred
[0]   In [It] at: /__w/eksctl-ci/eksctl-ci/eksctl/integration/tests/trainium/trainium_test.go:183
Himangini commented 1 year ago

Closing this since the related PR got reverted and these failures are no longer happening.