aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Panic runtime error #5699

Closed · oliverbeesley closed 7 months ago

oliverbeesley commented 7 months ago

Description

Observed Behavior: We are seeing the Karpenter pod go into a CrashLoopBackOff state after installing v0.34.0. The logs show:

{"level":"INFO","time":"2024-02-20T15:31:48.767Z","logger":"controller.provisioner","message":"stopping controller","commit":"17d6c05"} panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1f0e4ce]

goroutine 280 [running]: github.com/aws/karpenter-provider-aws/pkg/cloudprovider.(CloudProvider).resolveNodeClassFromNodePool(0xc000d76370, {0x3156b18, 0xc001601980}, 0xc0020ee780) github.com/aws/karpenter-provider-aws/pkg/cloudprovider/cloudprovider.go:225 +0x6e github.com/aws/karpenter-provider-aws/pkg/cloudprovider.(CloudProvider).GetInstanceTypes(0xc000d76370, {0x3156b18, 0xc001601980}, 0xc0020ee780) github.com/aws/karpenter-provider-aws/pkg/cloudprovider/cloudprovider.go:150 +0x4b sigs.k8s.io/karpenter/pkg/cloudprovider/metrics.(decorator).GetInstanceTypes(0xc000d65640, {0x3156b18, 0xc001601980}, 0x31623a0?) sigs.k8s.io/karpenter@v0.34.0/pkg/cloudprovider/metrics/cloudprovider.go:141 +0x15c sigs.k8s.io/karpenter/pkg/controllers/provisioning.(Provisioner).NewScheduler(0xc000145140, {0x3156b18, 0xc001601980}, {0xc00162a800, 0x16, 0x20}, {0xc0012bd040, 0x7, 0x7}, {0x0}) sigs.k8s.io/karpenter@v0.34.0/pkg/controllers/provisioning/provisioner.go:230 +0x398 sigs.k8s.io/karpenter/pkg/controllers/provisioning.(Provisioner).Schedule(0xc000145140, {0x3156b18, 0xc001601980}) sigs.k8s.io/karpenter@v0.34.0/pkg/controllers/provisioning/provisioner.go:325 +0x2ed sigs.k8s.io/karpenter/pkg/controllers/provisioning.(Provisioner).Reconcile(0xc000145140, {0x3156b18, 0xc001601980}, {{{0x0?, 0x0?}, {0x0?, 0x0?}}}) sigs.k8s.io/karpenter@v0.34.0/pkg/controllers/provisioning/provisioner.go:129 +0x8e sigs.k8s.io/karpenter/pkg/operator/controller.(Singleton).reconcile(0xc000394960, {0x3156b18, 0xc001601980}) sigs.k8s.io/karpenter@v0.34.0/pkg/operator/controller/singleton.go:100 +0x265 sigs.k8s.io/karpenter/pkg/operator/controller.(Singleton).Start(0xc000394960, {0x3156b50, 0xc0003706e0}) sigs.k8s.io/karpenter@v0.34.0/pkg/operator/controller/singleton.go:88 +0x1f0 sigs.k8s.io/controller-runtime/pkg/manager.(runnableGroup).reconcile.func1(0xc000394a00) sigs.k8s.io/controller-runtime@v0.17.0/pkg/manager/runnable_group.go:223 +0xc8 created by sigs.k8s.io/controller-runtime/pkg/manager.(runnableGroup).reconcile in goroutine 279 sigs.k8s.io/controller-runtime@v0.17.0/pkg/manager/runnable_group.go:207 +0x19d

Expected Behavior:

Pods in a Running state.

Reproduction Steps (Please include YAML):

resource "kubectl_manifest" "nodeclass" { count = var.karpenter_version == "v0.31.1" ? 0 : 1 yaml_body = <<YAML apiVersion: karpenter.k8s.aws/v1beta1 kind: EC2NodeClass metadata: name: default spec:

Required, resolves a default ami and userdata

  amiFamily: AL2
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${var.cluster_name}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${var.cluster_name}"
  instanceProfile: "KarpenterNodeInstanceProfile-${var.cluster_name}"

YAML }

resource "kubectl_manifest" "nodepool" { count = var.karpenter_version == "v0.31.1" ? 0 : 1 yaml_body = <<YAML apiVersion: karpenter.sh/v1beta1 kind: NodePool metadata: name: default spec: template: spec: nodeClassRef: name: default requirements:

Versions:

jonathan-innis commented 7 months ago

Looks like a duplicate of #5689 to me. You should check that you have a nodeClassRef set. It's a required field, and applying a NodePool without it should generally fail.
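
For reference, a minimal sketch of a v1beta1 NodePool with the required nodeClassRef set, applied with kubectl; the names and the single requirement here are illustrative, not taken from the report above:

```sh
# Sketch: minimal v1beta1 NodePool whose nodeClassRef points at the
# "default" EC2NodeClass from the reproduction steps. The name and the
# single requirement are illustrative; omitting nodeClassRef is what was
# reported to trigger the nil pointer panic above.
kubectl apply -f - <<'EOF'
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
EOF
```

With nodeClassRef omitted, the apply should generally be rejected, per the comment above.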

jonathan-innis commented 7 months ago

Can you show the output of getting the EC2NodeClasses and NodePools on your cluster?
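
For reference, that output can be collected with standard kubectl commands, e.g.:

```sh
# Dump the v1beta1 Karpenter resources on the cluster.
kubectl get ec2nodeclasses.karpenter.k8s.aws -o yaml
kubectl get nodepools.karpenter.sh -o yaml
```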

oliverbeesley commented 7 months ago

Thanks @jonathan-innis. Following the resolution in the linked issue fixed it for me as well.

jonathan-innis commented 7 months ago

Closed, since the issue is resolved.

jonathan-innis commented 7 months ago

@oliverbeesley Just out of my own curiosity, what was the exact issue? Were you missing the spec for the NodeClaim or NodePool on the cluster?