aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Error from server: failed to prune fields: failed add back owned items: failed to convert pruned object at version karpenter.sh/v1: #6824

Closed ManuelMueller1st closed 1 month ago

ManuelMueller1st commented 2 months ago

Description

Observed Behavior: We've migrated from Karpenter 0.37.1 to 1.0.0. Now if I apply a NodePool the Karpenter pod logs the following error:

http: panic serving 10.250.97.76:40810: runtime error: invalid memory address or nil pointer dereference
goroutine 306281 [running]:
net/http.(*conn).serve.func1()
    net/http/server.go:1903 +0xbe
panic({0x277f360?, 0x4c9f9d0?})
    runtime/panic.go:770 +0x132
sigs.k8s.io/karpenter/pkg/apis/v1.(*NodeClaimTemplate).convertFrom(0xc008c34a10, {0x3476a98, 0xc0160db320}, 0xc0045a7308)
    sigs.k8s.io/karpenter@v1.0.0/pkg/apis/v1/nodepool_conversion.go:181 +0x19e
sigs.k8s.io/karpenter/pkg/apis/v1.(*NodePoolSpec).convertFrom(0xc008c34a10, {0x3476a98, 0xc0160db320}, 0xc0045a7308)
    sigs.k8s.io/karpenter@v1.0.0/pkg/apis/v1/nodepool_conversion.go:145 +0x106
sigs.k8s.io/karpenter/pkg/apis/v1.(*NodePool).ConvertFrom(0xc008c34908, {0x3476a98?, 0xc0160db320?}, {0x3458370?, 0xc0045a7200})
    sigs.k8s.io/karpenter@v1.0.0/pkg/apis/v1/nodepool_conversion.go:121 +0x15d
knative.dev/pkg/webhook/resourcesemantics/conversion.(*reconciler).convert(0xc000c27680, {0x3476a98, 0xc0160db140}, {{0xc000f09200, 0x8d6, 0x900}, {0x0, 0x0}}, {0xc008d21720, 0xf})
    knative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/resourcesemantics/conversion/conversion.go:137 +0x16d2
knative.dev/pkg/webhook/resourcesemantics/conversion.(*reconciler).Convert(0xc000c27680, {0x3476a98?, 0xc0160db0e0?}, 0xc0104f0040)
    knative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/resourcesemantics/conversion/conversion.go:57 +0x1e5
knative.dev/pkg/webhook.New.conversionHandler.func5({0x3467638, 0xc013205a40}, 0xc0119a8c60)
    knative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/conversion.go:66 +0x34a
net/http.HandlerFunc.ServeHTTP(0xc0012f0f80?, {0x3467638?, 0xc013205a40?}, 0x6c371f?)
    net/http/server.go:2171 +0x29
net/http.(*ServeMux).ServeHTTP(0xc0160daf90?, {0x3467638, 0xc013205a40}, 0xc0119a8c60)
    net/http/server.go:2688 +0x1ad
knative.dev/pkg/webhook.(*Webhook).ServeHTTP(0xc0012f0f00, {0x3467638, 0xc013205a40}, 0xc0119a8c60)
    knative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/webhook.go:310 +0xab
knative.dev/pkg/network/handlers.(*Drainer).ServeHTTP(0xc0004f9ea0, {0x3467638, 0xc013205a40}, 0xc0119a8c60)
    knative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/network/handlers/drain.go:113 +0x150
net/http.serverHandler.ServeHTTP({0x34521f0?}, {0x3467638?, 0xc013205a40?}, 0x6?)
    net/http/server.go:3142 +0x8e
net/http.(*conn).serve(0xc007fc7ef0, {0x3476a98, 0xc0012fe8d0})
    net/http/server.go:2044 +0x5e8
created by net/http.(*Server).Serve in goroutine 359
    net/http/server.go:3290 +0x4b4
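
The trace above ends in `(*NodeClaimTemplate).convertFrom`, so the panic happens while the webhook converts the object. The thread does not say which field is nil. As a minimal sketch, assuming the conversion reads through a pointer that is absent on the pruned object, the following shows the failure mode and a guarded alternative. The types `oldTemplate`, `oldNodeClassRef`, and `newTemplate` and both functions are hypothetical stand-ins, not Karpenter's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// oldNodeClassRef and oldTemplate are hypothetical stand-ins for an
// older-version object; they are NOT Karpenter's real types.
type oldNodeClassRef struct {
	Kind string
	Name string
}

type oldTemplate struct {
	NodeClassRef *oldNodeClassRef // assumed to be nil after fields are pruned
}

// newTemplate is a hypothetical stand-in for the converted object.
type newTemplate struct {
	Kind string
	Name string
}

// convertUnsafe dereferences NodeClassRef without checking it, so it panics
// with a nil pointer dereference when the field is nil.
func convertUnsafe(src *oldTemplate) newTemplate {
	return newTemplate{Kind: src.NodeClassRef.Kind, Name: src.NodeClassRef.Name}
}

// convertSafe checks the pointers first and returns an error instead of panicking.
func convertSafe(src *oldTemplate) (newTemplate, error) {
	if src == nil || src.NodeClassRef == nil {
		return newTemplate{}, errors.New("nodeClassRef is missing on the source object")
	}
	return newTemplate{Kind: src.NodeClassRef.Kind, Name: src.NodeClassRef.Name}, nil
}

func main() {
	pruned := &oldTemplate{} // NodeClassRef left nil on purpose
	if _, err := convertSafe(pruned); err != nil {
		fmt.Println("conversion rejected:", err)
	}

	full := &oldTemplate{NodeClassRef: &oldNodeClassRef{Kind: "EC2NodeClass", Name: "default"}}
	if out, err := convertSafe(full); err == nil {
		fmt.Println("converted:", out.Kind, out.Name)
	}
}
```

Returning an error would let a webhook answer with a failed conversion response rather than dropping the connection, which is easier to diagnose from the client side.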

Kubectl logs the following error:

Error from server: failed to prune fields: failed add back owned items: failed to convert pruned object at version karpenter.sh/v1: conversion webhook for karpenter.sh/v1beta1, Kind=NodePool failed: Post "https://karpenter.karpenter.svc:8443/conversion/karpenter.sh?timeout=30s": EOF

Here is the NodePool I want to apply:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    budgets:
      - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "180"
    memory: 720Gi
  template:
    metadata:
      labels:
        f3z/env: playground
        f3z/managed-by: karpenter
        f3z/nodegroup: default
        f3z/nodepool: default
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-hypervisor
          operator: In
          values:
            - nitro
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - "2"
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
            - on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - c
            - m
            - r
        - key: karpenter.k8s.aws/instance-size
          operator: NotIn
          values:
            - nano
            - micro
            - small
            - medium

Expected Behavior:

NodePool gets applied without an error.

Reproduction Steps (Please include YAML):

Apply the yaml from above with 0.37.1, and reapply it with 1.0.0.

Versions:

jmdeal commented 2 months ago

My understanding from the reproduction steps is that I should be able to reproduce this by applying the provided NodePool on 0.37.1, upgrading Karpenter to 1.0.0, and reapplying the same NodePool after the upgrade has completed. I've been unable to replicate this with the provided NodePool. Are you able to elaborate on the order of events? Specifically, could you describe what you did to upgrade to 1.0.0 and whether any other resources in the cluster changed as part of that upgrade process?

ManuelMueller1st commented 2 months ago

I noticed that the error only occurs if we use kubectl apply --server-side. We followed the https://karpenter.sh/preview/upgrading/v1-migration/ instructions to upgrade to Karpenter 1.0.0.

sherifabdlnaby commented 2 months ago

I noticed that the error only occurs if we use kubectl apply --server-side. We followed the karpenter.sh/preview/upgrading/v1-migration instructions to upgrade to Karpenter 1.0.0.

Using client-side apply mitigated the issue for us. It's not perfect for our GitOps solution, though.

Ezcyo commented 2 months ago

Hi! Same setup on our side, upgrade from 0.37.1 to 1.0.0, post-upgrade webhooks passed successfully. We are trying to apply the following NodePool:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    compatibility.karpenter.sh/v1beta1-kubelet-conversion: '{"clusterDNS":["x.x.x.x"]}'
    compatibility.karpenter.sh/v1beta1-nodeclass-reference: '{"kind":"EC2NodeClass","name":"bottlerocket","apiVersion":"karpenter.k8s.aws/v1beta1"}'
  labels:
    kustomize.toolkit.fluxcd.io/name: karpenter-node-pool
    kustomize.toolkit.fluxcd.io/namespace: karpenter
  name: default-ondemand-amd64
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: "100"
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: bottlerocket
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values:
        - c
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - c5a
        - c6a
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values:
        - "4"
        - "8"
        - "16"
      startupTaints:
      - effect: NoExecute
        key: node.cilium.io/agent-not-ready

This results in the following error during apply:

NodePool/arm-ondemand dry-run failed, error: failed to prune fields: failed add back owned items: failed to convert pruned object at version karpenter.sh/v1: conversion webhook for karpenter.sh/v1beta1, Kind=NodePool failed: Post "https://karpenter.karpenter.svc:8443/conversion/karpenter.sh?timeout=30s": EOF

And the following traceback on the karpenter controller:


karpenter-6b4bd4c96c-nb2lf controller {"level":"ERROR","time":"2024-08-28T15:10:58.539Z","logger":"webhook","message":"http: panic serving 172.23.219.89:52172: runtime error: invalid memory address or nil pointer dereference\ngoroutine 34311 [running]:\nnet/http.(*conn).serve.func1()\n\tnet/http/server.go:1903 +0xb0\npanic({0x2225100?, 0x4734a10?})\n\truntime/panic.go:770 +0x124\nsigs.k8s.io/karpenter/pkg/apis/v1.(*NodeClaimTemplate).convertFrom(0x4005cdd310, {0x2f1bb28, 0x40070d5470}, 0x4001832b08)\n\tsigs.k8s.io/karpenter@v1.0.0/pkg/apis/v1/nodepool_conversion.go:181 +0x188\nsigs.k8s.io/karpenter/pkg/apis/v1.(*NodePoolSpec).convertFrom(0x4005cdd310, {0x2f1bb28, 0x40070d5470}, 0x4001832b08)\n\tsigs.k8s.io/karpenter@v1.0.0/pkg/apis/v1/nodepool_conversion.go:145 +0xe8\nsigs.k8s.io/karpenter/pkg/apis/v1.(*NodePool).ConvertFrom(0x4005cdd208, {0x2f1bb28?, 0x40070d5470?}, {0x2efd390?, 0x4001832a00})\n\tsigs.k8s.io/karpenter@v1.0.0/pkg/apis/v1/nodepool_conversion.go:121 +0x124\nknative.dev/pkg/webhook/resourcesemantics/conversion.(*reconciler).convert(0x40005c8d80, {0x2f1bb28, 0x40070d5320}, {{0x4005e9e6c0, 0x214, 0x240}, {0x0, 0x0}}, {0x40046fa8b0, 0xf})\n\tknative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/resourcesemantics/conversion/conversion.go:137 +0x119c\nknative.dev/pkg/webhook/resourcesemantics/conversion.(*reconciler).Convert(0x40005c8d80, {0x2f1bb28?, 0x40070d52c0?}, 0x40088e69c0)\n\tknative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/resourcesemantics/conversion/conversion.go:57 +0x174\nknative.dev/pkg/webhook.New.conversionHandler.func5({0x2f0c658, 0x40047736c0}, 0x4005e25200)\n\tknative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/conversion.go:66 +0x24c\nnet/http.HandlerFunc.ServeHTTP(0x4000d18080?, {0x2f0c658?, 0x40047736c0?}, 0x1d01d10?)\n\tnet/http/server.go:2171 +0x38\nnet/http.(*ServeMux).ServeHTTP(0x40070d5170?, {0x2f0c658, 0x40047736c0}, 0x4005e25200)\n\tnet/http/server.go:2688 +0x1a4\nknative.dev/pkg/webhook.(*Webhook).ServeHTTP(0x4000d18000, {0x2f0c658, 0x40047736c0}, 0x4005e25200)\n\tknative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/webhook/webhook.go:310 +0xc4\nknative.dev/pkg/network/handlers.(*Drainer).ServeHTTP(0x40004743f0, {0x2f0c658, 0x40047736c0}, 0x4005e25200)\n\tknative.dev/pkg@v0.0.0-20231010144348-ca8c009405dd/network/handlers/drain.go:113 +0x158\nnet/http.serverHandler.ServeHTTP({0x2ef71d0?}, {0x2f0c658?, 0x40047736c0?}, 0x6?)\n\tnet/http/server.go:3142 +0xbc\nnet/http.(*conn).serve(0x400a3701b0, {0x2f1bb28, 0x4000a21290})\n\tnet/http/server.go:2044 +0x508\ncreated by net/http.(*Server).Serve in goroutine 364\n\tnet/http/server.go:3290 +0x3f0\n","commit":"5bdf9c3"}

We'll take a look at client-side apply, but this would not be ideal for the same reason as @sherifabdlnaby.

dschaaff commented 2 months ago

related https://github.com/aws/karpenter-provider-aws/issues/6867

jyotibhanot18 commented 1 month ago

What should be done?

engedaam commented 1 month ago

Closing this issue as a duplicate of https://github.com/aws/karpenter-provider-aws/issues/6867. Please follow that issue for progress.