aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

TLS handshake error from API server #6898

Open sknmi opened 1 month ago

sknmi commented 1 month ago

Description

Observed Behavior:

karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:06:16.304Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:40666: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-hzfgs controller {"level":"ERROR","time":"2024-08-30T08:07:18.550Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:58290: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:07:18.571Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:55794: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:07:18.572Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:55792: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-hzfgs controller {"level":"ERROR","time":"2024-08-30T08:08:10.419Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:43424: EOF\n","commit":"62a726c"}
karpenter-c595bb5d8-8r8jr controller {"level":"ERROR","time":"2024-08-30T08:08:10.427Z","logger":"webhook","message":"http: TLS handshake error from 10.x.x.x:52314: EOF\n","commit":"62a726c"}
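To get a sense of how often these errors occur per pod, the lines above can be tallied with a small script. This is only a sketch: it assumes each line has the form `<pod> <container> <JSON payload>`, as in the excerpt, and the function name `count_handshake_errors` is made up for illustration.

```python
import json
from collections import Counter


def count_handshake_errors(lines):
    """Count webhook TLS handshake errors per pod from prefixed JSON log lines."""
    counts = Counter()
    for line in lines:
        # Assumed layout: "<pod> <container> <json payload>".
        parts = line.strip().split(" ", 2)
        if len(parts) != 3:
            continue
        pod, _container, payload = parts
        try:
            entry = json.loads(payload)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON log entry.
            continue
        message = entry.get("message", "")
        if entry.get("logger") == "webhook" and "TLS handshake error" in message:
            counts[pod] += 1
    return counts
```

Feeding it the saved log output line by line gives a per-pod count, which makes it easier to see whether the errors line up with a fixed interval.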

Expected Behavior: No errors :)

Reproduction Steps (Please include YAML): Karpenter on fargate in karpenter namespace. These messages started to appear after upgrading to 1.0.1

Versions:

sknmi commented 1 month ago

fixed with

webhook:
  enabled: false

levinedaniel commented 1 month ago

I don't think this issue should be closed. I am seeing a similar error in my log messages and require the webhook to remain enabled to facilitate the conversion to the latest api version for my resources.

ezh commented 1 month ago

I agree with @levinedaniel. Why was the issue closed with the following as the solution?

webhook:
  enabled: false

The webhook is broken.

Hronom commented 4 weeks ago

Same, v1.0.2. Please re-open.

Is disabling the webhook an acceptable solution, or will some functionality stop working?

Hronom commented 4 weeks ago

cc @sknmi message above

sknmi commented 4 weeks ago

@Hronom reopened :)

m0untains commented 3 weeks ago

Also seeing this issue after upgrading to v0.37.3.

adawalli commented 3 weeks ago

Saw this issue on 0.37.3 and 1.0.1

AnkitBhalla22 commented 3 weeks ago

Seeing same in 1.0.2

liafizan commented 3 weeks ago

Below findings are incorrect


Here is my observation. Please let me know if this is incorrect:

Karpenter does not provide a ca-client bundle as we can see from here.

When I look at the CRD in my cluster, I can see that it has been injected with a caBundle:

    webhook:
      clientConfig:
        caBundle: Redacted...
        service:
          name: karpenter
          namespace: karpenter
          path: /conversion/karpenter.sh
          port: 8443
      conversionReviewVersions:
      - v1beta1
      - v1
  group: karpenter.sh

I believe this is happening through ca-injector. This means that the client config for this webhook has a CA bundle specified, but Karpenter uses knative to inject certificate data into the karpenter-cert secret, which comes from here.

So this means that the CAs for the CRD and the webhook do not match, hence the error. If this is correct, then maybe we can look at possible solutions.


I am still not sure how CA bundle is injected in CRD and I did see at one point that the CA bundle in secret vs CRD was different.
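For anyone who wants to check whether the two bundles really differ, here is a minimal sketch. It assumes you have already copied the base64 `caBundle` value from the CRD and the base64 CA entry from the secret's `data` (the exact key name in the secret is not stated in this thread). The function name `same_ca_bundle` is hypothetical.

```python
import base64


def same_ca_bundle(crd_ca_b64: str, secret_ca_b64: str) -> bool:
    """Return True if both base64 values decode to the same PEM content."""
    # Decode both values and ignore leading/trailing whitespace differences,
    # such as a missing final newline in one of the PEM blocks.
    crd_pem = base64.b64decode(crd_ca_b64).strip()
    secret_pem = base64.b64decode(secret_ca_b64).strip()
    return crd_pem == secret_pem
```

A `False` result would only show that the two values differed at the moment they were read; it says nothing about which side is stale.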

jmdeal commented 2 weeks ago

This appears to be the same issue we saw with our defaulting / validating webhooks previously; the original issue was closed out when those webhooks were disabled by default: https://github.com/kubernetes-sigs/karpenter/issues/718. I've been able to reproduce it, and as with that issue, there does not appear to be any actual impact on Karpenter's operation, so the errors can be safely ignored.

From the original issue:

These TLS errors appear to be related to https://github.com/kubernetes/kubernetes/issues/109022 which states that these handshake errors may be generated by some caching mechanism that is happening in the standard library that causes TLS errors on a cert rotation.

@liafizan are you still running into this? The cert is injected by knative, and I've been unable to reproduce. If you're still encountering this, I'd recommend opening a separate issue. I don't think it's related to the TLS errors we're seeing here.

I am still not sure how CA bundle is injected in CRD and I did see at one point that the CA bundle in secret vs CRD was different.

I'm going to mark this issue as solved for now, but let us know if any of you believe this issue is impacting Karpenter's ability to operate.

laserpedro commented 1 week ago

Hello @jmdeal,

After upgrading to 0.37.5 to enable deletion of the webhooks when deployed with ArgoCD, I see two things:

jmdeal commented 1 week ago

the second one is that my CRDs are not in version v1 and are still in v1beta1 so IMO the TLS handshake error is causing the conversion webhook to fail

This doesn't indicate any issue with the conversion webhook. If you're on any pre-1.0 version with the conversion webhooks, the storage version is still v1beta1. The conversion webhooks only exist on those versions to enable rollback from v1.0. Also, once you upgrade to v1, both versions will still be present on the CRD; one isn't automatically removed once all stored resources are converted. Instead, you want to look at .status.storedVersions on the CRDs. On v1.0.5+, Karpenter will remove v1beta1 from the stored versions once all CRs have been successfully migrated.
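As a rough illustration of the `.status.storedVersions` check, the sketch below inspects a CRD object that has already been loaded as a dict, for example from the JSON output of `kubectl get crd <name> -o json`. The helper name `pending_versions` and the default target of `v1` are assumptions made for this example.

```python
def pending_versions(crd: dict, target: str = "v1") -> list:
    """Return the stored versions of a CRD other than the target version."""
    # .status.storedVersions lists every version that has been used to
    # persist objects of this CRD.
    stored = crd.get("status", {}).get("storedVersions", [])
    return [version for version in stored if version != target]


# Example input shaped like the relevant part of a CRD object.
example_crd = {"status": {"storedVersions": ["v1beta1", "v1"]}}
remaining = pending_versions(example_crd)
```

An empty list means only the target version is recorded in the stored versions; anything else is still listed there.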

laserpedro commented 1 week ago

@jmdeal thank you for your answer. I misunderstood the conversion webhook and thought it was the other way around. Thanks for the clarification!

elihuj117 commented 1 week ago

We are seeing this same behavior. Upgrade from 0.37.0 to 1.0.3 (with a minor upgrade to 0.37.3 during the upgrade process). The error seems to be innocuous, but I wanted to see if there was any impact to the core functionality of Karpenter.

apurvabhandari commented 1 week ago

I have done the upgrade from 0.37.5 to 1.0.6 and still see this issue. I have enabled webhook in 0.37.5 and this error is from karpenter 1.0.6

{"level":"ERROR","time":"2024-10-09T14:27:06.147Z","logger":"webhook","message":"http: TLS handshake error from 10.214.2.206:34084: EOF\n","commit":"6174c75"}
{"level":"ERROR","time":"2024-10-09T14:27:06.319Z","logger":"webhook","message":"http: TLS handshake error from 10.214.60.56:40108: EOF\n","commit":"6174c75"}

itayvolo commented 5 days ago

+1