fabric8io / kubernetes-client

Java client for Kubernetes & OpenShift
http://fabric8.io
Apache License 2.0

Unable to create/run pods in K8s 1.28 #5707

Open alanlyne opened 8 months ago

alanlyne commented 8 months ago

Describe the bug

After upgrading our K8s clusters to 1.28, we are no longer able to run or create pods. This works fine with 1.27. We were on an older release of Fabric8 and have since upgraded, but we see exactly the same issue. The error occurs with any attempt to create or run a pod; for the purposes of this report, this is the method we are using:

client.run().inNamespace(namespace).withNewRunConfig()
    .withRestartPolicy("Never")
    .withName(name)
    .withImage(image)
    .withArgs("sh", "-c", "trap : TERM INT; sleep infinity & wait")
    .done();

After 30 seconds or so of running this, we receive the error shown below. The error is effectively the same regardless of how we create the pod.

I have looked through the issue history but was unable to find anyone else with a similar issue.

Using kubectl apply -f pod.yaml works as expected.
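
For reference, a minimal sketch of what an equivalent pod.yaml could look like for the run config above; the name and image values are placeholders standing in for the variables in the Java snippet:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod              # placeholder for the `name` variable
spec:
  restartPolicy: Never
  containers:
    - name: my-pod          # hypothetical container name
      image: my-image       # placeholder for the `image` variable
      args: ["sh", "-c", "trap : TERM INT; sleep infinity & wait"]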

Fabric8 Kubernetes Client version

6.10.0

Steps to reproduce

Attempt to create/run a pod with the latest Fabric8 version. The call fails after a few seconds. This occurs on all of our 1.28 clusters.

Expected behavior

Create/run a pod without error

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

other (please specify in additional context)

Environment

Windows

Fabric8 Kubernetes Client Logs

2024-01-15 10:32:46 ERROR Main - An error has occurred.
io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129) ~[kubernetes-client-api-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122) ~[kubernetes-client-api-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1148) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:97) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.extended.run.RunOperations.done(RunOperations.java:107) ~[kubernetes-client-api-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.extended.run.RunOperations$RunConfigNested.done(RunOperations.java:33) ~[kubernetes-client-api-6.10.0.jar:?]
        at com.qad.qo.commands.db.MariaDBClientCommand.call(MariaDBClientCommand.java:105) ~[classes/:?]
        at com.qad.qo.commands.db.SessionCommand.call(SessionCommand.java:15) ~[classes/:?]
        at com.qad.qo.commands.db.SessionCommand.call(SessionCommand.java:7) ~[classes/:?]
        at picocli.CommandLine.executeUserObject(CommandLine.java:2041) ~[picocli-4.7.5.jar:4.7.5]
        at picocli.CommandLine.access$1500(CommandLine.java:148) ~[picocli-4.7.5.jar:4.7.5]
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461) ~[picocli-4.7.5.jar:4.7.5]
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2453) ~[picocli-4.7.5.jar:4.7.5]
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2415) ~[picocli-4.7.5.jar:4.7.5]
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273) ~[picocli-4.7.5.jar:4.7.5]
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2417) ~[picocli-4.7.5.jar:4.7.5]
        at picocli.CommandLine.execute(CommandLine.java:2170) [picocli-4.7.5.jar:4.7.5]
        at com.qad.qo.Main.main(Main.java:43) [classes/:?]
Caused by: java.io.IOException: Canceled
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:753) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:97) ~[kubernetes-client-6.10.0.jar:?]
        at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42) ~[kubernetes-client-6.10.0.jar:?]
        ... 16 more
Caused by: java.io.IOException: Canceled
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:121) ~[okhttp-3.12.12.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[okhttp-3.12.12.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[okhttp-3.12.12.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257) ~[okhttp-3.12.12.jar:?]
        at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201) ~[okhttp-3.12.12.jar:?]
        at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) ~[okhttp-3.12.12.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]

Additional context

1.28.4-eks-8cb36c9

rohanKanojia commented 8 months ago

@alanlyne: Are you able to reproduce this issue on some other Kubernetes cluster? Could this be related to some cluster misconfiguration? I tried this on kind with Kubernetes 1.29.0 but couldn't reproduce it.

alanlyne commented 8 months ago

> @alanlyne: Are you able to reproduce this issue on some other Kubernetes cluster? Could this be related to some cluster misconfiguration? I tried this on kind with Kubernetes 1.29.0 but couldn't reproduce it.

Yes, I've tried on 3 clusters in total. Two were on 1.28 and both had the same issue; one of those was a clean cluster set up in AWS, and the other had some extra bloat on it (Argo CD, Linkerd, etc.). The third was a 1.27 cluster, and there it worked fine.

alanlyne commented 8 months ago

Looks like we found the issue. It seems the client was cancelling the request. Increasing every timeout setting resolved it. There is more work on our side to find the correct values, but that appears to be the cause:

Config config = new ConfigBuilder(Config.empty())
    .withConnectionTimeout(60 * 1000)
    .withRequestTimeout(60 * 1000)
    .withUploadRequestTimeout(60 * 1000)
    .withMasterUrl(eksEndpoint)
    .withOauthTokenProvider(authTokenProvider)
    .withTrustCerts()
    .build();
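
These values are in milliseconds; the client's stock default is 10 s for both the connection and request timeouts.
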
manusa commented 8 months ago

> Seems the client was cancelling the request.

Why was the client cancelling the request? Are the default timeouts too low for your setup?
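
A minimal way to check which timeouts your client is actually resolving (a sketch; Config.autoConfigure and the getters are standard fabric8 API, and values are in milliseconds):

import io.fabric8.kubernetes.client.Config;

public class PrintTimeouts {
  public static void main(String[] args) {
    // Resolve configuration the same way the client does
    // (kubeconfig, system properties, environment variables).
    Config config = Config.autoConfigure(null);
    System.out.println("connectionTimeout (ms): " + config.getConnectionTimeout());
    System.out.println("requestTimeout (ms): " + config.getRequestTimeout());
  }
}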

alanlyne commented 8 months ago

> Seems the client was cancelling the request.
>
> Why was the client cancelling the request? Are the default timeouts too low for your setup?

It seems that was the issue: increasing the request timeout alone to 40 * 1000 resolved it for us, and we have not run into the same issue since this change.
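
For anyone hitting the same symptom, a minimal sketch of that change in isolation (KubernetesClientBuilder is the standard 6.x entry point; the 40 * 1000 value is the one that worked for us):

import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class RaisedRequestTimeout {
  public static void main(String[] args) {
    // Raise only the request timeout (milliseconds); everything else
    // keeps its auto-configured value from kubeconfig/environment.
    Config config = new ConfigBuilder()
        .withRequestTimeout(40 * 1000)
        .build();
    try (KubernetesClient client = new KubernetesClientBuilder()
        .withConfig(config)
        .build()) {
      // ... create/run pods as before
    }
  }
}

The same override should also be possible without code changes via the kubernetes.request.timeout system property.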

stale[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had any activity in the last 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

chadlwilson commented 5 months ago

Has anyone figured out a root cause that might have slowed things down in some scenarios when moving from 1.27 to 1.28+? 10s is quite a long time, let alone 30s or 40s.

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had any activity in the last 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

chadlwilson commented 1 month ago

I believe this is still an issue/mystery.

rohanKanojia commented 1 month ago

@chadlwilson : Is it possible to provide more details on how to reproduce this issue?

chadlwilson commented 1 month ago

Sadly I do not personally have an environment that has experienced this.

My gut feeling is that this is actually an environment-specific EKS 1.28 or Kubernetes 1.28 issue unrelated to this client; the client's default 10s timeout is simply what surfaces it. So perhaps it's valid to close this as "cannot reproduce" and see if anyone can narrow it down.

I've tried to find changes in Kubernetes 1.28 or EKS 1.28 that might explain extremely slow pod creation but haven't found a smoking gun. My guess is something in Pod admission control, ValidatingAdmissionPolicy, or slow/problematic webhooks on the server side that is timing out on some check but eventually allowing pods to be created (or something along those lines).
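
One way to test that theory on an affected cluster would be to inspect the admission webhooks and their timeoutSeconds/failurePolicy settings (a sketch using standard kubectl):

kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations -o yaml | grep -E 'name:|timeoutSeconds:|failurePolicy:'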

rohanKanojia commented 1 month ago

I have tried this on two different clusters with Kubernetes v1.28.0, and the above-mentioned code works as expected.

I think this issue is not specific to any Kubernetes version but rather related to cluster configuration. In @alanlyne's case, increasing the request timeout resolved the issue.