Kubernetes marked as `reachable` when it is not (timeout)

axel7083 commented 3 days ago

Bug description

I have an OpenShift cluster behind a VPN, so the cluster cannot be reached when the VPN is not connected. With kubectl, the failure is not the usual immediate error; the requests time out instead:

$: kubectl get pods
E1021 17:03:47.707009   24376 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://<censored>:6443/api?timeout=32s\": dial tcp 10.31.87.18:6443: i/o timeout"
E1021 17:04:17.724441   24376 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://<censored>:6443/api?timeout=32s\": dial tcp 10.31.86.78:6443: i/o timeout"
E1021 17:04:47.743776   24376 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://<censored>:6443/api?timeout=32s\": dial tcp 10.31.86.78:6443: i/o timeout"
E1021 17:05:17.763772   24376 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://<censored>:6443/api?timeout=32s\": dial tcp 10.31.86.155:6443: i/o timeout"
E1021 17:05:47.803610   24376 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://<censored>:6443/api?timeout=32s\": dial tcp 10.31.86.155:6443: i/o timeout"
Unable to connect to the server: dial tcp 10.31.86.155:6443: i/o timeout

However inside Podman Desktop (current main is https://github.com/containers/podman-desktop/commit/6eb2c161345cc19f7868157ae65062ad8bcbbba4) I am seeing the following

Obviously nothing is visible in the Kubernetes pages, because the cluster is not reachable

Operating system

Windows 11

Installation Method

Other

Version

next (development version)

Steps to reproduce

No response

Relevant log output

main ↪️ Trying to watch deployments on the kubernetes context named ":6443/astefani" but got a connection refused, retrying the connection in 1s. FetchError: request to https://:6443/apis/apps/v1/namespaces/rhoai-internal--astefani-nb/deployments failed, reason: )
...
main ↪️ Error while fetching API groups: FetchError: request to https://:6443/apis failed, reason:

Additional context

No response

axel7083 commented 3 days ago

After some investigation, I ran Podman Desktop in a debugger to check when reachable was set to true.

And here is the trace

Inside the informer.on connect listener, we receive an undefined err:

https://github.com/containers/podman-desktop/blob/0d4ec38eedb768ae9e23a379761aad3f7b0159e5/packages/main/src/plugin/kubernetes/contexts-manager.ts#L855-L859

Here is the stack of the debugger

anonymous(), contexts-manager.ts:898
Async call from Timeout
setReachable(), contexts-manager.ts:892
setReachableDelay(), contexts-manager.ts:876
anonymous(), contexts-manager.ts:862
restartInformer(), contexts-manager.ts:925
createInformer(), contexts-manager.ts:866
createPodInformer(), contexts-manager.ts:383
createKubeContextInformers(), contexts-manager.ts:309
update(), contexts-manager.ts:238
refresh(), kubernetes-client.ts:499
Async call from await
anonymous(), contexts-manager.ts:862
restartInformer(), contexts-manager.ts:925
createInformer(), contexts-manager.ts:866
createPodInformer(), contexts-manager.ts:383
createKubeContextInformers(), contexts-manager.ts:309
update(), contexts-manager.ts:238
refresh(), kubernetes-client.ts:499
Async call from await

I am not familiar with the kubernetes npm package, but maybe it does not throw an error when we force it to start?

feloy commented 3 days ago

As far as I can remember, we never get the connection failure with the connect event, but with the error event only. We are setting to reachable when trying to connect, and unreachable when an error occurs. There is no other way, as we are not getting an event when we are effectively connected (except some ADDED events, but only if there are resources in the context).

I'm not sure I understand the output of your kubectl get pods command: are you getting the error immediately, or after 30s?

When the error happens immediately after the connect (which is the case with kind, or with a cluster whose machine is accessible), the reachable status is immediately overridden by the error status, so we never see it. But if the error comes after 30s, the cluster is shown as reachable for those 30s.

feloy commented 3 days ago

More info at: https://github.com/containers/podman-desktop/issues/7629

axel7083 commented 3 days ago

When we call informer.start(), we never receive an error:

the start method calls the doneHandler with a null value

https://github.com/kubernetes-client/javascript/blob/548a174400a9c215b5078efe9e3c1646e52162be/src/cache.ts#L54-L57

then sends undefined to all connect listeners

https://github.com/kubernetes-client/javascript/blob/548a174400a9c215b5078efe9e3c1646e52162be/src/cache.ts#L159

From my understanding, the problem is the following:

The informer emits the connect event with no error before trying the listFn function, which is the call that would time out. This means we should probably not set the cluster as reachable from inside the connect listener.
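To make that ordering concrete, here is a minimal simulation. `FakeInformer` and `simulateUnreachableCluster` are hypothetical names made up for this illustration; this is not the real @kubernetes/client-node class, and the list call is collapsed into a synchronous throw, whereas the real one is asynchronous and only fails after the network timeout.

```typescript
type InformerEvent = 'connect' | 'error';
type Listener = (err?: unknown) => void;

// Hypothetical stand-in for an informer, only meant to show the event ordering.
class FakeInformer {
  private readonly listeners: Record<InformerEvent, Listener[]> = { connect: [], error: [] };
  private readonly listFn: () => void;

  constructor(listFn: () => void) {
    this.listFn = listFn;
  }

  on(event: InformerEvent, listener: Listener): void {
    this.listeners[event].push(listener);
  }

  start(): void {
    // "connect" is emitted before any request is attempted, so err is undefined.
    for (const listener of this.listeners.connect) listener(undefined);
    try {
      // Only now is the cluster actually contacted.
      this.listFn();
    } catch (err) {
      for (const listener of this.listeners.error) listener(err);
    }
  }
}

// Records what a connect/error listener pair would do for an unreachable cluster.
function simulateUnreachableCluster(): string[] {
  const timeline: string[] = [];
  const informer = new FakeInformer(() => {
    throw new Error('i/o timeout');
  });
  informer.on('connect', err => {
    if (err === undefined) timeline.push('reachable=true');
  });
  informer.on('error', () => {
    timeline.push('reachable=false');
  });
  informer.start();
  return timeline;
}
```

In this simulation the cluster is first recorded as reachable and only then as unreachable, even though no request ever succeeded.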

axel7083 commented 3 days ago

Here is a diagram of what is happening

sequenceDiagram
    Context-Manager-->>Informer: register connect listener
    Context-Manager-->>Informer: register error listener
    loop Forever
        Context-Manager->>Informer: start()
        Informer->>Context-Manager: call connect listener (no error)
        Context-Manager-->>Context-Manager: set reachable true
        Note right of Informer: a few seconds later
        Informer->>Context-Manager: call error listener (timeout)
        Context-Manager-->>Context-Manager: set reachable false
    end

To give a better illustration, here is an accelerated video of what is happening visually

https://github.com/user-attachments/assets/6823a258-e27b-42b2-85e7-c4cf444e7be0

feloy commented 3 days ago

> When we call informer.start(), we never receive an error:
>
> the start method calls the doneHandler with a null value
>
> https://github.com/kubernetes-client/javascript/blob/548a174400a9c215b5078efe9e3c1646e52162be/src/cache.ts#L54-L57
>
> then sends undefined to all connect listeners
>
> https://github.com/kubernetes-client/javascript/blob/548a174400a9c215b5078efe9e3c1646e52162be/src/cache.ts#L159
>
> From my understanding, the problem is the following:
>
> The informer emits the connect event with no error before trying the listFn function, which is the call that would time out. This means we should probably not set the cluster as reachable from inside the connect listener.

Yes, this is what I wanted to explain.

The problem is that we never receive an event telling us that we are effectively connected. The only way would be to consider ourselves connected after some timeout if we did not receive an error (or if we receive ADDED events, but this does not happen in contexts where there are no pods). Or we could use a direct HTTP request as in #7629, where we would get an acknowledgement of the connection.
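As a rough sketch of the timeout idea (not the actual contexts-manager code), the state could be tracked as below. `ReachabilityTracker`, its method names, and the 35000 ms grace period used in the usage note are assumptions made up for illustration; timestamps are passed in explicitly so the logic does not depend on real timers.

```typescript
type ReachabilityState = 'unknown' | 'pending' | 'reachable' | 'unreachable';

// Hypothetical helper: "connect" only opens a grace period, it does not mean reachable.
class ReachabilityTracker {
  private state: ReachabilityState = 'unknown';
  private connectAtMs = 0;
  private readonly graceMs: number;

  constructor(graceMs: number) {
    this.graceMs = graceMs;
  }

  // Called from the connect listener: a connection attempt has started.
  onConnect(nowMs: number): void {
    this.state = 'pending';
    this.connectAtMs = nowMs;
  }

  // Called from the error listener.
  onError(): void {
    this.state = 'unreachable';
  }

  // Called when an ADDED event arrives: the cluster definitely answered.
  onAdded(): void {
    this.state = 'reachable';
  }

  // Reachable if confirmed by ADDED, or if no error arrived during the grace period.
  isReachable(nowMs: number): boolean {
    if (this.state === 'reachable') return true;
    return this.state === 'pending' && nowMs - this.connectAtMs >= this.graceMs;
  }
}
```

With something like `new ReachabilityTracker(35000)`, a cluster that times out after about 30s would never be reported as reachable, at the cost of waiting out the grace period on a healthy context that has no resources.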

feloy commented 3 days ago

I don't think we can reasonably ask for changes to the informer behaviour. This implementation is based on the Go implementation, and the maintainers try to keep the two in sync; I'm pretty sure they are happy with the current behaviour. The best change we could make, IMHO, would be to check connectivity with a simple HTTP request (#7629, or some get-version request).
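A minimal sketch of such a check, assuming a request to the API server's `/version` endpoint with a short timeout. The names `isClusterReachable`, `versionUrl`, and `FetchLike` are hypothetical, and this is not necessarily what #7629 will implement. The fetch function is injected so the logic can be exercised without a cluster; a real check would also need the CA and credentials from the kubeconfig, which this sketch leaves out.

```typescript
// Minimal shape of the fetch function this sketch needs (the global fetch fits it).
type FetchLike = (url: string, init: { signal: AbortSignal }) => Promise<{ status: number }>;

// Builds the /version URL, tolerating trailing slashes on the server address.
function versionUrl(server: string): string {
  return `${server.replace(/\/+$/, '')}/version`;
}

// Resolves to true if the server answers within timeoutMs, false otherwise.
async function isClusterReachable(
  server: string,
  timeoutMs: number,
  fetchFn: FetchLike,
): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    await fetchFn(versionUrl(server), { signal: controller.signal });
    // Any HTTP response, even 401 or 403, proves the API server answered.
    return true;
  } catch {
    // Network failure, or the request was aborted by the timeout.
    return false;
  } finally {
    clearTimeout(timer);
  }
}
```

A caller could run `await isClusterReachable(server, 5000, fetch)` before marking a context as reachable, so an unreachable VPN-only cluster is reported after 5s rather than after the informer's own timeout.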

axel7083 commented 2 days ago

thanks @feloy for the explanations and details 👍

Keeping this open, as it is a problem on its own, but should be resolved when https://github.com/containers/podman-desktop/issues/7629 is implemented
