eclipse-che / che

Kubernetes based Cloud Development Environments for Enterprise Teams
http://eclipse.org/che
Eclipse Public License 2.0

Healthz bad gateway when pod/service ip take time to propagate #23067

Open batleforc opened 1 month ago

batleforc commented 1 month ago

Describe the bug

During the startup of a workspace's pod, the IP sometimes takes time to propagate, and the two immediate calls to the healthz endpoint come back with a bad gateway.

Che version

7.88

Steps to reproduce

  1. Start a DevSpaces/Eclipse Che environment on an OpenShift/Kubernetes cluster that can take some time to propagate the IP address of the service/pod.
  2. Start a workspace.
  3. If you are lucky, the workspace will start within seconds; if not, it will take approximately 5 to 10 minutes or more to start.

Expected behavior

Don't wait 5 extra minutes when the cluster propagates the corresponding IP quickly (as in most of our cases), and only wait the extra 5 minutes when side resources genuinely take time to load.

Runtime

Kubernetes (vanilla), OpenShift

Screenshots

No response

Installation method

chectl/latest, chectl/next, OperatorHub

Environment

Linux, Amazon

Eclipse Che Logs

No response

Additional context

No response

ibuziuk commented 1 month ago

@batleforc exposure of the route should be relatively fast. Could you please clarify when exactly you are facing this issue (5 to 10 min for the route to be accessible)? The default hard startup timeout is 5 minutes, and at this point we do not plan to change it.

batleforc commented 1 month ago

Due to this case, we raised the timeout to 900s. In theory it should be fast, but I encountered the case both on OpenShift on AWS (with ~7 users) and on Kubernetes on bare metal (1 to 4 users). The first two calls to the healthz endpoint immediately return a bad gateway from the main gateway, and the user then has to wait at least 5 minutes (the 10-minute case isn't narrowed down precisely yet, but we need to reduce this one first).

We found out that the propagation of the service's IP to the targeted pod takes some time, sometimes completing a little after the pod is up but not soon enough for the backend. That's why I added a small retry in https://github.com/eclipse-che/che-operator/pull/1874 that should cover the propagation time, but I would love to make it a parameter that the end user could tune in case of a pretty slow CNI.

batleforc commented 1 month ago

To debug this, we used the different pods to trace the full chain that acknowledges the deployment is ready for the next startup step. We saw that we either need to add a small delay between the two health calls on the backend side, or add a retry directly in the gateway. (Needs testing by replacing the different elements in the cluster.)

batleforc commented 1 week ago

Is it possible to get help checking whether the change added in https://github.com/eclipse-che/che-operator/pull/1874 can fix the problem we encounter? (Mostly building the image, and possibly a suggestion on how to make the healthz retry configurable: https://github.com/eclipse-che/che-operator/pull/1874/files#diff-ebca2eefe12f7ba4a722c53d574ba1b2adee412909da8cdbc974c8f7fcbfb02fR655 ?)

tolusha commented 1 week ago

Hello. Please try this image based on the PR: quay.io/abazko/operator:23067

tolusha commented 2 days ago

Hello @batleforc Does it work for you?

batleforc commented 2 days ago

Hello @tolusha, I've set it up, but I think I need to fine-tune the initial interval.

batleforc commented 2 days ago

Is the provided image (quay.io/abazko/operator:23067) automatically updated?

tolusha commented 2 days ago

Unfortunately no. You can build the image with the following command: `make docker-build docker-push IMG=<IMAGE_NAME> SKIP_TESTS=true`

batleforc commented 1 day ago

So, the build seems okay, but I encounter a `Client.Timeout exceeded while awaiting headers` error, and I can't find where the devworkspace-controller-manager makes the call to the healthz endpoint.
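For reference, `Client.Timeout exceeded while awaiting headers` is the error Go's `net/http` client produces when its `Timeout` elapses after the TCP connection succeeds but before response headers arrive, i.e. the service IP routes but the backend is not answering yet. A minimal sketch reproducing it (the listener, durations, and function name are illustrative, not the controller's actual configuration):

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
	"time"
)

// reproduceTimeout dials a listener that accepts connections but never
// replies, so the client's Timeout fires while awaiting response headers.
func reproduceTimeout() bool {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false
	}
	defer ln.Close()
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		time.Sleep(500 * time.Millisecond) // hold the connection open, send nothing
		conn.Close()
	}()

	client := &http.Client{Timeout: 100 * time.Millisecond}
	_, err = client.Get("http://" + ln.Addr().String() + "/healthz")
	return err != nil &&
		strings.Contains(err.Error(), "Client.Timeout exceeded while awaiting headers")
}

func main() {
	fmt.Println(reproduceTimeout())
}
```

So the error points at whichever component sets a `Timeout` on its HTTP client for the healthz call; raising that timeout or retrying around it would be two ways to absorb slow IP propagation.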