Open batleforc opened 1 month ago
@batleforc exposure of the route should be relatively fast. Could you please clarify when exactly you are facing this issue (5/10 min for the route to be accessible)? The default hard startup timeout is 5 minutes, and at this point we do not plan to change it.
Because of this issue, we raised the timeout to 900s. In theory it should be fast, but I have hit the problem both on OpenShift on AWS (with ~7 users) and on Kubernetes on bare metal (1 to 4 users). The first two calls to the healthz endpoint return immediately with a bad gateway from the main gateway, and the user then has to wait at least 5 minutes (the 10-minute case isn't narrowed down precisely, but we need to reduce this one first).
We found that propagating the service's IP to the targeted pod takes some time, and sometimes it completes shortly after the pods are up but not soon enough for the backend. That's why I added a small retry in https://github.com/eclipse-che/che-operator/pull/1874 that should cover the propagation time, but I would love to make it a parameter that the end user could tune for a particularly slow CNI.
To debug this, we used the different pods to trace the full chain of acknowledgements that the deployment is ready for the next startup step. We saw that either we need to add a short delay between the two health calls on the backend side, or we need to add a retry directly in the gateway. (This still needs testing by replacing the different elements in the cluster.)
Is it possible to get help checking whether the change added in https://github.com/eclipse-che/che-operator/pull/1874 can fix the problem we encounter? (Mostly building the image, and possibly advice on how we can make the healthz retry configurable: https://github.com/eclipse-che/che-operator/pull/1874/files#diff-ebca2eefe12f7ba4a722c53d574ba1b2adee412909da8cdbc974c8f7fcbfb02fR655)
Hello
Please try this image based on the PR
quay.io/abazko/operator:23067
Hello @batleforc Does it work for you?
Hello @tolusha, I've set it up, but I think I need to fine-tune the initial interval.
Is the provided image (quay.io/abazko/operator:23067) automatically updated?
Unfortunately not.
You can build the image with the following command:
make docker-build docker-push IMG=<IMAGE_NAME> SKIP_TESTS=true
So, the build seems okay, but I encounter a "Client.Timeout exceeded while awaiting headers" error,
and I can't find where the devworkspace-controller-manager calls the healthz endpoint.
Describe the bug
During the startup of a workspace pod, the IP sometimes takes time to propagate, and the two calls to the healthz endpoint immediately come back with a bad gateway.
Che version
7.88
Steps to reproduce
Expected behavior
Don't wait 5 more minutes when the cluster takes only a short time to propagate the corresponding IP (as in most of our cases); wait the 5 more minutes only when side resources take time to load.
Runtime
Kubernetes (vanilla), OpenShift
Screenshots
No response
Installation method
chectl/latest, chectl/next, OperatorHub
Environment
Linux, Amazon
Eclipse Che Logs
No response
Additional context
No response