Closed batleforc closed 1 month ago
Hi @batleforc. Thanks for your PR.
I'm waiting for a devfile member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test
on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test
label.
I understand the commands that are listed here.
Setup in a working environment, with the linked PR work
/ok-to-test
HI @AObuchow, This PR is part of an issue in the Che Side where I have a problem with a slow CNI. Your gut instinct are the same as mine, the end goal would be to have it merged with a possibility to set this value but i was conflicted on how to have it match between the Che Operator and the DevWorkspace Operator. I'm totally okay on reworking this PR and if you have time I'm waiting for your guidance.
Instead of increasing the timeout, what about returning a RetryError
when the health check fails here , and handle it with checkDWError? This is so that another reconcile request would be created if the attempt to ping the health endpoint fails.
@batleforc does that work for your use case?
We do something similar when waiting for the workspace deployment to be ready: https://github.com/devfile/devworkspace-operator/blob/0055cb602028aed9a5ce86cbbf61d4f0c4377dd8/pkg/provision/workspace/deployment.go#L91
I think that it could fix the problem, totally answer my case @dkwon17, and could remove the need of changing the Che operator source code. What I really need is to not wait for Five more minutes when the IDE is already up but the CNI took too long to broadcast the IP of the Pod (it annoys me and kind of irritate the user).
And your answer wouldn't lock the operator on this action and potentially unlock the process for future action
@dkwon17, i've tried to set it up correctly, not sure if it's correct. I choose to set it with a reconcile after 1 second (could be less but don't know)
In my experience, I've seen DWO continuously re-attempt the health check when getting a 502 bad gateway response. However, in these cases you'd see a Main URL server not ready -- are you not seeing that message?
I've experienced the opposite actually, after handling all the queued reconciles, a retry doesn't happen after a health fail check
@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned: https://github.com/devfile/devworkspace-operator/blob/533d1f0510b4dcfaee8682e5e0b8103812f79cfe/controllers/workspace/devworkspace_controller.go#L497-L501
This PR is to handle the case where there is a timeout when trying to access the health check endpoint
@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned:
This PR is to handle the case where there is a timeout when trying to access the health check endpoint
@dkwon17 thank you for the confirmation, yes you're right - I believe that's the case I was thinking about.
Hi @AObuchow @dkwon17 , Didn't have time to try it yet (was pretty late), I will try to set it up in both instance that encounter the problem during the following day and will come back to you as fast as possible.
Hi @AObuchow @dkwon17 , Didn't have time to try it yet (was pretty late), I will try to set it up in both instance that encounter the problem during the following day and will come back to you as fast as possible.
Sounds great to me :) I hope this resolves your issue. FWIW: This PR is approved and can be merged (after my final request on cleaning up the commit log) :)
So, I squashed the commit, but I still need to test it on the last env that has this problem quite frequently (a Devspaces 3.16.1 updated recently)
@batleforc Thank you for updating your PR, sounds good to me. Keep us updated when you have a moment :) It's greatly appreciated.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: AObuchow, batleforc, dkwon17, ibuziuk
The full list of commands accepted by this bot can be found here.
The pull request process is described here
After 5 day of testing, the problem didn't appear again, so it's a fix. And the user are more than happy about the 1 minute at most startup :rofl:
After 5 day of testing, the problem didn't appear again, so it's a fix. And the user are more than happy about the 1 minute at most startup 🤣
I'm glad to hear this PR seems to resolve your issue :grin: Thank you so much for the great contribution @batleforc & thank you, @dkwon17, for steering this PR in the right direction with your review :)
Merging this PR now.
What does this PR do?
It change the timeout of the healthcheck client from 500ms to 3s
What issues does this PR fix or reference?
Linked to https://github.com/eclipse-che/che/issues/23067 Fixes https://github.com/devfile/devworkspace-operator/issues/1325
Is it tested? How?
In progress
PR Checklist
/test v8-devworkspace-operator-e2e, v8-che-happy-path
to trigger)v8-devworkspace-operator-e2e
: DevWorkspace e2e testv8-che-happy-path
: Happy path for verification integration with Che