devfile / devworkspace-operator

Apache License 2.0
67 stars 55 forks source link

feat: turn healthCheckHttpClient timeout from 500ms to 3s #1321

Closed batleforc closed 1 month ago

batleforc commented 2 months ago

What does this PR do?

It change the timeout of the healthcheck client from 500ms to 3s

What issues does this PR fix or reference?

Linked to https://github.com/eclipse-che/che/issues/23067 Fixes https://github.com/devfile/devworkspace-operator/issues/1325

Is it tested? How?

In progress

PR Checklist

openshift-ci[bot] commented 2 months ago

Hi @batleforc. Thanks for your PR.

I'm waiting for a devfile member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
batleforc commented 2 months ago

Setup in a working environment, with the linked PR work

AObuchow commented 2 months ago

/ok-to-test

batleforc commented 2 months ago

HI @AObuchow, This PR is part of an issue in the Che Side where I have a problem with a slow CNI. Your gut instinct are the same as mine, the end goal would be to have it merged with a possibility to set this value but i was conflicted on how to have it match between the Che Operator and the DevWorkspace Operator. I'm totally okay on reworking this PR and if you have time I'm waiting for your guidance.

dkwon17 commented 1 month ago

Instead of increasing the timeout, what about returning a RetryError when the health check fails here , and handle it with checkDWError? This is so that another reconcile request would be created if the attempt to ping the health endpoint fails.

@batleforc does that work for your use case?

We do something similar when waiting for the workspace deployment to be ready: https://github.com/devfile/devworkspace-operator/blob/0055cb602028aed9a5ce86cbbf61d4f0c4377dd8/pkg/provision/workspace/deployment.go#L91

https://github.com/devfile/devworkspace-operator/blob/0055cb602028aed9a5ce86cbbf61d4f0c4377dd8/controllers/workspace/devworkspace_controller.go#L485-L489

batleforc commented 1 month ago

I think that it could fix the problem, totally answer my case @dkwon17, and could remove the need of changing the Che operator source code. What I really need is to not wait for Five more minutes when the IDE is already up but the CNI took too long to broadcast the IP of the Pod (it annoys me and kind of irritate the user).

batleforc commented 1 month ago

And your answer wouldn't lock the operator on this action and potentially unlock the process for future action

batleforc commented 1 month ago

@dkwon17, i've tried to set it up correctly, not sure if it's correct. I choose to set it with a reconcile after 1 second (could be less but don't know)

dkwon17 commented 1 month ago

In my experience, I've seen DWO continuously re-attempt the health check when getting a 502 bad gateway response. However, in these cases you'd see a Main URL server not ready -- are you not seeing that message?

I've experienced the opposite actually, after handling all the queued reconciles, a retry doesn't happen after a health fail check

dkwon17 commented 1 month ago

@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned: https://github.com/devfile/devworkspace-operator/blob/533d1f0510b4dcfaee8682e5e0b8103812f79cfe/controllers/workspace/devworkspace_controller.go#L497-L501

This PR is to handle the case where there is a timeout when trying to access the health check endpoint

AObuchow commented 1 month ago

@AObuchow I think you're thinking about this case, where the health check endpoint is accessible, but a 502 is returned:

https://github.com/devfile/devworkspace-operator/blob/533d1f0510b4dcfaee8682e5e0b8103812f79cfe/controllers/workspace/devworkspace_controller.go#L497-L501

This PR is to handle the case where there is a timeout when trying to access the health check endpoint

@dkwon17 thank you for the confirmation, yes you're right - I believe that's the case I was thinking about.

batleforc commented 1 month ago

Hi @AObuchow @dkwon17 , Didn't have time to try it yet (was pretty late), I will try to set it up in both instance that encounter the problem during the following day and will come back to you as fast as possible.

AObuchow commented 1 month ago

Hi @AObuchow @dkwon17 , Didn't have time to try it yet (was pretty late), I will try to set it up in both instance that encounter the problem during the following day and will come back to you as fast as possible.

Sounds great to me :) I hope this resolves your issue. FWIW: This PR is approved and can be merged (after my final request on cleaning up the commit log) :)

batleforc commented 1 month ago

So, I squashed the commit, but I still need to test it on the last env that has this problem quite frequently (a Devspaces 3.16.1 updated recently)

AObuchow commented 1 month ago

@batleforc Thank you for updating your PR, sounds good to me. Keep us updated when you have a moment :) It's greatly appreciated.

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: AObuchow, batleforc, dkwon17, ibuziuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/devfile/devworkspace-operator/blob/main/OWNERS)~~ [AObuchow,dkwon17] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
batleforc commented 1 month ago

After 5 day of testing, the problem didn't appear again, so it's a fix. And the user are more than happy about the 1 minute at most startup :rofl:

AObuchow commented 1 month ago

After 5 day of testing, the problem didn't appear again, so it's a fix. And the user are more than happy about the 1 minute at most startup 🤣

I'm glad to hear this PR seems to resolve your issue :grin: Thank you so much for the great contribution @batleforc & thank you, @dkwon17, for steering this PR in the right direction with your review :)

Merging this PR now.