ManageIQ / kubeclient

A Ruby client for Kubernetes REST API
MIT License
416 stars 166 forks source link

feat: add timeout_seconds option to Kubeclient::Client#watch_entities #624

Closed jperville closed 9 months ago

jperville commented 10 months ago

This small PR adds a "timeout_seconds" option to the "Kubeclient::Client#watch_entities" method (not set by default). The objective is to limit the duration of a watch to prevent zombification if the watch lasts for too long (see #512).

We see the zombification often on Azure kubernetes cluster, where our watch stays up but does not receive notifications anymore. By setting a timeout_seconds of X seconds we are sure that we are not zombified for more than X seconds.

Happy to take feedback to improve this PR.

cben commented 10 months ago

Huh. I was expecting this would set some network timeout settings on our side (we have these configurable for non-watch actionsy; watch was originally deliberately excluded to be unlimited) — but instead I see this is sending timeoutSeconds query param to the server.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/

timeoutSeconds Timeout for the list/watch call. This limits the duration of the call, regardless of any activity or inactivity.

Do you know how this differs from client-side timeouts? If you want a timeout on a watch, when would you want it server-side / client-side / both?

In any case, if k8s API takes this param, we should support it :+1:
Please update README.md to explain this.

jperville commented 10 months ago

Hello @cben , actually I am fighting trying to workaround zombification when watching the k8s API on Azure for changes in a CRD.

After a while, watch appears to still be up but we don't receive any more updates.

My first attempt was making the watch http_client persistent (by hacking Kubeclient::Common::WatchStream#build_client method) and it didn't solve my problem. The http gem does not allow to set a keep-alive on its underlying TCP socket that I know.

My second attempt uses this PR: I pass a timeoutSeconds when creating a watch. It limits the maximum duration of a watch, reducing (but not eliminating) the zombification. I think that this PR should be merged since timeoutSeconds is part of the kubernetes watch API.

What I'm going to try next:

jperville commented 10 months ago

@cben I just updated the README to mention how to force watches to have a maximum duration with the timeout_seconds keyword arguments.

If you know a way to make watches timeout on the client side if inactive I am interested.

EDIT : I just found out about Kubeclient::Informer and I managed to code something that works (even if the Azure K8S apiserver is very aggressive with connection resets).

cben commented 9 months ago

Oh wow, I dropped the ball on this one. LGTM, merging. :+1:

Yeah, Kubeclient::Informer strives to handle watch restarts in a correct way, though it's new-ish and there might be loose ends (search issues for "informer").

[I have to confess it's been a long time since I've used kubeclient myself; I'm acting as limping-along-barely-maintainer and that's why I'm trying to "crowdsource" things like README etc to contributors but would be good if active user(s) stepped up to replace me...]

cben commented 9 months ago

Close-cycled to re-run CI. It now fails to run on ruby 2.7:

ERROR:  Error installing bundler:
    The last version of bundler (>= 0) to support your Ruby & RubyGems was 2.4.22. Try installing it with `gem install bundler -v 2.4.22`
    bundler requires Ruby version >= 3.0.0. The current ruby version is 2.7.8.225.

but this passes tests on linux 2.7 locally.