elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Race condition: k8s cached client may not be started while used in a webhook #5025

Status: Open. barkbay opened this issue 2 years ago.

barkbay commented 2 years ago

The webhook servers are started before the cached client. The reason is stated in the controller-runtime source code:

    // First start any webhook servers, which includes conversion, validation, and defaulting
    // webhooks that are registered.
    //
    // WARNING: Webhooks MUST start before any cache is populated, otherwise there is a race condition
    // between conversion webhooks and the cache sync (usually initial list) which causes the webhooks
    // to never start because no cache can be populated.

The cached client is, however, used in the Elasticsearch webhook, for example to check that the storage class supports volume expansion. This can lead to an error if the webhook is called before the cached client has started:

2021-11-08T16:11:49.285+0100    DEBUG   controller-runtime.webhook.webhooks     received request        {"service.version": "2.0.0-SNAPSHOT", "webhook": "/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch", "UID": "705ab4f5-6393-11e8-b7cc-42010a800002", "kind": "/, Kind=", "resource": {"group":"","version":"","resource":""}}
2021-11-08T16:11:49.295+0100    DEBUG   es-validation   validate update {"service.version": "2.0.0-SNAPSHOT", "name": "elasticsearch-sample"}
2021-11-08T16:11:49.295+0100    ERROR   es-validation   error while validating pvc modification, skipping validation    {"service.version": "2.0.0-SNAPSHOT", "error": "the cache is not started, can not read objects"}
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation.validPVCModification
        /Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation/volume_validation.go:120
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation.updateValidations.func1
        /Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation/validations.go:67
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation.(*validatingWebhook).validateUpdate
        /Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation/webhook.go:65
github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation.(*validatingWebhook).Handle
        /Users/michael/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/validation/webhook.go:98
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).Handle
        /Users/michael/go/src/sigs.k8s.io/controller-runtime/pkg/webhook/admission/webhook.go:146
sigs.k8s.io/controller-runtime/pkg/webhook/admission.(*Webhook).ServeHTTP
        /Users/michael/go/src/sigs.k8s.io/controller-runtime/pkg/webhook/admission/http.go:99
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerInFlight.func1
        /Users/michael/go/pkg/mod/github.com/prometheus/client_golang@v1.11.0/prometheus/promhttp/instrument_server.go:40
net/http.HandlerFunc.ServeHTTP
        /usr/local/go/src/net/http/server.go:2046
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerCounter.func1
        /Users/michael/go/pkg/mod/github.com/prometheus/client_golang@v1.11.0/prometheus/promhttp/instrument_server.go:101
net/http.HandlerFunc.ServeHTTP
        /usr/local/go/src/net/http/server.go:2046
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2
        /Users/michael/go/pkg/mod/github.com/prometheus/client_golang@v1.11.0/prometheus/promhttp/instrument_server.go:76
net/http.HandlerFunc.ServeHTTP
        /usr/local/go/src/net/http/server.go:2046
net/http.(*ServeMux).ServeHTTP
        /usr/local/go/src/net/http/server.go:2424
net/http.serverHandler.ServeHTTP
        /usr/local/go/src/net/http/server.go:2878
net/http.initALPNRequest.ServeHTTP
        /usr/local/go/src/net/http/server.go:3479
net/http.(*http2serverConn).runHandler
        /usr/local/go/src/net/http/h2_bundle.go:5823
2021-11-08T16:11:49.296+0100    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"service.version": "2.0.0-SNAPSHOT", "webhook": "/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch", "code": 200, "reason": "", "UID": "705ab4f5-6393-11e8-b7cc-42010a800002", "allowed": true}
Exiting.
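
For context, here is a minimal sketch of the kind of cached read the webhook performs when validating PVC changes. The function name and wiring are hypothetical, simplified for illustration, not the actual `validPVCModification` code:

    package validation

    import (
        "context"

        storagev1 "k8s.io/api/storage/v1"
        "k8s.io/apimachinery/pkg/types"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // allowsVolumeExpansion looks up a StorageClass and checks whether it
    // permits volume expansion. With the default manager client this Get is
    // served from the informer cache, so if it runs before the cache has
    // started it fails with "the cache is not started, can not read objects".
    func allowsVolumeExpansion(ctx context.Context, c client.Client, scName string) (bool, error) {
        var sc storagev1.StorageClass
        if err := c.Get(ctx, types.NamespacedName{Name: scName}, &sc); err != nil {
            return false, err
        }
        return sc.AllowVolumeExpansion != nil && *sc.AllowVolumeExpansion, nil
    }
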
naemono commented 1 year ago

While investigating this bug, I see two possible options to resolve it:

  1. We could detect the "cache is not started" error and simply retry for a short period of time, waiting for the cache sync to complete (see the sketch after this list).
  2. There seems to be an option for disabling cache reads for certain objects when creating a new client. This generated client could be passed only to the webhook validation logic. This option seems more dangerous, as it would add more load to the API server of any Kubernetes cluster for the lifetime of the operator.
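
A sketch of what option 1 could look like, assuming the typed `cache.ErrCacheNotStarted` error exported by controller-runtime is what surfaces here; the 500ms/5s retry window is an arbitrary illustration, not a recommendation:

    package validation

    import (
        "context"
        "errors"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // getWithRetry retries a cached Get for a short window while the informer
    // cache is still starting; any other error is returned immediately.
    func getWithRetry(ctx context.Context, c client.Client, key client.ObjectKey, obj client.Object) error {
        return wait.PollImmediate(500*time.Millisecond, 5*time.Second, func() (bool, error) {
            err := c.Get(ctx, key, obj)
            if err == nil {
                return true, nil
            }
            var notStarted *cache.ErrCacheNotStarted
            if errors.As(err, &notStarted) {
                return false, nil // cache still starting, poll again
            }
            return false, err // unrelated error, fail immediately
        })
    }
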

@barkbay Had you found a consistent way to replicate this error in a local cluster? Would simply creating an ES cluster very quickly after starting the operator replicate this?

Are there any other potential solutions that come to mind?

barkbay commented 1 year ago
  1. We could detect the "cache is not started" error and simply retry for a short period of time, waiting for the cache sync to complete.

If we wait and retry for too long, I'm wondering if it could affect the maximum number of in-flight operations for the API server (a.k.a. the "server's total concurrency limit"). Could that be the case if there is a fair amount of existing resources, slowing down the cache initialization, and a lot of validation requests arriving at the same time?
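
For reference, the limits in question are set by these kube-apiserver flags (defaults shown):

    --max-requests-inflight=400            # limit on concurrent non-mutating requests
    --max-mutating-requests-inflight=200   # limit on concurrent mutating requests

While the operator retries, the admission request that triggered the webhook stays open on the API server side, so each retrying validation would occupy one of these in-flight slots for the whole retry window.
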

  2. There seems to be an option for disabling cache reads for certain objects when creating a new client. This generated client could be passed only to the webhook validation logic. This option seems more dangerous, as it would add more load to the API server of any Kubernetes cluster for the lifetime of the operator.

I think we need to understand the load that bypassing the cache would generate on the API server at scale. But yes, we could use a dedicated client with no cache at all...
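
A sketch of that dedicated-client idea using `client.New`, which reads straight from the API server and therefore works before the cache has started; the function name is illustrative:

    package main

    import (
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/manager"
    )

    // newWebhookClient builds a client with no informer cache: every Get/List
    // is a live call to the API server. Passing it only to the webhook
    // validation logic keeps cached reads everywhere else, at the cost of
    // one live read per admission request.
    func newWebhookClient(mgr manager.Manager) (client.Client, error) {
        return client.New(mgr.GetConfig(), client.Options{
            Scheme: mgr.GetScheme(),
            Mapper: mgr.GetRESTMapper(),
        })
    }
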

Had you found a consistent way to replicate this error in a local cluster?

I can't remember tbh.

I think a third option is described in https://github.com/elastic/cloud-on-k8s/issues/5032:

As a side note, it might be a workaround for https://github.com/elastic/cloud-on-k8s/issues/5025 if we consider that having a started cached client is a requirement before allowing the elastic-webhook-server service to send traffic to the operator's Pods.
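
A sketch of that third option, assuming we gate the operator's readiness endpoint on cache sync so the elastic-webhook-server Service only routes traffic once cached reads can succeed:

    package main

    import (
        "context"
        "errors"
        "net/http"
        "time"

        "sigs.k8s.io/controller-runtime/pkg/manager"
    )

    // addCacheSyncReadyzCheck makes the /readyz endpoint fail until the
    // manager's informer cache has synced, so the webhook Service does not
    // send traffic to a Pod whose cached client would still error out.
    func addCacheSyncReadyzCheck(mgr manager.Manager) error {
        return mgr.AddReadyzCheck("cache-sync", func(req *http.Request) error {
            ctx, cancel := context.WithTimeout(req.Context(), 2*time.Second)
            defer cancel()
            if !mgr.GetCache().WaitForCacheSync(ctx) {
                return errors.New("informer cache not synced yet")
            }
            return nil
        })
    }

Because a Service only routes to ready Pods, this effectively holds admission traffic until the cached client is usable, without touching the webhook code itself.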