Open barkbay opened 2 years ago
While investigating this bug, I see two possible options to resolve it:
@barkbay Had you found a consistent way to replicate this error in a local cluster? Would simply creating an ES cluster very quickly after starting the operator replicate this?
Are there any other potential solutions that come to mind?
- We could detect the "cache is not started" error and simply retry for a short period of time, waiting for the cache sync to complete.
If we wait and retry for too long, I'm wondering whether it could affect the maximum number of in-flight operations on the API server (a.k.a. the "server's total concurrency limit"). Could that be the case if there is a fair amount of existing resources slowing down the cache initialization, and a lot of validation requests at the same time?
- There seems to be an option for disabling cache reads for certain objects when creating a new client. The resulting client could be passed only to the webhook validation logic. This option seems more dangerous, as it would add load to the API server of every Kubernetes cluster for the life of the operator.
I think we need to understand what would be the load generated at scale on the API server in this case if we want to bypass the cache. But yes, we could use a dedicated client with no cache at all...
Had you found a consistent way to replicate this error in a local cluster?
I can't remember tbh.
I think a third option is described in https://github.com/elastic/cloud-on-k8s/issues/5032:
As a side note, it might be a workaround for https://github.com/elastic/cloud-on-k8s/issues/5025 if we consider that having a started cached client is a requirement before allowing the elastic-webhook-server service to send traffic to the operator's Pods.
Webhook servers are started before the cached client is started. The reason is stated in the controller-runtime source code:
The cached client is, however, used in the Elasticsearch webhook, for example to ensure that the storage class supports volume expansion. This may lead to an error if the webhook is called while the cached client is not yet started: