Closed: rtluckie closed this issue 2 months ago
One thing you can do to work around this limitation is to create multiple node sets with the data role and scale each of them up until you start running into the 1 MiB size limit of Kubernetes Secrets, which in practice seems to be around 150-200 nodes per node set. You can then keep adding node sets until you reach the desired scale. See this issue for more context on the current model of one Secret for transport certificates per node set.
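A rough sketch of that workaround (the cluster name, version, and node counts below are illustrative, not from this thread):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: large-cluster
spec:
  version: 8.14.0
  nodeSets:
  # ECK creates one transport-certificates Secret per node set,
  # so keeping each set well under ~150-200 nodes stays below
  # the 1 MiB Kubernetes Secret size limit.
  - name: data-0
    count: 150
    config:
      node.roles: ["data"]
  - name: data-1
    count: 150
    config:
      node.roles: ["data"]
  # ...add further data-N node sets until the desired total scale is reached.
```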
@pebrc are there any plans to address this? It's been several years since the workaround was implemented. We run a very large deployment of many ES clusters (for which this operator has been fantastically helpful), so when adding some of our larger clusters, I bumped into this error. Quite a surprise, as you can imagine.
I'm wondering if we could stop reconciling that Secret if we use a CSI driver to manage the certificates, for example? (Or give the user an option to skip the reconciliation of that Secret?)
@barkbay I think that's a good idea.
@nullren we don't have concrete plans to address this right now. Did the workaround, using multiple node sets instead of one big one, have drawbacks for you that made you want to stick with a single node set?
The workaround did "work", but it adds a whole lot of unnecessary complexity for something we don't even use (we disable security and don't use the certs at all, since we use our own network framework on k8s). There's just a lot of extra tooling we have to update to ensure that node sets "data-0", "data-1", ..., "data-N" are all found and reconciled correctly. We're still finding bugs due to this.
We have implemented an option to turn off the ECK-managed self-signed certificates in https://github.com/elastic/cloud-on-k8s/pull/7925, which is going to ship with the next release of ECK. This should cover the case you mentioned, @nullren. This means we now have two workarounds for large clusters. Either:

- split the cluster into multiple smaller node sets, or
- disable the ECK-managed self-signed transport certificates and provide transport security yourself.
My vote would be to close this issue unless there are additional concerns we did not address with these changes.
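For reference, the new option is exposed in the Elasticsearch spec roughly as follows. This is a sketch based on my reading of the linked PR; the exact field names may differ in your ECK version, so check the release documentation:

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: large-cluster
spec:
  version: 8.14.0
  transport:
    tls:
      # Disables ECK's self-signed transport certificates (and with
      # them the per-node-set certificate Secrets that hit the 1 MiB
      # limit). You are then responsible for transport security,
      # e.g. via a CSI driver or your own network layer.
      selfSignedCertificates:
        disabled: true
  nodeSets:
  - name: data
    count: 300
    config:
      node.roles: ["data"]
```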
@pebrc that works for me. Thank you!
Bug Report
What did you do?
What did you expect to see?
What did you see instead? Under which circumstances?
Failed remediations
Environment

- ECK version: 2.8.0
- Kubernetes information:
  - kubectl version: v1.27.2
- Resource definition:
A continuous loop of reconciliation failures and timeouts, accompanied by the following.