pgayvallet opened 4 months ago
Pinging @elastic/kibana-core (Team:Core)
Starting the discussions by sharing my (unstructured) thoughts:
The first use case coming to mind that could benefit from this feature is around Kibana status and/or diagnostic:
Especially for the status API: if each node stored its status in this "cluster state" (name TBD) data, we could finally have a status API that returns the overall status of the Kibana cluster, rather than only the status of the node being queried. I'm not sure how much value this adds in orchestrated environments where status aggregation is done at a higher level, but it would probably at least make sense for on-prem deployments (and, as said, it could potentially bring value to all environments).
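To make the idea concrete, here is a minimal sketch of the aggregation step, assuming each node publishes a status level into the shared state. The `NodeStatus` shape and `aggregateClusterStatus` name are hypothetical, not an existing Kibana API:

```typescript
// Hypothetical per-node status record as it might appear in "cluster state".
type StatusLevel = 'available' | 'degraded' | 'unavailable';

interface NodeStatus {
  nodeId: string;
  level: StatusLevel;
}

// Rank levels so we can compare them; higher means worse.
const severity: Record<StatusLevel, number> = {
  available: 0,
  degraded: 1,
  unavailable: 2,
};

// The cluster-wide status is the worst status reported by any live node.
function aggregateClusterStatus(nodes: NodeStatus[]): StatusLevel {
  return nodes.reduce<StatusLevel>(
    (worst, node) => (severity[node.level] > severity[worst] ? node.level : worst),
    'available'
  );
}
```

A "worst wins" rule mirrors how Kibana already degrades a node's overall status from its individual service statuses, so it would likely feel consistent to operators.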
The second use case is multi-stage deployment. I don't have a great example here, but having each node know whether all live nodes in the cluster are running the same version could be useful for "automated" behavioral changes during a multi-stage deployment.
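The version check itself would be trivial once nodes can see each other's versions; a sketch, with an illustrative function name:

```typescript
// Hypothetical helper: given the versions reported by all live nodes,
// decide whether the cluster is version-homogeneous (e.g. to gate a
// behavioral change until a rolling upgrade has completed).
function allNodesOnSameVersion(nodeVersions: string[]): boolean {
  // An empty or single-entry list is trivially homogeneous.
  return new Set(nodeVersions).size <= 1;
}
```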
We've gotten very far in scaling saved object migrations by just transforming fewer documents, but we haven't fundamentally increased migration throughput. The bottleneck there is the time it takes to load one batch, transform the documents, and write them back, which causes a lot of dead time waiting on reads and writes. We can't parallelise this or increase batch sizes much, because then Kibana runs out of RAM. To improve this we'd want to shard the migration work across the Kibana nodes: each node would only transform a subset of documents, increasing transform/CPU throughput, and we'd parallelise the read, transform, write loop, better utilising Elasticsearch and minimising the cost of network latency.
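One way the sharding could work, sketched under the assumption that the discovery service tells each node its index and the total node count (all names here are illustrative): each node deterministically claims the documents whose id hashes into its partition, so every document is transformed by exactly one node with no coordination per batch.

```typescript
// Simple deterministic string hash (FNV-1a), illustrative only;
// any stable hash agreed on by all nodes would work.
function hashId(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep unsigned 32-bit
  }
  return h;
}

// A node owns a document when the id's hash falls in its partition.
// `nodeIndex` and `nodeCount` would come from the discovery service.
function ownsDocument(id: string, nodeIndex: number, nodeCount: number): boolean {
  return hashId(id) % nodeCount === nodeIndex;
}
```

The main open question with this scheme is membership changes mid-migration: if a node joins or leaves, the partitioning shifts, so the migration would need checkpointing or a fixed membership snapshot taken at the start.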
> we could finally have a status API that returns the overall status of our Kibana cluster,
+100 cluster-wide status could be very useful
Supersedes/replaces https://github.com/elastic/kibana/issues/93029
In https://github.com/elastic/kibana/issues/187696, response ops is planning to implement a discovery service for task manager, for perf / workload optimizations. That implementation is planned to remain internal to task manager.
However, we agreed that implementing such a feature as a Core service could potentially make sense, as task manager workload balancing is not the only feature that could benefit from access to an approximate state of the "Kibana cluster".
I'm opening this issue to discuss what this "discovery system" could look like as a Core service, and to confirm which other features would benefit from it:
1. See how much we would need to adapt / diverge from the implementation used by response-ops
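For discussion purposes, here is one possible shape for the contract such a Core service could expose. Everything below is a hypothetical sketch, not the response-ops implementation or an actual Kibana API:

```typescript
// Hypothetical record for a node known to the discovery service.
interface DiscoveredNode {
  id: string;
  version: string;
  lastSeen: number; // epoch ms of the node's last heartbeat
}

// Hypothetical start contract of a Core-level discovery service.
interface DiscoveryServiceStart {
  /** Approximate view of the live nodes in the Kibana "cluster". */
  getActiveNodes(): Promise<DiscoveredNode[]>;
}

// A node is considered live if it heartbeated within the timeout window;
// stale entries are pruned from the view returned to consumers.
function filterLiveNodes(
  nodes: DiscoveredNode[],
  now: number,
  timeoutMs: number
): DiscoveredNode[] {
  return nodes.filter((n) => now - n.lastSeen <= timeoutMs);
}
```

Note the "approximate" wording: since nodes would likely discover each other via periodic heartbeats persisted in Elasticsearch, consumers could only ever get an eventually consistent view, which seems acceptable for the status, version, and workload-balancing use cases discussed above.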