elastic / kibana


[Core] Kibana discovery service #188177

Open pgayvallet opened 4 months ago

pgayvallet commented 4 months ago

Supersedes / replaces https://github.com/elastic/kibana/issues/93029

In https://github.com/elastic/kibana/issues/187696, response ops is planning to implement a discovery service for task manager, for perf / workload optimizations. That implementation is planned to remain internal to task manager.

However, we agreed that implementing such a feature as a Core service could potentially make sense, as task manager workload balancing is not the only feature that could benefit from access to an approximate state of the "Kibana cluster".

I'm opening this issue to discuss what this "discovery system" could look like as a Core service, and to confirm that other features would benefit from it:

  1. List the features/consumers that could benefit from it, to evaluate the concrete need for such a service at Core's level
  2. Given 1., see how much we would need to adapt / diverge from the implementation used by response-ops
  3. Start thinking about what the API surface could look like for such a service
    1. how to consume it
    2. whether it would be possible for API consumers to "enhance" it (e.g. add more info to the "node status")
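Purely as a strawman for point 3, the contract could look something like the sketch below. All names (`DiscoveryServiceStart`, `NodeInfo`, the in-memory factory) are hypothetical, not actual Kibana APIs; the `attributes` bag is one possible shape for the "enhance the node status" extension point.

```typescript
// Hypothetical sketch of a Core discovery service contract.
// None of these names exist in Kibana today.

interface NodeInfo {
  id: string;
  version: string;
  lastSeen: number; // epoch ms of the node's last heartbeat
  // Extension point: consumers attach extra data to the "node status".
  attributes: Record<string, unknown>;
}

interface DiscoveryServiceStart {
  /** The current node's own info. */
  getLocalNode(): NodeInfo;
  /** Best-effort snapshot of all nodes believed to be alive. */
  getActiveNodes(): Promise<NodeInfo[]>;
  /** Let a consumer publish extra data under a namespaced key. */
  setAttribute(key: string, value: unknown): void;
}

// Minimal in-memory stand-in, just to exercise the shape of the API.
function createInMemoryDiscovery(local: NodeInfo): DiscoveryServiceStart {
  const nodes = new Map<string, NodeInfo>([[local.id, local]]);
  return {
    getLocalNode: () => local,
    getActiveNodes: async () => [...nodes.values()],
    setAttribute: (key, value) => {
      local.attributes[key] = value;
    },
  };
}
```

A real implementation would presumably persist this state in Elasticsearch (as response-ops plans to), with the in-memory map replaced by periodic reads of other nodes' heartbeat documents.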

elasticmachine commented 4 months ago

Pinging @elastic/kibana-core (Team:Core)

pgayvallet commented 4 months ago

Starting the discussions by sharing my (unstructured) thoughts:

The first use case that comes to mind is around Kibana status and/or diagnostics:

Especially for the status API: if we were to store each node's status in its "cluster state" (name TBD) data, we could finally have a status API that returns the overall status of our Kibana cluster, and not only the status of the node being queried. I'm not sure it would really add value in orchestrated environments where status aggregation is done at a higher level, but it would probably at least make sense for on-prem deployments (and, as said, it could potentially bring value to all environments).
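To make the aggregation concrete: if each node published its status level into its discovery data, the cluster-wide status could simply be the worst status reported by any live node. A minimal sketch, with hypothetical status levels loosely modelled on Kibana's `available` / `degraded` / `unavailable` terminology:

```typescript
// Sketch of cluster-wide status aggregation over per-node statuses
// published via the discovery data. Names are illustrative.
type StatusLevel = 'available' | 'degraded' | 'unavailable';

const severity: Record<StatusLevel, number> = {
  available: 0,
  degraded: 1,
  unavailable: 2,
};

// The overall cluster status is the worst status any node reports;
// an empty cluster snapshot defaults to 'available'.
function aggregateClusterStatus(nodeStatuses: StatusLevel[]): StatusLevel {
  return nodeStatuses.reduce(
    (worst, s) => (severity[s] > severity[worst] ? s : worst),
    'available' as StatusLevel,
  );
}
```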

Then, the second use case would be multi-stage deployments. I don't have a great example there, but I feel like having each node able to know whether all live nodes of the cluster are running the same version could be useful for "automated" behavioral changes during a multi-stage deployment.
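Assuming each node publishes its version in its discovery data, the check described above reduces to asking whether the set of live versions has more than one member. A sketch (the `NodeVersionInfo` shape is hypothetical):

```typescript
// Sketch: detect whether all live nodes run the same Kibana version,
// which could gate behavioural changes during a multi-stage deployment.
interface NodeVersionInfo {
  id: string;
  version: string;
}

function clusterVersions(nodes: NodeVersionInfo[]): Set<string> {
  return new Set(nodes.map((n) => n.version));
}

// True when zero or one distinct version is live in the cluster.
function isVersionHomogeneous(nodes: NodeVersionInfo[]): boolean {
  return clusterVersions(nodes).size <= 1;
}
```

A feature could poll this until the cluster converges on the new version before enabling new behavior, which is the "automated" switch the comment alludes to.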

rudolf commented 4 months ago

We've gotten very far in scaling saved object migrations by simply transforming fewer documents, but we haven't fundamentally increased migration throughput. The bottleneck is the time it takes to load one batch, transform its documents, and write it back; this causes a lot of dead time waiting on reads and writes. We can't parallelise this or increase batch sizes much, because Kibana then runs out of RAM. To improve this we'd want to shard the migration work across the Kibana nodes. Each node would then transform only a subset of documents, increasing transform/CPU throughput, and we'd parallelise the read/transform/write loop, better utilising Elasticsearch and minimising the cost of network latency.
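One way the discovery service could enable this sharding, sketched below with hypothetical names: each node derives the same sorted view of the cluster from the discovery snapshot and claims only the documents whose hash lands in its slot, so no coordination beyond the shared snapshot is needed. The hash choice (FNV-1a) is illustrative, not anything Kibana uses.

```typescript
// Sketch: deterministically shard migration documents across Kibana
// nodes using a shared cluster snapshot. Names and hash are illustrative.

// FNV-1a: a simple, fast, deterministic 32-bit string hash.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Stable ownership: every node sorts the same node ids the same way,
// so all nodes agree on which node owns which document.
function ownsDocument(docId: string, nodeId: string, allNodeIds: string[]): boolean {
  const sorted = [...allNodeIds].sort();
  const slot = fnv1a(docId) % sorted.length;
  return sorted[slot] === nodeId;
}
```

The hard parts this sketch glosses over are exactly why the discovery service matters: handling nodes joining or dying mid-migration, and agreeing on a single snapshot of `allNodeIds` before the work starts.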

> we could finally have a status API that returns the overall status of our Kibana cluster,

+100 cluster-wide status could be very useful