elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Task Manager] Kibana discovery service #187696

Closed mikecote closed 2 months ago

mikecote commented 3 months ago

To support task partitioning, we must make the Kibana nodes aware of how many nodes are currently running and what their IDs are, so that we can consistently determine which Kibana node owns which task partitions.

To accomplish this, I propose creating a new service within Kibana Task Manager that leverages Elasticsearch to determine the number of Kibana nodes running.
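
To make the idea concrete, here is a minimal sketch of the kind of heartbeat-based discovery described above. The index name, field names, and intervals are hypothetical placeholders, not the actual implementation:

```ts
// Hypothetical sketch: each Kibana node periodically upserts a document keyed
// by its node ID, and any node can estimate the active node set by counting
// documents whose heartbeat is recent enough.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const INDEX = '.kibana_task_manager_nodes'; // placeholder index name
const HEARTBEAT_INTERVAL_MS = 10_000; // placeholder cadence

async function heartbeat(nodeId: string): Promise<void> {
  // Upsert this node's document with a fresh last_seen timestamp.
  await client.index({
    index: INDEX,
    id: nodeId,
    document: { node_id: nodeId, last_seen: new Date().toISOString() },
  });
}

async function getActiveNodeIds(): Promise<string[]> {
  // Nodes whose last_seen falls inside the liveness window count as running.
  const result = await client.search<{ node_id: string }>({
    index: INDEX,
    size: 1000,
    query: { range: { last_seen: { gte: 'now-5m' } } },
  });
  return result.hits.hits.map((hit) => hit._source!.node_id);
}

setInterval(() => heartbeat('kibana-node-a').catch(console.error), HEARTBEAT_INTERVAL_MS);
```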

Requirements

elasticmachine commented 3 months ago

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented 2 months ago

I wonder if we could (ab)use this for other things. For instance, determining how fresh maintenance windows and other persisted data are, which we currently refresh for every rule execution. We could use something like the most recent document date, stored in the service data. When the service updates its heartbeat, it could also check these saved dates and see if we need to refresh them.
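
As a rough illustration of that idea (names and thresholds are purely hypothetical), the heartbeat loop could double as a freshness check:

```ts
// Hypothetical sketch: piggyback freshness checks on the heartbeat loop instead
// of refreshing cached data on every rule execution.
interface CachedResource {
  lastRefreshed: number; // epoch ms of the last refresh
  maxAgeMs: number; // how stale this resource is allowed to get
  refresh: () => Promise<void>;
}

async function onHeartbeat(resources: CachedResource[]): Promise<void> {
  const now = Date.now();
  for (const resource of resources) {
    if (now - resource.lastRefreshed > resource.maxAgeMs) {
      await resource.refresh(); // e.g. reload maintenance windows
      resource.lastRefreshed = now;
    }
  }
}
```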

lukeelmers commented 2 months ago

To be clear, we'd only plan to use this internally in TM, right? (Not shared to other plugins as a service)

Attempt to delete the document when Kibana receives a shutdown signal

What is the impact on task partitioning if this fails, would it mean some tasks just don't get scheduled until the 5m last_seen is exceeded and the doc is cleaned up?

mikecote commented 2 months ago

@lukeelmers

To be clear, we'd only plan to use this internally in TM, right? (Not shared to other plugins as a service)

Correct, this is internal to Task Manager only. We won't expose anything from the plugin for others to consume.

What is the impact on task partitioning if this fails, would it mean some tasks just don't get scheduled until the 5m last_seen is exceeded and the doc is cleaned up?

Nothing negative occurs; this is more of a courtesy cleanup, alongside a fallback that periodically cleans up old documents so the index doesn't grow indefinitely as Kibana nodes stop running.
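
A minimal sketch of what that fallback cleanup could look like (index and field names are hypothetical, matching the sketch earlier in the thread):

```ts
// Hypothetical sketch: periodically delete documents whose last_seen is older
// than the liveness window, so departed Kibana nodes don't accumulate forever.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });
const INDEX = '.kibana_task_manager_nodes'; // placeholder index name

async function removeStaleNodes(): Promise<void> {
  await client.deleteByQuery({
    index: INDEX,
    query: { range: { last_seen: { lt: 'now-5m' } } },
  });
}

// Exact cadence is an implementation detail; once a minute is just an example.
setInterval(() => removeStaleNodes().catch(console.error), 60_000);
```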

lukeelmers commented 2 months ago

Thanks for clarifying @mikecote, this makes sense ❤️

My only concern would be if this were exposed for more general purpose use. But considering we are thinking of it as an implementation detail of TM, then I'm not too worried about it.

mikecote commented 2 months ago

The discovery service will be used to assign task partitions to Kibana nodes. Knowing how many nodes are running, we'll ensure that any given partition is shared by only two nodes, adjusting as Kibana nodes appear or disappear. I have more details in this GH issue (https://github.com/elastic/kibana/issues/187700), and I'm happy to expand further if you like.
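
For illustration only (this is not the algorithm from #187700, just the general shape), assignment from a deterministic ordering of the active node IDs could look like:

```ts
// Hypothetical sketch: spread partitions so each one is owned by two nodes,
// recomputed whenever the set of active node IDs changes.
const NUM_PARTITIONS = 256; // placeholder partition count

function assignPartitions(activeNodeIds: string[]): Map<string, number[]> {
  const sorted = [...activeNodeIds].sort(); // same deterministic order on every node
  const assignments = new Map<string, number[]>(
    sorted.map((id): [string, number[]] => [id, []])
  );
  for (let partition = 0; partition < NUM_PARTITIONS; partition++) {
    // Hand each partition to two consecutive nodes in the sorted list.
    for (let replica = 0; replica < Math.min(2, sorted.length); replica++) {
      const owner = sorted[(partition + replica) % sorted.length];
      assignments.get(owner)!.push(partition);
    }
  }
  return assignments;
}

// Example: with three nodes, every partition ends up on exactly two of them.
console.log(assignPartitions(['node-a', 'node-b', 'node-c']));
```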

kobelb commented 2 months ago

This will never be accurate because it assumes that clocks are synchronized, which goes against a fundamental principle of distributed systems. Just joking, and wanted to get this argument out of the way. This does require clocks to be loosely synchronized to work correctly; however, with two nodes being responsible for each partition there is already some mitigation in place, and we do not need a high degree of precision for this to be usable by task-manager. It is, however, further reason to keep this internal to task-manager, because for other usages this might be a fundamental flaw.

mikecote commented 2 months ago

Perhaps a better name for the SO type is background_task_node if we want to clarify that it is only for Task Manager; the name will be hard to change in the future.

pgayvallet commented 2 months ago

We've been talking about having that kind of "Kibana discovery mechanism" for literally years now. Like, this is one of the first discussions I remember having when I joined 5 years ago.

We've been discussing it a lot. So we sure do know this push/pull system is imperfect, has limitations, won't be as good as a proper discovery system, and doesn't (directly/easily) provide things like leader election or such... So yeah, we do know it can only be used for very specific use cases that need to be carefully chosen.

However, that's still way better than what we have today - Nothing. Void. Nada. KibanaA has no idea if they are alone in the universe or if they have friends. And this is quite sad, in a way. KibanaA could do so many amazing things if they just had a rough idea of the approximate number of nodes in their cluster, and access to information related to those friends.

So, all this amazing story telling to say:

I absolutely get why, from a responseOps perspective, we would like to keep that internal to TM. It's the safe call. You have your use case in mind, and you don't want to open Pandora's box of having to support the feature for other potential consumers. And this totally makes sense. From responseOps's perspective.

I think the standpoint from Core / Platform services should be different though. Personally, I know we've been waiting for years for a valid use case to finally be able to start working on that discovery system. Now that we have what imho looks like the perfect opportunity, I really think we should be talking about the possibility to have this as a platform service instead of some internal implementation detail to TM.

Those were my 2cps.

elasticmachine commented 2 months ago

Pinging @elastic/kibana-core (Team:Core)

mikecote commented 2 months ago

I really think we should be talking about the possibility to have this as a platform service instead of some internal implementation detail to TM.

@pgayvallet would there be interest in moving this service to Core once we figure out how we need it to work for our use case? If so, maybe the Core team can review our approach and let us know what modifications would make it easier to move the service down the line (mappings, SO name, index, etc.). I'm leaning toward this approach rather than having Core build it right away, given that we don't know exactly how it should work and given our immediate need for such a service. Open to ideas.

rudolf commented 2 months ago

++ This has come up several times, but usually a simpler, less optimal solution that doesn't require discovery ends up being chosen. I think it would make a lot of sense to have this as part of the platform. But that doesn't mean Core needs to build it, and I think the priority should be for ResponseOps to validate that discovery helps you with partitioning and increases TM throughput.

This was my stab at an algorithm: https://github.com/elastic/kibana/issues/93029#issuecomment-916352193. The biggest difference is that instead of relying on timestamps to detect liveness, I try to detect liveness by checking whether heartbeats have changed. So the clocks can be out of sync, but a node can still be seen as alive.
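
A rough sketch of that heartbeat-change idea (one reading of it, not the exact algorithm from #93029): liveness is inferred from whether a node's counter moved between two observations, so wall clocks never enter the picture.

```ts
// Hypothetical sketch: every node bumps its own counter in a shared state
// object; a node whose counter stopped changing between polls is treated as gone.
type HeartbeatState = Record<string, number>; // nodeId -> monotonically increasing counter

function findLiveNodes(previous: HeartbeatState, current: HeartbeatState): string[] {
  return Object.keys(current).filter(
    (nodeId) => !(nodeId in previous) || current[nodeId] !== previous[nodeId]
  );
}

// Example: nodeB's counter did not change between polls, so it is considered gone.
const before: HeartbeatState = { nodeA: 41, nodeB: 17 };
const after: HeartbeatState = { nodeA: 42, nodeB: 17 };
console.log(findLiveNodes(before, after)); // ["nodeA"]
```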

pgayvallet commented 2 months ago

@mikecote

would there be interest in moving this service to Core once we figure out how we need it to work for our use case? If so, maybe the Core team can review our approach and let us know what modifications will make it easier to move the service down the line

Yeah, I think it would make sense: that way we avoid blocking you on that initial implementation, but we would make sure that we could somewhat easily port the concept to a Core service later.

lukeelmers commented 2 months ago

that way we avoid blocking you on that initial implementation

++ I'm not categorically opposed to providing something like this as a core service, but I just want to make sure we are treating it as a separate discussion based on other valid use cases (besides just this one), and not blocking this effort on it. If Core can provide guidance along the way to make this easier to repurpose in the future if we need to, that sounds good to me. Let's just be sure we aren't sinking large amounts of time into R&D or adding too much complexity to the work ResponseOps is doing.

mikecote commented 2 months ago

The draft PR (https://github.com/elastic/kibana/pull/187997) is starting to take shape, and it aligns with the issue description if anyone wants to see it in practice.


Thanks, @rudolf, for sharing your thoughts on such a service. If I understand correctly, your approach works with a single Elasticsearch document and, instead of timestamps, uses hrtime. Is the core piece that the Kibana nodes compare the state object on an interval and determine which Kibana nodes are still "alive" by whether a field value has changed?

I like that the approach doesn't rely on the clocks being synchronized. I'm unsure whether such a pattern could be applied with a document per Kibana node while still preventing the index from growing indefinitely. Comparing the approaches, my concern with having all the nodes update the same document is the increase in contention as more Kibana nodes are added (think 64 or even 150). The task manager starts seeing contention when multiple nodes try to claim the same tasks more than 5 times per second, and it becomes hard to work around beyond 10 times per second.

pgayvallet commented 2 months ago

As we agreed, I opened https://github.com/elastic/kibana/issues/188177 to discuss what a "Core discovery service" would look like and to identify which features could benefit from it.