apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

[Proposal] HTTPAnnouncer + Remove Zookeeper Dependency #2312

Closed guobingkun closed 5 years ago

guobingkun commented 8 years ago

HTTP Announcer

Druid uses Announcer for internal announcement and CuratorServiceAnnouncer for external announcement. Both of them rely on the existence of Zookeeper. As a step toward removing the Zookeeper dependency, I am proposing an HTTPAnnouncer that can be used to announce various kinds of information to the Coordinator instead of to Zookeeper; the Coordinator should still be able to function as it currently does, using the information collected via the HTTPAnnouncer.

An HTTPAnnouncer will be bound to a Druid node at a very early stage of its lifecycle. When the node's lifecycle starts, the HTTPAnnouncer starts a background task that periodically sends a heartbeat message to the Coordinator. This heartbeat message serves as an indication that its associated Druid node exists in the cluster; if the Coordinator does not receive a node's heartbeat messages for a long time, that means the node has died. The heartbeat message should contain only the necessary, lightweight metadata the Coordinator needs to identify the Druid node and possibly its type/capabilities.
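To make the idea concrete, here is a minimal sketch of such a heartbeat loop. Everything in it is assumed for illustration: the class shape, the Coordinator endpoint, and the payload are not existing Druid APIs, and it uses the Java 11 `java.net.http` client purely for brevity.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only: class shape, endpoint, and payload are invented.
public class HttpAnnouncer
{
  private final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
  private final HttpClient client = HttpClient.newHttpClient();
  private final URI heartbeatUri;    // e.g. http://coordinator:8081/druid/coordinator/v1/heartbeat
  private final String nodeMetadata; // lightweight JSON: host, port, node type/capabilities

  public HttpAnnouncer(URI heartbeatUri, String nodeMetadata)
  {
    this.heartbeatUri = heartbeatUri;
    this.nodeMetadata = nodeMetadata;
  }

  // Bound to the node early in its lifecycle; starts the background heartbeat task.
  public void start(long periodMillis)
  {
    exec.scheduleAtFixedRate(this::sendHeartbeat, 0, periodMillis, TimeUnit.MILLISECONDS);
  }

  private void sendHeartbeat()
  {
    try {
      HttpRequest request = HttpRequest.newBuilder(heartbeatUri)
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(nodeMetadata))
          .build();
      client.send(request, HttpResponse.BodyHandlers.discarding());
    }
    catch (Exception e) {
      // A missed heartbeat is tolerable; the Coordinator only considers the
      // node dead after not hearing from it for a long time.
    }
  }

  public void stop()
  {
    exec.shutdownNow();
  }
}
```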

Besides sending heartbeat messages in the background, the HTTPAnnouncer can also be used to announce other kinds of metadata (e.g., segments being served, running tasks, etc.) to the Coordinator.

Remove Zookeeper dependency

Once the HTTPAnnouncer is ready, the Coordinator should be able to build up a brief view of the entire Druid cluster. It doesn't need to store all the details about the Druid nodes, but it should be able to ask a specific Druid node for information when it needs to.

Since there is no Zookeeper, there are some open questions worth discussing:

How does Coordinator maintain a segment view of what realtime/historical nodes are serving what segments?

Realtime and Historical nodes will actively announce segments when they are actually serving them. The Coordinator will receive those announcements and update the segment view accordingly. Meanwhile, the Coordinator will also run a background task that periodically asks those Historical/Realtime nodes whether they are still serving the segments.
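A minimal sketch of that Coordinator-side bookkeeping, assuming a push path for announcements and a pull path for reconciliation (all names here are invented for illustration):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Coordinator-side segment view: updated by pushed announcements,
// periodically reconciled against each node's own answer. Names invented.
public class CoordinatorSegmentView
{
  // node id -> segment ids the node claims to be serving
  private final Map<String, Set<String>> servedSegments = new ConcurrentHashMap<>();

  // A Historical/Realtime node announced a segment over HTTP.
  public void onSegmentAnnounced(String nodeId, String segmentId)
  {
    servedSegments.computeIfAbsent(nodeId, k -> ConcurrentHashMap.newKeySet()).add(segmentId);
  }

  // A node announced it stopped serving a segment.
  public void onSegmentUnannounced(String nodeId, String segmentId)
  {
    Set<String> segments = servedSegments.get(nodeId);
    if (segments != null) {
      segments.remove(segmentId);
    }
  }

  // Result of the background task asking a node "what are you still serving?";
  // the node's authoritative answer replaces whatever view we had for it.
  public void reconcile(String nodeId, Set<String> actuallyServed)
  {
    Set<String> fresh = ConcurrentHashMap.newKeySet();
    fresh.addAll(actuallyServed);
    servedSegments.put(nodeId, fresh);
  }
}
```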

How does Broker maintain a segment view of what realtime/historical nodes are serving what segments?

IMO the Broker should ask the Coordinator for that view at startup, and then receive HTTP callbacks from the Coordinator whenever the view changes.
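One hypothetical shape for that Broker/Coordinator exchange (the interface and method names are invented, not an existing API):

```java
import java.net.URI;
import java.util.Map;
import java.util.Set;

// Hypothetical shape of the Broker <-> Coordinator view synchronization:
// one full snapshot at startup, then incremental callbacks. Names invented.
public interface SegmentViewSync
{
  // At Broker startup: fetch the full segment view from the Coordinator.
  Map<String, Set<String>> fetchSnapshot();

  // Register the Broker's callback URL; the Coordinator POSTs deltas to it
  // whenever the view changes.
  void subscribe(URI brokerCallbackUri);

  // Invoked by the Broker's HTTP endpoint when such a delta arrives.
  void applyDelta(String nodeId, Set<String> addedSegments, Set<String> removedSegments);
}
```

One obvious cost of this design is the size of the initial snapshot on a large cluster, a concern raised further down in this thread.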

How to assign a segment to a historical?

The Coordinator should be able to maintain a view of all the Historical nodes, just as it currently does with Zookeeper. LoadQueuePeon should still be used for loading/dropping segments, except that instead of creating zNodes on Zookeeper, it will send an HTTP request to the Historical.

#2314 is a concrete proposal to answer this question.
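As a rough sketch of the HTTP-based load/drop path described above (this only illustrates the idea, not the design in #2314; the endpoint path and payload are invented):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical HTTP load/drop path replacing LoadQueuePeon's zNode writes.
// The endpoint path and payload are invented for illustration.
public class HttpLoadQueuePeon
{
  private final HttpClient client = HttpClient.newHttpClient();
  private final URI historicalBaseUri; // e.g. http://historical:8083

  public HttpLoadQueuePeon(URI historicalBaseUri)
  {
    this.historicalBaseUri = historicalBaseUri;
  }

  public void loadSegment(String segmentId) throws Exception
  {
    send("LOAD", segmentId);
  }

  public void dropSegment(String segmentId) throws Exception
  {
    send("DROP", segmentId);
  }

  // Instead of creating a zNode per instruction, POST it to the Historical.
  private void send(String action, String segmentId) throws Exception
  {
    String body = String.format("{\"action\":\"%s\",\"segment\":\"%s\"}", action, segmentId);
    HttpRequest request = HttpRequest.newBuilder(historicalBaseUri.resolve("/druid/historical/v1/loadQueue"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
    client.send(request, HttpResponse.BodyHandlers.discarding());
  }
}
```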

How to do leader election?

There are various leader election algorithms available; Raft is one option that is already mentioned in the Druid roadmap.

How does Druid recover in case the Coordinator leader crashes?

Since there is no Zookeeper, each Druid node should have a runtime property that tells it who the Coordinators are, for example, druid.coordinator.hosts=host1:port1,host2:port2,host3:port3. If the leader crashes, heartbeat messages sent to the old leader will no longer be acknowledged; at that moment a Druid node should realize the leader is gone, and send heartbeat messages to another Coordinator specified in druid.coordinator.hosts until a new leader is elected.
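A minimal sketch of that failover behavior, assuming the hypothetical druid.coordinator.hosts property above (class and method names invented):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical client-side failover over the proposed druid.coordinator.hosts
// list: when heartbeats to the current Coordinator stop being acknowledged,
// rotate to the next configured host until the new leader answers.
public class CoordinatorFailover
{
  private final List<String> hosts;
  private int current = 0;

  // e.g. coordinatorHostsProperty = "host1:port1,host2:port2,host3:port3"
  public CoordinatorFailover(String coordinatorHostsProperty)
  {
    this.hosts = Arrays.asList(coordinatorHostsProperty.split(","));
  }

  public String currentHost()
  {
    return hosts.get(current);
  }

  // Called after heartbeats to currentHost() repeatedly go unacknowledged.
  public String failover()
  {
    current = (current + 1) % hosts.size();
    return hosts.get(current);
  }
}
```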

How does Overlord keep track of running tasks?

MiddleManagers could announce tasks when they start, and the Overlord could periodically query the MiddleManagers for the latest task status.
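Sketched the same way as the segment view above, the Overlord-side bookkeeping could be as simple as this (names and statuses invented for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Overlord-side task bookkeeping, mirroring the segment view:
// MiddleManagers announce task starts, the Overlord polls for status updates.
public class OverlordTaskView
{
  public enum TaskStatus { RUNNING, SUCCESS, FAILED }

  private final Map<String, TaskStatus> tasks = new ConcurrentHashMap<>();

  // A MiddleManager announced a task it has started.
  public void onTaskAnnounced(String taskId)
  {
    tasks.put(taskId, TaskStatus.RUNNING);
  }

  // Latest status obtained by periodically querying the MiddleManager.
  public void onStatusPolled(String taskId, TaskStatus status)
  {
    tasks.put(taskId, status);
  }

  public TaskStatus statusOf(String taskId)
  {
    return tasks.get(taskId);
  }
}
```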

All of these questions are open; feel free to share your thoughts.

pjain1 commented 8 years ago

:+1: for the proposal

Regarding the question

"How does Druid recover in case of Coordinator leader crashes?"

It is difficult to maintain a static list of Coordinator host names in a runtime property, because if a new Coordinator joins, the properties files on all nodes need to be changed and the nodes possibly restarted. IMO the Druid nodes should instead wait for another Coordinator to announce its leadership once they know that the current leader has died. Raft promises that leader election is pretty fast, so the Druid nodes can continue working independently without any Coordinator leader for a short amount of time, just like they do now when no Coordinator is present. This way nodes do not have to maintain a list of Coordinators, and also do not need to send heartbeats to some arbitrary Coordinator while no leader is present.

pjain1 commented 8 years ago

For "How to assign a segment to Historical ?" there is a similar proposal here - https://github.com/druid-io/druid/issues/2314

drcrallen commented 8 years ago

Minor bookkeeping comments:

xvrl commented 8 years ago

Here are a few things I think we should keep in mind as part of this discussion:

  1. I believe we need to separate the notion of node announcement from the other coordination requirements, such as segment/task announcement. Currently we do not make that distinction, which leads to service discovery being abused to discover nodes internally, and to the lack of a general way of discovering other nodes within Druid, something we've repeatedly needed for router -> master, overlord -> master, worker -> overlord, and tranquility -> overlord discovery. More on the subject here: https://github.com/druid-io/druid/issues/2040
  2. Very importantly, we need to make sure we provide a way to do rolling upgrades while transitioning away from ZK.
  3. We need to keep in mind that there are very large clusters out there, so we need to make sure we don't cause massive cross-talk when dealing with a large number of segments. Having each broker query the coordinator for the segment list sounds like a really bad idea. For instance, we have about 3 million segments in our cluster, which would require a broker to pull roughly 5GB worth of data from the coordinator every time it restarts. This would likely cause the coordinator to fall over, given that the memory requirement on the coordinator is already very high.
  4. ZK is probably still one of the few production ready services out there to do leader election and node announcements. While I agree most things should probably be moved off of ZK, until there are production ready java implementations of raft out there, re-implementing node announcement and leader-election should probably be the last things we tackle, given the inherent difficulty in getting those right.

liqweed commented 8 years ago

Have you considered using Atomix? It's an Apache 2.0 licensed Java library with an embeddable Raft implementation (Copycat), with support for leader election, cluster membership changes, and bi-directional client sessions. Atomix is verified by Jepsen testing for linearizability in failure scenarios.

guobingkun commented 8 years ago

@liqweed Thanks for the info! I will check it out.

gianm commented 8 years ago

@guobingkun @liqweed Copycat did come up in a survey we did last year, and we did not use it at the time for a few reasons.

It seems to have matured a lot since then (I don't recall the Jepsen tests existing back then), and we are probably somewhat closer to being ready to require Java 8. However, the GitHub site still says "beta - not production ready". @liqweed, are you involved with the project? Do you have any insight into what exactly that means?

liqweed commented 8 years ago

@gianm Sorry, I'm not involved with the project (though tempted), but I find it very relevant for the purposes of a product I'm working on.

My take on the "beta" label (before contacting the author): it seems further ahead than any other embeddable solution I'm aware of. Its testing seems sound and its API well thought out. I can see how Java 8 could be a deal breaker, but if Druid makes that step (which I'm sure could be leveraged further), then I would at the very least favor embedding it rather than implementing it from scratch. I think any investment in closing gaps found in its implementation would be dwarfed by the investment it would take to write it from scratch and then perfect that implementation, not to mention the benefit to the community.

Are you considering making this layer pluggable, so that leader election could fall back on an embedded implementation, but otherwise could be configured to use ZK, etcd, Serf, etc.?

gianm commented 8 years ago

@liqweed I agree that Copycat does seem the farthest along of any embeddable JVM consensus system. I would love to hear whether anyone has run it in production long enough to see if it can match ZK in reliability (ZK's record is not perfect, but it's pretty good).

I'm not sure about other folks, but I was not really imagining this layer being pluggable, as that seems like more trouble than it's worth. I think in the long run we could merge the consensus system and the metadata store and have something that people are generally happy with and don't feel the need to plug in other solutions.

F21 commented 8 years ago

Have there been any updates to this?

julienmathevet commented 7 years ago

+1

himanshug commented 7 years ago

I recently started working on this, to let Coordinators discover Historicals and Realtime nodes (it's a TODO in https://github.com/druid-io/druid/pull/3902). I am looking into Copycat and will see if we can use it now that we are going to become Java 8-only soonish.

stale[bot] commented 5 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.