facebookarchive / bistro

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.
https://bistro.io
MIT License

How can we achieve HA/failover of the scheduler? #32

Closed nwong4932 closed 5 years ago

nwong4932 commented 5 years ago

Hi,

My company is interested in using Bistro for our distributed task system. We are reading Bistro's design and code, and one important factor for us is how to achieve high availability of the scheduler. Can you let me know if this is implemented? If not, how can I achieve it, and where is the best place to add HA logic?

Best regards Nathan

snarkmaster commented 5 years ago

Thanks for your interest!

tl;dr — yes, Bistro is designed for high availability, for some definition of that phrase. It's not a real-time scheduler, so failover is not instant, but it's good enough for its various use cases within Facebook: our hardware fails, machines get rebooted, etc., without disrupting system operation.

I assume that you would have some kind of orchestration system restart the Bistro scheduler (possibly on a new host) if the current scheduler goes down or becomes unhealthy. When you do this, you'll also need to flip the scheduler discovery mechanism (in the OSS release, this is unfortunately DNS or a hardcoded IP) to point at the new scheduler instance, so that workers can start using it.

Is that what you mean by "HA"? Or are you going for zero downtime?

By the way, having to point the scheduler discovery mechanism (DNS?) at your new instance means the availability of the new instance could be subject to DNS caching latency. If you have a fancier discovery service, I'd be happy to discuss a patch to allow workers to use that instead of DNS. FB has a low-latency service discovery mechanism internally, so DNS latency isn't a problem for our failovers.
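To make that concrete, here is a rough sketch of the kind of external watchdog I have in mind. The port, the helper functions, and the host names are all placeholders for whatever health check, process manager, and DNS/discovery API you already have; none of this ships with Bistro:

```python
#!/usr/bin/env python3
"""Hypothetical failover watchdog, not part of Bistro. Supply your own
process launcher and discovery update; the health probe below is a crude stand-in."""
import socket
import time

STANDBY_HOSTS = ["scheduler-a.example.com", "scheduler-b.example.com"]  # placeholders
POLL_INTERVAL_S = 15


def is_scheduler_healthy(host, port=12345, timeout=5):
    """Crude liveness probe: can we open a TCP connection to the scheduler?
    (The port is made up; a real check might hit a status endpoint instead.)"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_scheduler_on(host):
    """Placeholder: ask your process manager to launch the scheduler on `host`."""
    raise NotImplementedError


def update_discovery(host):
    """Placeholder: repoint DNS (or your discovery service) at `host`.
    Workers will only follow once their cached DNS entries expire."""
    raise NotImplementedError


def main():
    current = STANDBY_HOSTS[0]
    start_scheduler_on(current)
    update_discovery(current)
    while True:
        time.sleep(POLL_INTERVAL_S)
        if not is_scheduler_healthy(current):
            # Fail over: pick another host, start a scheduler there,
            # then flip discovery so workers migrate to it.
            current = next(h for h in STANDBY_HOSTS if h != current)
            start_scheduler_on(current)
            update_discovery(current)


if __name__ == "__main__":
    main()
```

The Bistro-specific part is just the order of the last two calls in the loop: start the new scheduler first, then flip discovery, and accept that workers will trickle over as their caches expire.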

Bistro is designed[1] so that scheduler restarts do not affect the running workload. While the scheduler is down, no new work will start, but scheduling new tasks should[2] resume reasonably quickly after a restart.

Additionally, when a scheduler restarts, its API (and thus web UI) will be unavailable for a few seconds (usually, DNS latency will be worse anyway).

Unless your orchestration mechanism uses a global lock (e.g. a ZooKeeper-based wrapper) to ensure that one and only one Bistro scheduler is running, you might accidentally start two schedulers at once. This is OK. Bistro's scheduler-worker communication protocol is designed to be robust against this, in the sense that workers will only take commands from the scheduler that the discovery mechanism (i.e. DNS) currently tells them to use, and will ignore the other instance.

Footnotes

[1] Seamless recovery from a scheduler restart requires a persistent TaskStore (see e.g. https://github.com/facebook/bistro/blob/master/bistro/statuses/SQLiteTaskStore.h). This is a very simple piece of code interacting with a SQL DB — the one Bistro uses in production, MySQLTaskStore, is built on an FB-internal MySQL client and is not part of the open-source release. If there's interest, I'd be happy to help rectify this.
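Just to show how little is involved, here is a toy Python sketch of the idea (persist statuses as they change, read them back after a restart). This is not Bistro's actual C++ TaskStore interface, only an illustration of the contract a persistent store has to satisfy:

```python
import sqlite3


class ToyTaskStore:
    """Illustration only: a tiny "remember task statuses across restarts"
    store, loosely in the spirit of SQLiteTaskStore (not its real API)."""

    def __init__(self, path="task_statuses.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS statuses ("
            " job_id TEXT, node_id TEXT, status TEXT,"
            " PRIMARY KEY (job_id, node_id))"
        )

    def store(self, job_id, node_id, status):
        # Called every time a task's status changes.
        self.db.execute(
            "INSERT OR REPLACE INTO statuses VALUES (?, ?, ?)",
            (job_id, node_id, status),
        )
        self.db.commit()

    def fetch_all(self):
        # A freshly restarted scheduler replays these rows so it does not
        # re-run tasks that already finished.
        return list(self.db.execute("SELECT job_id, node_id, status FROM statuses"))
```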

[2] You really don't have to understand this footnote deeply to start using Bistro (most client teams internally do not), but I prefer to be explicit about potential pitfalls. The time from a scheduler restart until the scheduler will run new tasks is, unfortunately, a little complicated. It's called "initial wait" in the code & logs — it's the sum of a variety of timeouts. In the common case, it takes a few heartbeat intervals to recover, but if you have tens of thousands of workers, or other performance bottlenecks on the scheduler, the wait can be worse. This wait is due to a (in retrospect less-than-practical) design decision that I made: Bistro does not use a DB to store its current worker set, and instead uses a distributed consensus protocol to derive it. To learn more about this whole problem, you could start here: https://github.com/facebook/bistro/blob/master/bistro/remote/README.worker_set_consensus — feel free to follow up with specific questions if worker tracking is a concern for you.

nwong4932 commented 5 years ago

Hi, thank you for your quick response. We don't have an orchestration tool; our in-house solution uses master/slave for failover: when the master is down, all of the master's workers automatically switch to the slave's IP. Do you think this is doable for Bistro?

Best regards Nathan

snarkmaster commented 5 years ago

What you are describing is doable for Bistro (some code would have to be written).

Let me start with some questions: (a) How does a worker determine when the master is down? (b) Is there a global consensus on which of the active schedulers is the master? How is that maintained? (c) What do schedulers (and workers) do in the case of a network partition?

My off-the-cuff answers for these are as follows: (a) Some high-availability global consensus service (like ZooKeeper) stores the current master IP. The workers just query that. (b) Global consensus is required. A ZooKeeper lock must be held by the master. When the lock is lost, the master must shut down (this protects against a network partition). (c) See (b).
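To make (a) concrete, here is a rough sketch of the worker-side piece, using the kazoo ZooKeeper client purely for illustration. The znode path and the "host:port" payload are assumptions on my part, not anything Bistro does today:

```python
import time

from kazoo.client import KazooClient

MASTER_ZNODE = "/bistro/current_master"  # assumed layout: payload is b"host:port"

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

current_master = None


@zk.DataWatch(MASTER_ZNODE)
def on_master_change(data, stat):
    """Fires whenever the master znode changes; a patched worker would then
    redirect its heartbeats to the new scheduler address."""
    global current_master
    if data is not None:
        current_master = data.decode("utf-8")
        print("scheduler is now at", current_master)


while True:  # a real worker would do this inside its existing heartbeat loop
    time.sleep(60)
```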

If you have something else in mind, please spell it out so we can discuss its merits.

First and foremost, I'm not sure of the benefit of running several copies of the scheduler all the time when only one is active. Are we trying to reduce the failover time? What is the acceptable failover time? Are you sure this is not achievable by starting a new scheduler as needed? Then, instead of slaves, you could just have a handful of hosts that periodically check master health and compete for the corresponding ZooKeeper lock. A new master would only start when the old one loses the lock. This "start a scheduler" wrapper would also be responsible for terminating the old master if it loses the lock (e.g. in case of a network partition).

In other words, I'm suggesting putting most of the master-selection logic into a wrapper around the Bistro scheduler. I'd still be glad to take a patch adding such a wrapper to this GitHub repo, and to provide feedback.
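A rough sketch of that wrapper, again using kazoo for illustration; the lock path, the znode used to publish the master address, the scheduler command line, and the port are all placeholders I made up:

```python
import socket
import subprocess

from kazoo.client import KazooClient, KazooState

LOCK_PATH = "/bistro/master_lock"        # placeholder
MASTER_ZNODE = "/bistro/current_master"  # placeholder; matches the worker sketch above
SCHEDULER_CMD = ["bistro_scheduler"]     # plus whatever flags you normally pass

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

scheduler = None  # subprocess.Popen handle once this host becomes master


def on_state_change(state):
    # If the ZK session is lost or suspended (e.g. network partition), we can
    # no longer prove we hold the lock, so kill the local scheduler right away.
    if state in (KazooState.LOST, KazooState.SUSPENDED) and scheduler is not None:
        scheduler.kill()


zk.add_listener(on_state_change)

lock = zk.Lock(LOCK_PATH, identifier=socket.gethostname())
with lock:  # blocks until this host wins the "election"
    # Publish our address so workers (see the earlier sketch) can find us.
    addr = "{}:12345".format(socket.gethostname()).encode()  # port is made up
    if zk.exists(MASTER_ZNODE):
        zk.set(MASTER_ZNODE, addr)
    else:
        zk.create(MASTER_ZNODE, addr, makepath=True)
    scheduler = subprocess.Popen(SCHEDULER_CMD)
    scheduler.wait()  # hold the lock for as long as the scheduler is alive
```

The property that matters is that the scheduler process only runs while this host holds the lock, and is killed as soon as the wrapper can no longer prove it still holds it.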

Of course, with the above ZK-based wrapper, we'd still need a small patch to the workers to enable them to query ZooKeeper (or similar) to determine the current master.

With the solution described above, I should add that one could also consider using ZK to keep locks for the active workers. This would eliminate the "initial wait" latency problem I alluded to above; I probably should have done that from the start. Luckily, changing Bistro to explicitly track workers is actually much easier than the distributed consensus scheme it currently implements. I'd be glad to discuss this more if it's useful to you.
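For what it's worth, explicit worker tracking could be as simple as the following sketch: each worker keeps an ephemeral znode alive, and the scheduler watches the children of that path instead of re-deriving the worker set via consensus. Again, the path and payload are assumptions, not existing Bistro behavior:

```python
import json
import socket

from kazoo.client import KazooClient

WORKERS_PATH = "/bistro/workers"  # assumed layout for illustration

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Worker side: register an ephemeral znode. It disappears automatically when
# the worker's ZK session dies, so the scheduler's view stays current without
# any "initial wait" after a scheduler restart.
payload = json.dumps({"host": socket.gethostname(), "port": 23456}).encode()  # port made up
zk.create("{}/{}".format(WORKERS_PATH, socket.gethostname()), payload,
          ephemeral=True, makepath=True)


# Scheduler side: a child watch yields the live worker set immediately.
@zk.ChildrenWatch(WORKERS_PATH)
def on_workers_change(children):
    print("current worker set:", sorted(children))
```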

nwong4932 commented 5 years ago

Thank you for your points. We need to rethink how to properly handle failover. I am summarizing all the gaps between our use case and Bistro and will create a new issue. Thanks!

snarkmaster commented 5 years ago

Ok, I'll close this out for now! Thanks for the questions.