facebookarchive / bistro

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.
https://bistro.io
MIT License

Sometimes scheduler gets error No initial worker set ID consensus #15

Closed by ghost 6 years ago

ghost commented 7 years ago

Sometimes it gets this error and does not run tasks: W0517 08:44:53.419775 11057 RemoteWorkerRunner.cpp:89] RemoteWorkerRunner initial wait (/home/user/src/bistro/bistro/runners/RemoteWorkerRunner.cpp:75): No initial worker set ID consensus. Waiting for all workers to connect before running tasks.

It sometimes seems to work more consistently if the worker is started (fully) before the scheduler(?)

Scheduler startup:

$HOME/src/bistro/bistro/cmake/Debug/server/bistro_scheduler \
  --server_port=6789 \
  --http_server_port=6790 \
  --config_file=/etc/bs/config.json \
  --clean_statuses \
  --CAUTION_startup_wait_for_workers=700 \
  --instance_node_name=scheduler

Worker startup:

$HOME/src/bistro/bistro/cmake/Debug/worker/bistro_worker \
  --server_port=27182 \
  --scheduler_host=:: \
  --scheduler_port=6789 \
  --worker_command="/etc/bs/default_task.sh" \
  --data_dir=/tmp/bistro_worker
snarkmaster commented 7 years ago

It should run tasks eventually. How long do you wait? It looks like your timeouts are all at default, so the initial wait should not be too long.

Can you tell me more about your use-case -- how often do you restart the workers / scheduler? Could you also give a general idea of the workload in terms of task duration, number of tasks & jobs, etc.?

This is actually one of my least favorite parts of the system. The design goal is to avoid ever running two copies of the same task on different workers (due to network partitions, etc.). However, this is done in a database-free way, so the scheduler does not persist its list of "currently connected workers" anywhere. Instead, on startup, it waits for "all" workers to connect before running any new tasks. This wait is long enough that any workers stuck behind a network partition would already have killed their tasks.

There are a lot of timeouts that you can configure to set a tradeoff between how quickly the scheduler starts running tasks after a restart and how strongly it guarantees that the same task never runs twice.

The defaults are reasonable but probably not perfect for any application. That's why I'm inviting you to discuss your goals here.

The design is documented here: https://github.com/facebook/bistro/blob/master/bistro/remote/README.worker_set_consensus

However, you might need to start here: https://github.com/facebook/bistro/blob/master/bistro/if/README.worker_protocol

ghost commented 7 years ago

@snarkmaster, I think I waited for maybe 2-3 minutes(?). I was wondering if my Ctrl+C on the scheduler left something behind? My build seems broken for some reason right now, so I haven't had a chance to retest.

snarkmaster commented 7 years ago

Ctrl-Cing the scheduler would not leave any state behind. The default wait is actually kind of long:

  const time_t kMinSafeWait =
    RemoteWorkerState::maxHealthcheckGap() +
    RemoteWorkerState::loseUnhealthyWorkerAfter() +
    RemoteWorkerState::workerCheckInterval() +  // extra safety gap
    RemoteWorkerState::workerSuicideBackoffSafetyMarginSec() +
    (RemoteWorkerState::workerSuicideTaskKillWaitMs() / 1000) + 1;

  60 + 60 + 5  // maxHealthcheckGap()
  + 500        // loseUnhealthyWorkerAfter()
  + 5          // workerCheckInterval()
  + 60         // workerSuicideBackoffSafetyMarginSec()
  + 5          // workerSuicideTaskKillWaitMs() / 1000
  + 1
  = 696 seconds, or about 12 minutes
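On the command line, those terms presumably correspond to gflags like the following. Only --lose_unhealthy_worker_after appears elsewhere in this thread; the other flag names are inferred from the accessor names above, so verify them against bistro_scheduler --help before relying on this sketch:

# Assumed flag names; the values are the defaults implied by the arithmetic
# above (seconds, except the _ms flag). The first three are a guess at what
# makes up maxHealthcheckGap()'s 60 + 60 + 5.
bistro_scheduler \
  --healthcheck_period=60 \
  --healthcheck_timeout=60 \
  --worker_check_interval=5 \
  --lose_unhealthy_worker_after=500 \
  --worker_suicide_backoff_safety_margin_sec=60 \
  --worker_suicide_task_kill_wait_ms=5000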

Depending on your priorities, you could lower your --lose_unhealthy_worker_after.
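For example, to declare lost workers unhealthy after 2 minutes instead of the ~8-minute default (120 is just an illustrative value):

bistro_scheduler ... --lose_unhealthy_worker_after=120

This shortens the scheduler's initial wait, presumably at the cost of losing (and suiciding) workers sooner during a network hiccup.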

For quick experimentation, you can also set CAUTION_exit_initial_wait_before_timestamp in your bistro_settings to get the scheduler going right away.
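For illustration, assuming your /etc/bs/config.json keeps scheduler settings under a top-level "bistro_settings" key, that could look like the following, where the value is a unix timestamp slightly in the future (going by the setting's name, the scheduler should skip the initial wait whenever the current time is still before that timestamp):

{
  "bistro_settings": {
    "CAUTION_exit_initial_wait_before_timestamp": 1500000000
  }
}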

I already pointed you at the docs explaining why these timeouts exist; let me know if you have any questions.

Unrelatedly: why are you passing --CAUTION_startup_wait_for_workers? There are very few situations in which you would want to change this flag. I would remove it, as shown below.
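That is, the scheduler invocation from the first comment would become:

$HOME/src/bistro/bistro/cmake/Debug/server/bistro_scheduler \
  --server_port=6789 \
  --http_server_port=6790 \
  --config_file=/etc/bs/config.json \
  --clean_statuses \
  --instance_node_name=scheduler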

snarkmaster commented 6 years ago

I'll close this out, but feel free to reopen if you'd like to discuss further.