ghost closed this issue 6 years ago.
It should run tasks eventually. How long do you wait? It looks like your timeouts are all at default, so the initial wait should not be too long.
Can you tell me more about your use-case -- how often you restart the workers / scheduler? Also give a general idea of the workload in terms of task duration, number of tasks & jobs, etc?
This is actually one of my least favorite parts of the system. The design goal is to avoid starting two copies of the same task on several workers (due to network partitions, etc). However, this is done in a database-free way, so the scheduler does not persist its list of "currently connected workers" anywhere. Instead, on startup, it waits for "all" workers to connect before running any new tasks. This wait is long enough that workers behind a network partition would have killed their tasks.
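The startup behavior described above can be sketched roughly like this (the names and structure are my own illustration, not Bistro's actual code): block until every previously-known worker has reconnected, or until the safe-wait deadline passes, at which point any worker still behind a partition would have killed its tasks anyway.

```python
import time

def wait_for_workers(connected, expected, min_safe_wait, poll=1.0):
    """Hypothetical sketch of the scheduler's initial wait.

    connected: callable returning the set of currently connected workers.
    expected: the set of workers the scheduler believes should exist.
    min_safe_wait: seconds after which partitioned workers are guaranteed
    to have committed suicide (killed their tasks).
    """
    deadline = time.time() + min_safe_wait
    while time.time() < deadline:
        if expected <= connected():
            return "consensus"      # all known workers reconnected
        time.sleep(poll)
    return "safe_timeout"           # partitioned workers' tasks are dead by now
```

Either exit condition makes it safe to start new tasks without risking two live copies of the same task.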
There are a number of timeouts that you can configure to tune the tradeoff between safety (never double-starting a task) and startup latency.
The defaults are reasonable but probably not perfect for any application. That's why I'm inviting you to discuss your goals here.
The design is documented here: https://github.com/facebook/bistro/blob/master/bistro/remote/README.worker_set_consensus
However, you might need to start here: https://github.com/facebook/bistro/blob/master/bistro/if/README.worker_protocol
@snarkmaster , I think I waited for maybe about 2-3 minutes(?) I was wondering if my ctrl+C on the scheduler left something "left over"? My build seems broken for some reason right now, so I haven't gotten a chance to retest.
Ctrl-C-ing the scheduler would not leave any state behind. The default wait is actually kind of long:
```cpp
const time_t kMinSafeWait =
    RemoteWorkerState::maxHealthcheckGap() +
    RemoteWorkerState::loseUnhealthyWorkerAfter() +
    RemoteWorkerState::workerCheckInterval() + // extra safety gap
    RemoteWorkerState::workerSuicideBackoffSafetyMarginSec() +
    (RemoteWorkerState::workerSuicideTaskKillWaitMs() / 1000) + 1;
```
With the defaults, that works out to 60 + 60 + 5 + 500 + 5 + 60 + 5 + 1 = 696 seconds, or about 12 minutes.
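As a back-of-the-envelope check, the sum can be reproduced like this. The variable names mirror the C++ above, and the grouping of the first three numbers into `maxHealthcheckGap` is my guess at how the arithmetic maps onto the constants:

```python
# Assumed default values (seconds), taken from the arithmetic above.
max_healthcheck_gap = 60 + 60 + 5              # grouping is an assumption
lose_unhealthy_worker_after = 500
worker_check_interval = 5                      # extra safety gap
worker_suicide_backoff_safety_margin_sec = 60
worker_suicide_task_kill_wait_ms = 5000

min_safe_wait = (
    max_healthcheck_gap
    + lose_unhealthy_worker_after
    + worker_check_interval
    + worker_suicide_backoff_safety_margin_sec
    + worker_suicide_task_kill_wait_ms // 1000
    + 1
)
print(min_safe_wait)  # 696
```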
Depending on your priorities, you could lower your `--lose_unhealthy_worker_after`.
For quick experimentation, you can also set `CAUTION_exit_initial_wait_before_timestamp` in your `bistro_settings` to quickly get the scheduler going.
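The setting takes a Unix timestamp. My reading of the name (an assumption; check the worker_set_consensus doc for the real semantics) is that the scheduler skips the initial wait as long as the current time is still before that timestamp, so you would set it a little into the future, e.g. "now + 10 minutes":

```python
import time

# Print a Unix timestamp 10 minutes from now, suitable as a value for
# CAUTION_exit_initial_wait_before_timestamp in bistro_settings.
# (The "skip the wait while now < timestamp" semantics are my assumption.)
print(int(time.time()) + 600)
```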
I already pointed you at the docs explaining why these timeouts exist; let me know if you have any questions.
Unrelatedly: why are you passing `--CAUTION_startup_wait_for_workers`? There are very few situations in which you want to be changing this flag. I would remove it.
I'll close this out, but feel free to reopen if you'd like to discuss further.
Sometimes it gets an error and does not run tasks:

```
W0517 08:44:53.419775 11057 RemoteWorkerRunner.cpp:89] RemoteWorkerRunner initial wait (/home/user/src/bistro/bistro/runners/RemoteWorkerRunner.cpp:75): No initial worker set ID consensus. Waiting for all workers to connect before running tasks.
```
It sometimes seems to work more consistently if the worker is started (fully) before the scheduler(?)
Scheduler startup:
Worker startup: