camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

Starting subordinate coordinators with flow scheduler causes crash #54

Open Mythra opened 7 years ago

Mythra commented 7 years ago

When you start a "master" node and a "worker" node, both using the flow scheduler, and then submit a job, the worker node crashes.

The master node starts with:

```
build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2
```

and the worker node starts with:

```
build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2
```

Submitting a job on the master node (for example: `python job_submit.py localhost 8080 /bin/sleep 60`) then leads to a crash inside the worker node.

Talking to @ms705, it seems the root of the problem lies in the worker node's recalculation of the flow:

> What *should* happen is that the subordinate coordinator ends up running a flow scheduler itself, which it can use to schedule within its more restricted window of visibility into the cluster state; remotely placed tasks would have to be reflected in that flow scheduler's flow graph, which they aren't (hence the error).
ms705 commented 7 years ago

This issue arises due to bugs in the code that handles delegated tasks: the flow scheduler does not correctly maintain its data structures w.r.t. tasks that were placed on local resources by a superior coordinator.

We didn't previously notice this because we used a single flow scheduler in the cluster -- the flow scheduling approach works best when the whole cluster state is visible to the scheduler -- and ran simple, queue-based schedulers with subordinate coordinators. (This is also the workaround for the bug: use `--scheduler=simple` for subordinate coordinators.)
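Concretely, the workaround only changes the scheduler flag on the subordinate (worker) coordinator; the master's hostname below is the placeholder from the report above:

```shell
# Subordinate coordinator with the queue-based scheduler instead of flow;
# this avoids the crash because no local flow graph needs to track
# delegated tasks.
build/src/coordinator --scheduler=simple \
  --parent_uri tcp:firmament.masternode.com:8000 \
  --listen_uri tcp:0.0.0.0:8000 \
  --task_lib_dir=$(pwd)/build/src --v=2
```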

Fixing this will require correct handling of delegated tasks in FlowScheduler, i.e., having FlowScheduler override the relevant method implementations it currently inherits from EventDrivenScheduler.