camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

Starting subordinate coordinators with flow scheduler causes crash #54

Open Mythra opened 7 years ago

Mythra commented 7 years ago

When you start a "master" node and a "worker" node, both using the flow scheduler, and then submit a job, the worker node crashes.

The master node starts with:

```
build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2
```

and the worker node starts with:

```
build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2
```

Submitting a job on the master node (for example: `python job_submit.py localhost 8080 /bin/sleep 60`) then leads to a crash inside the worker node.

Talking to @ms705, it seems the root of the problem lies in the worker node's recalculation of the flow:

> What *should* happen is that the subordinate coordinator ends up running a flow scheduler itself, which it can use to schedule within its more restricted window of visibility into the cluster state; remotely placed tasks would have to be reflected in that flow scheduler's flow graph, which they aren't (hence the error).
ms705 commented 7 years ago

This issue arises due to bugs in the code that handles delegated tasks: the flow scheduler does not correctly maintain its data structures w.r.t. tasks that were placed on local resources by a superior coordinator.

We didn't previously notice this because we used a single flow scheduler in the cluster -- the flow scheduling approach works best when the whole cluster state is visible to the scheduler -- and ran simple, queue-based schedulers with subordinate coordinators. (This is also the workaround for the bug: use `--scheduler=simple` for subordinate coordinators.)
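Concretely, the workaround only changes the scheduler flag on the subordinate (worker) coordinator; the master's hostname below is the placeholder from the report above:

```shell
# Subordinate coordinator with the queue-based scheduler instead of flow;
# this avoids the crash because no local flow graph needs to track
# delegated tasks.
build/src/coordinator --scheduler=simple \
  --parent_uri tcp:firmament.masternode.com:8000 \
  --listen_uri tcp:0.0.0.0:8000 \
  --task_lib_dir=$(pwd)/build/src --v=2
```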

Fixing this will require correct handling of delegated tasks in FlowScheduler, i.e., having FlowScheduler override the relevant method implementations it currently inherits from EventDrivenScheduler.