Open Mythra opened 7 years ago
This issue arises due to bugs in the code that handles delegated tasks: the flow scheduler does not correctly maintain its data structures w.r.t. tasks that were placed on local resources by a superior coordinator.
We didn't previously notice this because we used a single flow scheduler in the cluster -- the flow scheduling approach works best when the whole cluster state is visible to the scheduler -- and ran simple, queue-based schedulers with subordinate coordinators. (This is also the workaround for the bug: use --scheduler=simple for subordinate coordinators.)
Fixing this will require correct handling of delegated tasks in FlowScheduler, i.e., overriding the relevant implementations in EventDrivenScheduler.
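To make the shape of that fix concrete, here is a minimal sketch in Python rather than the project's own language. Every name in it (EventDrivenSchedulerSketch, FlowSchedulerSketch, place_delegated_task, handle_task_completion, recompute_flow, graph_task_nodes) is hypothetical and chosen only for illustration; none of them is taken from the actual code base. It assumes the base scheduler records delegated placements in a plain binding map, while the flow scheduler keeps a second structure that must stay in sync with it.

```python
class EventDrivenSchedulerSketch:
    """Hypothetical stand-in for the base scheduler: tracks task bindings only."""

    def __init__(self):
        self.task_bindings = {}  # task_id -> resource_id

    def place_delegated_task(self, task_id, resource_id):
        # A superior coordinator has already decided where this task runs.
        self.task_bindings[task_id] = resource_id

    def handle_task_completion(self, task_id):
        self.task_bindings.pop(task_id, None)


class FlowSchedulerSketch(EventDrivenSchedulerSketch):
    """Hypothetical flow scheduler with its own per-task graph state."""

    def __init__(self):
        super().__init__()
        self.graph_task_nodes = {}  # task_id -> resource_id, mirrors the bindings

    def place_delegated_task(self, task_id, resource_id):
        # Override: update the base bindings AND the flow-specific structure.
        super().place_delegated_task(task_id, resource_id)
        self.graph_task_nodes[task_id] = resource_id

    def handle_task_completion(self, task_id):
        super().handle_task_completion(task_id)
        self.graph_task_nodes.pop(task_id, None)

    def recompute_flow(self):
        # Without the overrides above, a delegated task would be bound but
        # absent from the graph, and this consistency check would fail.
        missing = set(self.task_bindings) - set(self.graph_task_nodes)
        if missing:
            raise RuntimeError(f"tasks missing from flow graph: {sorted(missing)}")
        return dict(self.graph_task_nodes)
```

If the two overrides were removed from FlowSchedulerSketch, a delegated placement would update only task_bindings and recompute_flow would raise. That is a toy analogue of the inconsistency described above, not a reproduction of the real crash.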
When you start a "master" node and a "worker" node, both using the flow scheduler (e.g.:
Master node starts with:
and Worker node starts with:
),
and submit a job (for example:
python job_submit.py localhost 8080 /bin/sleep 60
on the master node), the worker node crashes. After talking to @ms705, it seems the root of the problem lies in the worker node's recalculation of the flow: