camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

Output from flowlessly solver invalid on second invocation unless --only_read_assignment_changes is specified #25

Open ms705 opened 9 years ago

ms705 commented 9 years ago

Invoking flowlessly in default (i.e., non-incremental) mode returns output that the dispatcher's parsing logic fails to understand, because node type information is lacking.

Steps to reproduce:

  1. Invoke the coordinator as build/engine/coordinator --logtostderr --scheduler flow --flow_scheduling_cost_model 2 --v=1 --flow_scheduling_solver=flowlessly --debug_flow_graph
  2. Submit a job.
  3. Submit another job.
  4. Observe the error in the output:
I0626 16:53:01.913985  6803 solver_dispatcher.cc:191] Writing flow graph debug info into /tmp/firmament-debug/debug_1.dm
I0626 16:53:01.914111  6803 utils.cc:307] External execution of command: ext/flowlessly-git/run_fast_cost_scaling --graph_has_node_types=true --global_update=false --daemon=false
I0626 16:53:01.915642  6803 utils.cc:346] Subprocess with PID 7002 created.
E0626 16:53:01.917798  6803 solver_dispatcher.cc:562] Unknown type of row in flow graph: m 38 24
I0626 16:53:01.917911  6803 utils.cc:370] Subprocess with PID 7002 exited with status 0
I0626 16:53:01.917989  6803 flow_scheduler.cc:135] Applying 0 scheduling deltas...

Inspecting the file in question shows that it is an incremental delta:

$ cat  /tmp/firmament-debug/debug-flow_1.dm
m 38 24
c EOI

... but the system isn't expecting assignment changes to be returned.
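To make the mismatch concrete, here is a minimal sketch of a row classifier that only accepts "m" rows when assignment changes are expected. It is not the code in solver_dispatcher.cc: `RowKind`, `ParsedRow` and `ParseSolverRow` are hypothetical names, the "f <src> <dst> <flow>" row is the standard DIMACS flow line, and reading the two integers after "m" as a pair of node IDs is an assumption based on the delta shown above.

```cpp
#include <sstream>
#include <string>

// Hypothetical row kinds; illustrative names, not Firmament's own.
enum class RowKind {
  kComment,           // "c ..." comment line
  kEndOfIteration,    // "c EOI" marker, as seen in the delta file above
  kFlow,              // "f <src> <dst> <flow>" (DIMACS flow line)
  kAssignmentChange,  // "m <a> <b>" (assumed: a pair of node IDs)
  kUnknown            // anything the parser cannot classify
};

struct ParsedRow {
  RowKind kind = RowKind::kUnknown;
  long src = 0;
  long dst = 0;
  long flow = 0;
};

// Classifies one line of solver output. "m" rows are only accepted when the
// caller expects assignment changes; otherwise they come back as kUnknown,
// which mirrors the "Unknown type of row" error in the log above.
ParsedRow ParseSolverRow(const std::string& line,
                         bool expect_assignment_changes) {
  ParsedRow row;
  std::istringstream in(line);
  char tag = 0;
  if (!(in >> tag)) {
    return row;  // empty line
  }
  if (tag == 'c') {
    std::string rest;
    in >> rest;
    row.kind = (rest == "EOI") ? RowKind::kEndOfIteration : RowKind::kComment;
    return row;
  }
  if (tag == 'f') {
    if (in >> row.src >> row.dst >> row.flow) {
      row.kind = RowKind::kFlow;
    }
    return row;
  }
  if (tag == 'm' && expect_assignment_changes) {
    if (in >> row.src >> row.dst) {
      row.kind = RowKind::kAssignmentChange;
    }
    return row;
  }
  return row;  // unrecognised tag, or an "m" row that was not expected
}
```

With `expect_assignment_changes` set to false, the row "m 38 24" is classified as unknown; with it set to true, the same row parses as an assignment change between nodes 38 and 24.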

This suggests to me that we should either:

  1. make the --only_read_assignment_changes flag implicit when --flow_scheduling_solver is set to "flowlessly";
  2. not do the above, but fail if the flag is not set when --flow_scheduling_solver is set to "flowlessly";
  3. remove the special cases for flowlessly and cs2 and simply allow the user to specify a solver binary plus the appropriate combination of --incremental_flow and --only_read_assignment_changes, making it their responsibility to get it right;
  4. shelve this until the fast delta extraction code has been back-ported into the main code base, at which point flowlessly can return the entire flow, just as cs2 does.

(3) seems painful for the user, and (4) seems inefficient to me. Maybe go for (2)?
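For discussion, options (1) and (2) could look roughly like the sketch below. It uses a plain struct rather than the real flag definitions, and `SolverFlags`, `CheckSolverFlags` and `ImplyAssignmentChanges` are hypothetical names that do not exist in the code base; only the flag semantics are taken from this issue.

```cpp
#include <string>

// Hypothetical stand-in for the two command-line flags discussed here.
struct SolverFlags {
  std::string solver;                         // e.g. "flowlessly" or "cs2"
  bool only_read_assignment_changes = false;  // parse "m" rows only
};

// Option (2): reject the inconsistent combination up front.
// Returns an empty string if the flags are consistent, else an error message.
std::string CheckSolverFlags(const SolverFlags& flags) {
  if (flags.solver == "flowlessly" && !flags.only_read_assignment_changes) {
    return "flowlessly returns assignment changes; "
           "--only_read_assignment_changes must be set";
  }
  return "";
}

// Option (1): make the flag implicit whenever flowlessly is selected.
SolverFlags ImplyAssignmentChanges(SolverFlags flags) {
  if (flags.solver == "flowlessly") {
    flags.only_read_assignment_changes = true;
  }
  return flags;
}
```

Option (2) keeps the behaviour explicit and surfaces the problem at startup instead of as a parse error on the second scheduling round; option (1) is friendlier but silently overrides what the user passed.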