camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0
415 stars 79 forks source link

Enabling flow scheduler crashes during resource topology addition #14

Closed ms705 closed 9 years ago

ms705 commented 9 years ago

When running the coordinator with a flow scheduler and any cost model, it crashes on startup (as of recent versions):

$ build/engine/coordinator --logtostderr --scheduler=quincy --v=3 --flow_scheduling_cost_model=6 --debug_flow_graph
[...]
F0329 15:29:24.509809 19645 octopus_cost_model.cc:34] Check failed: 'dst_rs_ptr' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7fb271dc9996  google::DumpStackTraceAndExit()
    @     0x7fb271dc10fd  google::LogMessage::Fail()
    @     0x7fb271dc2fb2  google::LogMessage::SendToLog()
    @     0x7fb271dc0c9f  google::LogMessage::Flush()
    @     0x7fb271dc384e  google::LogMessageFatal::~LogMessageFatal()
    @           0x7f4554  google::CheckNotNull<>()
    @           0x80de85  firmament::OctopusCostModel::ResourceNodeToResourceNodeCost()
    @           0x837122  firmament::FlowGraph::ConfigureResourceBranchNode()
    @           0x836d40  firmament::FlowGraph::AddResourceNode()
    @           0x84383a  boost::_mfi::mf1<>::operator()()
    @           0x843790  boost::_bi::list2<>::operator()<>()
    @           0x8436e2  boost::_bi::bind_t<>::operator()<>()
    @           0x843488  boost::detail::function::void_function_obj_invoker1<>::invoke()
    @           0x87ba20  boost::function1<>::operator()()
    @           0x87adb2  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x836034  firmament::FlowGraph::AddResourceTopology()
    @           0x83a672  firmament::FlowGraph::UpdateResourceNode()
    @           0x84383a  boost::_mfi::mf1<>::operator()()
    @           0x843790  boost::_bi::list2<>::operator()<>()
    @           0x8436e2  boost::_bi::bind_t<>::operator()<>()
    @           0x843488  boost::detail::function::void_function_obj_invoker1<>::invoke()
    @           0x87ba20  boost::function1<>::operator()()
    @           0x87adb2  firmament::BFSTraverseResourceProtobufTreeReturnRTND()
    @           0x832ac0  firmament::FlowGraph::UpdateResourceTopology()
    @           0x832a1d  firmament::FlowGraph::AddMachine()
    @           0x7f0f95  firmament::scheduler::QuincyScheduler::UpdateResourceTopology()
    @           0x7f3026  firmament::scheduler::QuincyScheduler::RegisterResource()
    @           0x6f0718  firmament::Coordinator::AddResource()
    @           0x72fe13  boost::_mfi::mf3<>::operator()()
    @           0x72fd58  boost::_bi::list4<>::operator()<>()
    @           0x72fc62  boost::_bi::bind_t<>::operator()<>()
    @           0x72f9bb  boost::detail::function::void_function_obj_invoker1<>::invoke()

It looks to me like we've got two nested BFS traversals going on here, one triggered from FlowGraph::UpdateResourceTopology and one from FlowGraph::AddResourceTopology, which seems a bit funky. I believe this was introduced by 502af61f4b1b47b6ae0d5f120716446d0df50b05 via FlowGraph::AddMachine, but it's not the root of the problem -- if I change FlowGraph::AddMachine to call FlowGraph::AddResourceTopology directly, things still fail.

Any ideas? (@ICGog?)

ms705 commented 9 years ago

Hmm, this appears to be related to a local change of mine (i.e. upstream isn't broken).

I changed the signature of AddResource to take a RTND* rather than a ResourceDescriptor* in order to be able to traverse the resource topology in the cost models. As part of this, I replaced instances of DFSTraverseResourceProtobufTree with DFSTraverseResourceProtobufTreeReturnRTND, which seems to introduce the error.

I'll debug this further; the change should be innocuous, but clearly introduces problems.

And relatedly, I'm nevertheless unsure that the two nested BFS traversals are required.

ms705 commented 9 years ago

Fixed in a5c08d8be57a3f03387862e9ecd07e77b6c712fc -- the problem was the nesting of DFS and BFS invocations. Transforming the outermost invocation to a BFS fixed the problem (modulo the necessity of nested traversals).