camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

Submitting >1 job crashes the coordinator #6

Closed (ms705 closed this issue 11 years ago)

ms705 commented 11 years ago

This happens because all root tasks currently have the same name: the name is derived by hashing the ID of the creating task, which is zero for the root task at job submission.

The engine, however, assumes that it has never seen a task before, and will CHECK-fail if it sees the same TaskID arriving again.

Required fixes:
1) Change the hashing scheme so that different jobs' root tasks have different names.
2) Sensibly deal with resubmissions of the same task.
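As a rough illustration of the second fix, the sketch below reports a duplicate binding to the caller instead of aborting. It is not Firmament code: `BindOutcome` and `BindTaskOnce` are hypothetical names, and the bindings map is a simplified stand-in that maps a task ID to a resource ID string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using TaskID = uint64_t;

// Hypothetical outcome type for this sketch; not part of Firmament.
enum class BindOutcome { kBound, kAlreadyBound };

// Records task -> resource unless the task is already bound. A repeated
// TaskID is reported to the caller rather than treated as a fatal error,
// and the existing binding is left untouched.
BindOutcome BindTaskOnce(std::unordered_map<TaskID, std::string>* bindings,
                         TaskID task_id, const std::string& resource_id) {
  const bool inserted = bindings->insert({task_id, resource_id}).second;
  return inserted ? BindOutcome::kBound : BindOutcome::kAlreadyBound;
}
```

With this shape, the caller decides what a resubmission means (reject it, ignore it, or reschedule), rather than the coordinator dying on the second job.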

ICGog commented 11 years ago

Example stack trace:

F1130 14:52:55.849515 10305 simple_scheduler.cc:54] Check failed: InsertIfNotPresent(&taskbindings, task_desc->uid(), ResourceIDFromString(res_desc->uuid()))
* Check failure stack trace: *
    @ 0x7f7902ebf7b9 google::LogMessage::SendToLog()
    @ 0x7f7902ebfc37 google::LogMessage::Flush()
    @ 0x7f7902ec2ac2 google::LogMessageFatal::~LogMessageFatal()
    @ 0x4c8412 firmament::scheduler::SimpleScheduler::BindTaskToResource()
    @ 0x4cc9e4 firmament::scheduler::SimpleScheduler::ScheduleJob()
    @ 0x43881b firmament::Coordinator::SubmitJob()
    @ 0x48656f firmament::webui::CoordinatorHTTPUI::HandleJobSubmitURI()
    @ 0x4902c2 boost::_mfi::mf2<>::operator()()
    @ 0x49022a boost::_bi::list3<>::operator()<>()
    @ 0x49016a boost::_bi::bind_t<>::operator()<>()
    @ 0x48fedd boost::detail::function::void_function_obj_invoker2<>::invoke()
    @ 0x7f7901e18a97 (unknown)
    @ 0x7f7901e125cd (unknown)
    @ 0x7f7901e18c0a (unknown)
    @ 0x7f7901e1068f (unknown)
    @ 0x7f7901e10b00 (unknown)
    @ 0x7f7901e1a905 (unknown)
    @ 0x7f7902072dc2 (unknown)
    @ 0x7f790279dce9 (unknown)
    @ 0x7f7902ca0e9a start_thread
    @ 0x7f79009eecbd (unknown)
Aborted (core dumped)

ms705 commented 11 years ago

Workaround committed in 710ee79a1402b4834f9b51e667dcee405cf580cb; it generates root task IDs from the (currently randomly generated) job name. The drawback is that task naming is no longer deterministic. A better scheme would generate names from a task's inputs, and optionally its parent task (as in CIEL), achieving determinism.
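To make the two naming options concrete, here is a minimal sketch. It is an assumption-laden illustration, not the code from the commit and not CIEL's actual scheme: the function names are hypothetical, inputs are modelled as plain strings, and FNV-1a is used only because it is a simple hash whose output does not vary between standard library implementations.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over a byte string, optionally continuing from a previous
// hash value so that several fields can be chained together.
uint64_t Fnv1a64(const std::string& data,
                 uint64_t seed = 14695981039346656037ULL) {
  uint64_t h = seed;
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Workaround-style naming (hypothetical helper): the root task ID depends
// on the job name, so a random job name gives a non-reproducible ID.
uint64_t RootTaskIDFromJobName(const std::string& job_name) {
  return Fnv1a64(job_name);
}

// Deterministic naming (hypothetical helper): the ID depends only on the
// parent task ID and the task's inputs, so identical submissions map to
// the same ID on every run.
uint64_t TaskIDFromInputs(uint64_t parent_id,
                          const std::vector<std::string>& inputs) {
  uint64_t h = Fnv1a64(std::to_string(parent_id));
  for (const auto& input : inputs) {
    h = Fnv1a64(input, h);
    // Hash a separator byte so that input boundaries affect the result.
    h = Fnv1a64(std::string(1, '\0'), h);
  }
  return h;
}
```

Note that input-based naming makes identical submissions collide on purpose, so it only works if the scheduler tolerates seeing a known task ID again.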

More broadly, crashes were caused by the bug fixed in 30d46c78fad74f3bbaefada6b5f3baaec5c56615, which turned out to be incorrect parsing of the TASK_ID environment variable in task_lib.cc (it was interpreted as a signed long, not a uint64_t).
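For illustration, one way to read such an ID without going through a signed type is sketched below. This is not the code in task_lib.cc: `ParseTaskID` and `TaskIDFromEnv` are hypothetical helpers, and the sketch assumes the variable holds a plain decimal number.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Parses a decimal unsigned 64-bit task ID. Returns false for a null or
// empty string, a sign or leading whitespace, trailing characters, or a
// value that does not fit in 64 bits.
bool ParseTaskID(const char* text, uint64_t* out) {
  if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text))) {
    return false;
  }
  errno = 0;
  char* end = nullptr;
  const unsigned long long value = std::strtoull(text, &end, 10);
  if (errno == ERANGE || *end != '\0') {
    return false;
  }
  *out = static_cast<uint64_t>(value);
  return true;
}

// Reads the TASK_ID environment variable; fails if it is unset or invalid.
bool TaskIDFromEnv(uint64_t* out) {
  return ParseTaskID(std::getenv("TASK_ID"), out);
}
```

Hash-derived IDs use the full 64-bit range, so roughly half of them exceed the largest value a signed 64-bit long can hold. A signed parse cannot represent those IDs, whereas `std::strtoull` covers the whole range.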

f9aa306185d75c76f689f1ef8a306986675e74de also updates the web UI to correctly display unsigned 64-bit integer IDs.