camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0
415 stars 79 forks source link

Segfault on job submission after protobuf3 transition #51

Closed ms705 closed 7 years ago

ms705 commented 7 years ago

As reported by @shivramsrivastava in #50:

We are getting this following error after we refreshed our branch with the latest changes we are getting a SIGSEGV.

#0  0x00000000007688ea in google::protobuf::Message::GetDescriptor() const ()
#1  0x000000000089bbdd in google::protobuf::internal::ReflectionOps::Merge(google::protobuf::Message const&, google::protobuf::Message*) ()
#2  0x00000000007688ff in google::protobuf::Message::GetDescriptor() const ()
#3  0x000000000089e594 in google::protobuf::TextFormat::Parser::Parse(google::protobuf::io::ZeroCopyInputStream*, google::protobuf::Message*) ()
#4  0x000000000089e6b4 in google::protobuf::TextFormat::Parser::ParseFromString(std::string const&, google::protobuf::Message*) ()
#5  0x000000000089ed54 in google::protobuf::TextFormat::ParseFromString(std::string const&, google::protobuf::Message*) ()
#6  0x00000000006a5d28 in firmament::webui::CoordinatorHTTPUI::HandleJobSubmitURI (this=0xc51d50, http_request=..., tcp_conn=...)
    at /home/ubuntu/workspace/src/firmament/src/engine/coordinator_http_ui.cc:157

Line where the error occurs.

156       google::protobuf::TextFormat::ParseFromString(job_descriptor_param,
157                                                     &job_descriptor);

Content of 'job_descriptor_param'.

$9 = "name: \"anonymous_job_at_1477412128\"root_task {  name: \"root_task\"  dependencies {    id: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\"    type: CONCRETE    location: \"blob:/tmp/fib_in\"  }  outputs {    id: \"\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\"    type: FUTURE    non_deterministic: true    location: \"blob:/tmp/out1\"  }  outputs {    id: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\"    type: FUTURE    non_deterministic: true    location: \"blob:/tmp/out2\"  }  binary: \"openssl\"  args: \"speed\"  args: \"sha512\"  inject_task_lib: true  resource_request {    cpu_cores: 0.1    ram_cap: 128  }  priority: 5}output_ids: \"\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\"output_ids: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\""

This error relates to the protobuf3 transition: the job submission script still generates Python code using protobuf2, and the text format therefore becomes incompatible.

There are two solutions:

  1. The immediate fix: use protobuf3 for the job submission script.
  2. Longer-term solution: since protobuf3 supports JSON decoding (one of the reasons we moved to it), we can just move the job submission script to send JSON.
ms705 commented 7 years ago

Having dug into this, it looks like we're facing a deeper issue here -- simply moving to protobuf3 for the job submission script did not help. Decoding a text format or JSON format protobuf sent via HTTP fails with a segfault inside protobuf for me no matter what I do. I'm working on a smaller test case for this, so that we can insolate the responsible piece of code (which could be the pion web server, or protobuf, or the job submission script).

shivramsrivastava commented 7 years ago

@ms705 I wrote a job submission code in go, same as the Python code. While submitting the job I get a 500 error code(internal server error). I will fix that issue tomorrow and test it. Then we can confirm if the problem is with the job script.

ms705 commented 7 years ago

@shivramsrivastava I've actually managed to pin this down -- a fix is in the works.

The problem was related to an old protobuf2 header being used by one of the libraries we link (pb2json). Since we don't need that library any more now that protobuf3 has native JSON support, it can be removed.

ms705 commented 7 years ago

Fix now under review in 299814.

Note that this change also moves the job submission script to using JSON -- you will probably want to update your Go code to do the same.