camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0
415 stars 79 forks source link

Fix to handle the crash if sysfs network speed pseudofile is unavailable. #50

Closed shivramsrivastava closed 7 years ago

shivramsrivastava commented 7 years ago

Fix for issue #48.

ms705 commented 7 years ago

Hi @shivramsrivastava,

Thanks for the PR! Can you try pushing it to GerritHub for review? (The automatic PR import unfortunately fails for this one).

$ git remote add codereview https://review.gerrithub.io/camsas/firmament
$ git push codereview HEAD:refs/for/master

If you encounter permission issues, let us know -- I think anyone is allowed to push, but Gerrit permissions are a bit byzantine :wink:

shivramsrivastava commented 7 years ago

I tried to push it on GerritHub, but couldn't.

ubuntu@:~/workspace/src/firmament$ git remote add codereview https://review.gerrithub.io/ms705/firmament
ubuntu@:~/workspace/src/firmament$ git push codereview HEAD:refs/for/master
Username for 'https://review.gerrithub.io': shivramsrivastava
Password for 'https://shivramsrivastava@review.gerrithub.io':
fatal: Authentication failed for 'https://review.gerrithub.io/ms705/firmament/'

Also tried to clone from Gerrit repo and push but that also failed.

This project repo doesn't exist. https://review.gerrithub.io/camsas/firmament

So tried with 'https://review.gerrithub.io/ms705/firmament' ?

ms705 commented 7 years ago

Hi @shivramsrivastava,

Ah, this is a consequence of us having moved the project to the CamSaS organization account a while ago. Good catch! I've added it correctly on Gerrithub now, so adding https://review.gerrithub.io/camsas/firmament as a remote should work now. The PR import also worked, so we have the change in Gerrit already.

PR review in 298589. @ICGog, can you also take a look?

Thanks!

ms705 commented 7 years ago

I've left some review comments. To address them, do the following:

With the Gerrithub repo (https://review.gerrithub.io/camsas/firmament) as a remote called codereview, first move to the relevant change (you can get this command in the Gerrit changeset review UI via "Download" -> "Checkout":

$ git fetch https://review.gerrithub.io/camsas/firmament refs/changes/89/298589/1 && git checkout FETCH_HEAD

Then make your edits, and commit them as an amendment to the existing change:

$ git add -p
$ git commit --amend

And finally push an updated change to Gerrithub:

$ git push codereview HEAD:refs/for/master
shivramsrivastava commented 7 years ago

Thanks for the review @ms705 , i will do the changes.

shivramsrivastava commented 7 years ago

We had some issues with our machines, so couldn't check in the code last week. We are getting this following error after we refreshed our branch with the latest changes we are getting a SIGSEGV.

#0  0x00000000007688ea in google::protobuf::Message::GetDescriptor() const ()
#1  0x000000000089bbdd in google::protobuf::internal::ReflectionOps::Merge(google::protobuf::Message const&, google::protobuf::Message*) ()
#2  0x00000000007688ff in google::protobuf::Message::GetDescriptor() const ()
#3  0x000000000089e594 in google::protobuf::TextFormat::Parser::Parse(google::protobuf::io::ZeroCopyInputStream*, google::protobuf::Message*) ()
#4  0x000000000089e6b4 in google::protobuf::TextFormat::Parser::ParseFromString(std::string const&, google::protobuf::Message*) ()
#5  0x000000000089ed54 in google::protobuf::TextFormat::ParseFromString(std::string const&, google::protobuf::Message*) ()
#6  0x00000000006a5d28 in firmament::webui::CoordinatorHTTPUI::HandleJobSubmitURI (this=0xc51d50, http_request=..., tcp_conn=...)
    at /home/ubuntu/workspace/src/firmament/src/engine/coordinator_http_ui.cc:157

Line where the error occurs.

156       google::protobuf::TextFormat::ParseFromString(job_descriptor_param,
157                                                     &job_descriptor);

Content of 'job_descriptor_param'.

$9 = "name: \"anonymous_job_at_1477412128\"root_task {  name: \"root_task\"  dependencies {    id: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\"    type: CONCRETE    location: \"blob:/tmp/fib_in\"  }  outputs {    id: \"\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\"    type: FUTURE    non_deterministic: true    location: \"blob:/tmp/out1\"  }  outputs {    id: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\"    type: FUTURE    non_deterministic: true    location: \"blob:/tmp/out2\"  }  binary: \"openssl\"  args: \"speed\"  args: \"sha512\"  inject_task_lib: true  resource_request {    cpu_cores: 0.1    ram_cap: 128  }  priority: 5}output_ids: \"\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\\3333\\332\\272(\\r\\216h\\356\\246\\344\\220r;\\002\\316\"output_ids: \"\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\\376\\355\\312\\376\\336\\255\\276\\357\""

Is this happening because of the changes relating to proto3 ?

ms705 commented 7 years ago

Hi @shivramsrivastava,

Yes, this looks like a protobuf3-related error. I will take a look and try to reproduce it.

If you check out the change on Gerrithub (git fetch https://review.gerrithub.io/camsas/firmament refs/changes/89/298589/1 && git checkout FETCH_HEAD), you should be able to go back to the code as it was when you made the changes. However, you should remove your build tree and rebuild from scratch if you do so. Once you've updated the changeset, we can rebase the change and merge it with the upstream changes.

ms705 commented 7 years ago

@shivramsrivastava

Actually, my bad -- the patch set has already been rebased onto the protobuf3 code base. Nevertheless, you can push the changes there, since they're stylistic and unrelated to protobuf3. I'll investigate the issue you're experiencing, though.

shivramsrivastava commented 7 years ago

@ms705 You want me to checkout from gerrit or push from the current rebase'd branch?

ms705 commented 7 years ago

@shivramsrivastava

  1. In a working copy without any outstanding changes, do a checkout of the change from Gerrit using the above command.
  2. Your working copy will then be at your original commit, and you can make changes.
  3. Add the changes into the commit via git commit --amend.
  4. Finally, you push the (modified) commit into Gerrit (git push codereview HEAD:refs/for/master).
  5. If you have any local commits that depend on your original commit, you will need to rebase those.

Does that help?

shivramsrivastava commented 7 years ago

Ok will do that. Thanks @ms705.

shivramsrivastava commented 7 years ago

Looks like I still don't have access to push code to Gerrit repo?

ubuntu@firmament-master:~/workspace/src/firmament$ git push codereview HEAD:refs/for/master
Username for 'https://review.gerrithub.io': shivram.srivastava@huawei.com
Password for 'https://shivram.srivastava@huawei.com@review.gerrithub.io':
fatal: Authentication failed for 'https://review.gerrithub.io/camsas/firmament/'
ms705 commented 7 years ago

@shivramsrivastava Ah, you might need to connect Gerrit to your GitHub account. Just go to http://gerrithub.io/, and click "First time sign-in". When asked to import GitHub repos, you can either skip the step (importing nothing) or import your Firmament fork. After you've connected Gerrit to your GitHub account, you can authenticate using your GitHub credentials (user: shivramsrivastava) or even use SSH keys to push.

I'll make a note to better document this process.

shivramsrivastava commented 7 years ago

@ms705 I remember doing import the first I logged in gerrit. I think the firmament gerrit link was not correct and I imported others.

I will check again.

ms705 commented 7 years ago

@shivramsrivastava I just double-checked, and there was still a misconfiguration in the Gerrit permissions -- sorry about that! Can you try again?

ms705 commented 7 years ago

@shivramsrivastava Did you have a chance to check this again? Sorry about the complication -- we would love to get the patch merged.

shivramsrivastava commented 7 years ago

@ms705 Sorry for the late reply. I had checked-in the code in gerrit on Thursday itself. I though gerrit will send mails to the reviewers? I think after the last permission setting which you did I was able to check-in the code in gerrit.

Its a festive season in india, we are on a long holiday. I didn't see your mails. If you have further review comments, please let me know. I will do it at the earliest.

shivramsrivastava commented 7 years ago

@ms705 i just added the reviewers in gerrit, i didn't see that was empty. https://review.gerrithub.io/#/c/299907/

ms705 commented 7 years ago

@shivramsrivastava Thanks, and no worries -- enjoy the festive season!

I've left two more minor review comments, and we should be good to merge after those are addressed :+1:

P.S. When pushing an update to a change in Gerrit, please try to push to the old change set when addressing review comments, rather than creating a new one. Importantly, don't remove the the "Change-Id" line from the commit message, as that is what Gerrit uses to match pushes to existing changes. Thanks!

shivramsrivastava commented 7 years ago

@ms705 I left a comment. 'https://review.gerrithub.io/#/c/299907/1/src/platforms/unix/procfs_machine.h' looks like Gerrit is not sending mail if there is a reply to a comment?

Malte Schwarzkopf
If you're resetting errno anyway, these lines can be simplified to just two lines: fscanf(input, "%ju", x); return errno; Oct 30 9:17 PM
Draft
Nov 4 4:07 PM
If we don't handle the return from 'fscanf', it shows as a warning message during compile time.
If we reset the 'errno' upon entry in the function, we should also reset it before exiting the function, since 'errno' are not rest automatically and this could cause other function to crash, which check the 'errno' and which don't reset the 'errno' upon entry?
Once a new 'errno' is set either the calling function or the function which set the 'errno' should reset the 'errno' or the 'errno' will be floating and it might cause a crash in other function.
shivramsrivastava commented 7 years ago

@ms705 I have done the review changes you requested and pushed it in Gerrit. https://review.gerrithub.io/#/c/299907/

ms705 commented 7 years ago

@shivramsrivastava Gerrit sends emails once you click the "Reply" button on the main page for the change, not for each individual comment. We did get an email around the time you posted here.

Your fix is now merged as 3aeb8d4094c6805f8b904b777c54ac349e491df4; this PR refers to the old code in 298589, so I'm closing it. Thanks for the contribution!