cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

resource_monitor parrot_run cmd with bogus messages #959

Closed btovar closed 9 years ago

btovar commented 9 years ago

resource_monitor reads messages from child processes over a socket. When monitoring parrot, the resource_monitor receives bogus messages with id = 0 (BRANCH, that is, fork) and error = 2. A branch message cannot generate error code 2, so the message must be bogus. The current band-aid is to start counting message ids after 0.

I am not sure whether the bug is in the monitor, in parrot, or both.
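
To make the band-aid concrete, here is a minimal sketch of the idea, not the actual monitor code. The names `monitor_msg`, `msg_type`, `MSG_END_WAIT`, and `msg_is_plausible` are hypothetical, and the field layout is assumed for illustration. It numbers message ids starting at 1, so an all-zero buffer never decodes as BRANCH, and it rejects a BRANCH message carrying error 2 (ENOENT on Linux).

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical message ids. Numbering starts at 1 so that a
   zero-filled buffer cannot be mistaken for a BRANCH message. */
enum msg_type {
    MSG_BRANCH = 1,  /* a monitored process forked */
    MSG_END_WAIT,    /* illustrative second id, not taken from the source */
    MSG_TYPE_MAX     /* one past the last valid id */
};

/* Hypothetical wire format for a message sent by a child process. */
struct monitor_msg {
    int32_t type;    /* one of enum msg_type */
    int32_t origin;  /* pid of the sending process */
    int32_t error;   /* errno-style code, 0 when there is no error */
};

/* Returns 1 if the message looks valid, 0 if it should be dropped. */
int msg_is_plausible(const struct monitor_msg *m)
{
    /* Unknown ids, including 0, are rejected. */
    if (m->type < MSG_BRANCH || m->type >= MSG_TYPE_MAX)
        return 0;

    /* Assumption: a real sender always has a positive pid. */
    if (m->origin <= 0)
        return 0;

    /* A branch message cannot generate error code 2 (ENOENT on Linux). */
    if (m->type == MSG_BRANCH && m->error == ENOENT)
        return 0;

    return 1;
}
```

With this check, a zero-filled message and a BRANCH message with error 2 are both dropped, while a BRANCH message with error 0 from a positive pid is accepted. This only hides the symptom; it does not explain where the bogus messages come from.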

btovar commented 9 years ago

The 'good news' is that the messages are not random. They always happen in the same place, and with the same content. In master, if you run:

resource_monitor parrot_run ls

You'll see that parrot dies. This is because the monitor thinks the bogus message has a valid error code.

batrick commented 9 years ago

I don't see it on Arch:

Linux neverwinter.batbytes.com 4.2.2-1-ARCH #1 SMP PREEMPT Tue Sep 29 22:21:33 CEST 2015 x86_64 GNU/Linux

I'll give it a try on RHEL.

batrick commented 9 years ago

No luck on ccl12 either. Where are you running this?

btovar commented 9 years ago

On cclws16: Linux cclws16 2.6.32-431.11.2.el6.x86_64 #1 SMP Mon Mar 3 13:32:45 EST 2014 x86_64 x86_64 x86_64 GNU/Linux

Commit: ef3ab79b02b2ed054026565e18bd6e37c02be722

btovar commented 9 years ago

@batrick: it is not parrot; I found a problem in the resource monitor.

btovar commented 9 years ago

Closed via 5ae701a5d387bbf918793baa8611c7