flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
84 stars 39 forks source link

rs2rank: add support to link resrc to broker rank and tests #109

Closed dongahn closed 8 years ago

dongahn commented 8 years ago

Add support to correctly identify and target the controlling brokers to spawn a job, a fix needed to introduce necessary tests. In addition, this PR includes some refactoring work: i.e., introducing a bridge layer to hide normal vs. emulation scheduling mode as well as a bit better option/argument processing.

Other changes are all testing related:

Note that this PR includes fairly substantial changes, and I will appreciate if folks can help review this to minimize the errors that I probably have introduced.

The description of changes is captured in the commit messages. Commits in reverse chronological order:

I at least made sure each commit builds and passes all tests on LLNL's cab machine. But this PR itself will fail in Travis because two pending flux-core PRs need to be merged into the core to be compatible with the latest changes.

dongahn commented 8 years ago

I realized that the rs2rank testing data (e.g., cab fake hwloc files) included in the last commit precluded github to display testing scripts themselves. I am breaking that commit into two for better review display: first, testing data commit and second testing scripts commit.

dongahn commented 8 years ago

Also, rebased.

SteVwonder commented 8 years ago

:+1: The emulator-related and arg-parsing changes look good. Centralizing the emulator-specific branches was a good idea.

dongahn commented 8 years ago

Thanks @SteVwonder.

lipari commented 8 years ago

Very impressive! Nice work! Did the build fail because flux-core PR 484 hasn't been merged yet?

dongahn commented 8 years ago

Thanks. This needs flux-core PR#484 and #506 to be merged to pass Travis CI.

dongahn commented 8 years ago

@lipari: As I was testing the branch for this PR along with the latest change in my other flux-core PR #506, I hit an intermittent crash within the resrc unit test. I took a liberty to push a minor fix into this PR and also to push some other minor changes to t/t1003-stress.t as well, which will help with my next step.

Sorry to push the changes without giving you a heads-up first. I just thought those were minor enough... In any case, I will have this PR frozen from now on until you or others suggest further changes for improvement.

dongahn commented 8 years ago

Hmmm. Somehow, jsc and waitjob sharness tests fail on the latest flux-core while rs2rank tests all succeed... I will take a look as soon as practical.

dongahn commented 8 years ago

Why is dump_resrc_state() called unconditionally? It is an expensive and wasteful function call if the log level is not set to LOG_DEBUG.

Same comment as above: call only if the log level is set to debug.

Good catch @lipari. I can see two ways to address this. 1) perhaps use log_get_level and add a conditional before calling dump_resrc_state() or 2) actually add a new switch to the schedsrv itself (e.g., debug-resrc) and add a conditional based on that switch. I have a slight preference to the second because even under LOG_DEBUG, this might produce too much data at large scale... Thoughts?

lipari commented 8 years ago

Good catch @lipari. I can see two ways to address this. 1) perhaps use log_get_level and add a conditional before calling dump_resrc_state() or 2) actually add a new switch to the schedsrv itself (e.g., debug-resrc) and add a conditional based on that switch. I have a slight preference to the second because even under LOG_DEBUG, this might produce too much data at large scale... Thoughts?

I guess I prefer 1). There are only two places that need the clause and it would be simple. Not knowing exactly where the new case would go, I fear that it would introduce unnecessary obfuscation.

dongahn commented 8 years ago

@lipari: I see your point, but my concern was this routine would print a large volume of debug prints with a large-scale resrc configuration, which can dwarf other debug statements (prints coming out of LOG_DEBUG)...

I could actually get rid of the calling statements of dump_resrc_state () once and for all, and only sprinkle this to certain points in the code when I want to debug a resrc tree for my development. My worry was, though, when there is no invocation to this static function, Travis compilers won't like it so the preference was 2).

Sorry that I've created the situation in the first place. If there are different LOG_DEBUG levels, I could also enable this only at the highest level in LOG_DEBUG...

lipari commented 8 years ago

@dongahn, I defer to your judgement on how to rework the debug output.

dongahn commented 8 years ago

Thanks @lipari. I'm looking at it right now.

dongahn commented 8 years ago

@lipari: I've added some more debug prints to debug why jsc and waitjob tests are failing on Travis. I believe it is failing because hwloc reader mode somehow truncates the hostname coming out of hwloc in building the resrc tree.

[1453436709.295181] sched.info[0]: hostname from hwloc testing-worker-linux-docker-006027f9-3209-linux-3

<CUT> 

[1453436709.295248] sched.info[0]: resrc type: node, path: /cluster/testing-worker-linux-docker-6027, name: testing-worker-linux-docker-, digest: 82E447AD74E9D97BE1BC5E4D194D642F03901B7D, id: 6027, state: invalid, uuid: f1afe77e-8434-4230-a6d5-54616a287a64, size: 1, avail: 1

Since the tests are run under a docker container in Travis, the hostname turns out to be pretty long and it seems there is a logic somewhere that splits the hostname into "name" and "id" for an resrc object, which doesn't quite work well for such a convoluted hostname.

The reason this problem only shows up now with my PR is that this PR discards the rdl-specifed resrc tree if it doesn't match with the runtime hwloc data and rebuilds the resrc tree with the hwloc data. So this is the first time where we tested jsc and waitjob tests with hwloc on Travis...

Other new rs2rank tests work because they use fake hwloc xml files taken from cab and thus have well-formed hostname.

dongahn commented 8 years ago

This probably is this code in resrc.c:

 675 static resrc_t *resrc_new_from_xml (xmlNodePtr nodePtr, resrc_t *parent,
 676                                     const char *sig)

<CUT>

 716         if (name) {
 717             /*
 718              * Break apart the hostname to fit the Flux resource model:
 719              * generic name + id
 720              */
 721             for (c = name; *c; c++) {
 722                 if (isdigit(*c)) {
 723                     id = strtol (c, NULL, 10);
 724                     *c = '\0';
 725                     break;
 726                 }
 727             }

Getting the id of a node from the numeric suffix might not be as robust as we want... Perhaps in hwloc reader mode, we should just assign the hostname as a whole the "name" and somehow choose an unique integer as the id? Since the schedsvr fetches the hwloc data in increasing rank order, id can be the equal to the rank in this reader mode?

BTW, how is the id used anyway in flux resource model?

dongahn commented 8 years ago

@lipari: I believe I've addressed all of your great review points above (pushed). For better Travis debugging, I've also registered a log print to the after_failure hook. More of less, the same logic as flux-core.

This will still fail Travis tests because of the hostname issue above. Let's chat about this when you have some time!

lipari commented 8 years ago

Looks good. The model I'm following is the host+id naming convention we follow in our center which was formalized in https://github.com/flux-framework/rfc/blob/master/spec_4.adoc. Let's talk tomorrow.

garlick commented 8 years ago

The model I'm following is the host+id naming convention we follow in our center which was formalized in RFC4

I don't think that's in there is it? It seems like a problem if it is!

lipari commented 8 years ago

The model I'm following is the host+id naming convention we follow in our center which was formalized in RFC4

I don't think that's in there is it? It seems like a problem if it is!

RFC 4 currently states:

"ID (optional numeric ID to be appended to basename to get name)"

This is the inspiration behind the code snippet @dongahn quoted above. Our resource model calls for an ID, and when looking at a node's resources, the hwloc library provides the "os_index" which conveys the physical index number or each resident resource. This is what the resrc.c code uses to populate the ID field of the resrc_t struct for resources on a node (aka hwloc "machine"). This value does not naturally exist for the node (hwloc machine) itself, so I used the next best thing of grabbing the (trailing) numerical portion of the hostname itself - which you see in the code snippet above.

As @dongahn pointed out, when using only the hwloc reader, the id will always be 0, so we probably need to artificially assign something that allows the tests to proceed.

garlick commented 8 years ago

Flux should not break at sites that choose a different naming scheme. That was my main point.

dongahn commented 8 years ago

@garlick: I agree. @lipari, @grondo and I actually had a chat this morning and we're working on a fix -- essentially a variant of my proposal above. I will be testing @lipari's patch soon.

dongahn commented 8 years ago

@grondo: the last test commit suggests that the latest Travis failure occurred because flux somehow couldn't invoke flux-waitjob which resides under sched directory.

Oddly, this seems to only occur under Travis, and as I recall this used to work fine before. I know this was the one of the latest changes you have made. Are there any other options/environments I need to pass to flux to be able to invoke a third-party flux command like waitjob? (in particular under Travis environment?)

For now, when I used the absolute path to flux-waitjob, Travis got back to green so I have a workaround.

flux -v --exec-path /home/travis/build/dongahn/flux-sched/sched waitjob -s wo.st.1 -c wo.end.1 1
flux: `waitjob' is not a flux command.  See 'flux --help'

The larger testing log is available at https://travis-ci.org/dongahn/flux-sched/jobs/104214706

dongahn commented 8 years ago

@lipari: Travis is green again w/ my work around. I will make a minor adjustment to accommodate your hostname fix and move the waitjob invocation work-around commit ahead of rs2rank for this PR.

dongahn commented 8 years ago

OK. I pushed all the changes I wanted to make.

lipari commented 8 years ago

Yay!!! Enjoy your weekend!

dongahn commented 8 years ago

Thanks! The same to you. BTW, it is pretty neat if you have something like the following in the commit msg,

"Close #Issue-number", the issues ticket is automatically closed on merging.