flux-framework / flux-sched

Fluxion Graph-based Scheduler

sched gets a "complete" event, but leaves state==submitted #58

Closed: trws closed this issue 8 years ago

trws commented 8 years ago

I'm not sure how this is happening, but when using sched with cap, I'm noticing it works great for the first 15 or so jobs, and then I stop getting update events from the system. The funny thing about this is that sched seems to get the events, because it marks complete-time in the kvs, but somehow doesn't change the state. The contents of the kvs directory of a task where this has happened look like this:

lwj.75.cmdline = [ "hostname" ]
lwj.75.ntasks = 1
lwj.75.nnodes = 1
lwj.75.environ = { ... }
lwj.75.cwd = /g/g12/scogland/projects/flux/capacitor
lwj.75.create-time = 2015-09-02T10:37:04
lwj.75.rdl = { "cab2": { "socket0": { "core0": "core" } } }
lwj.75.rank.
lwj.75.starting-time = 2015-09-02T10:37:05
lwj.75.running-time = 2015-09-02T10:37:05
lwj.75.0.
lwj.75.complete-time = 2015-09-02T10:37:05
lwj.75.state = submitted
trws commented 8 years ago

I should note, once a number of these pile up, sched stops accepting new submissions and jobs cease getting run, sitting at their initial submitted state indefinitely.

grondo commented 8 years ago

I think wrexecd sets those silly <state>-time entries:

https://github.com/flux-framework/flux-core/blob/master/src/modules/wreck/wrexecd.c#L866

Perhaps the complete state is being set but then the job gets switched back to submitted by something?

dongahn commented 8 years ago

We should look at this. Could you set up a repro?

dongahn commented 8 years ago

BTW, it will be GREAT if we also discover and resolve all the problems within flux-sched when it is stressed with cap as well! One of my worries about sched is that we don't have such stress test cases, and this really seems like an opportune time. Off to a meeting for now :-(

grondo commented 8 years ago

Heh, using cap to run the (or a) testsuite is a neat idea...

dongahn commented 8 years ago

This is as if two submits get the same jobid. What submit method is used?

Probably some race. Once we have a repro it seems worthwhile trying the new reliable jsc.

trws commented 8 years ago

Maybe with ats. :P

I really agree though. Once it gets more stable that's a good use case for it.

For this issue, it looks to me like what @grondo pointed out is spot on. I had noticed states oscillating at one point between submitted and reserved without settling down. I was wondering if this was related, maybe a sched state change that only gets committed after wreck is already done, so it overwrites the state? This is another place where having a coherent kvs append would be awesome...


grondo commented 8 years ago

Yes, the idea of appending states to a log would really help to see what is going on here....
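
As a concrete illustration of the append-a-log idea, the kvs entries could look roughly like this (key names here are purely hypothetical, not an existing scheme):

lwj.75.state-log.0 = reserved
lwj.75.state-log.1 = submitted
lwj.75.state-log.2 = allocated
lwj.75.state-log.3 = runrequest
lwj.75.state-log.4 = starting
lwj.75.state-log.5 = running
lwj.75.state-log.6 = complete

With a log like this, an out-of-order or overwritten transition would be visible, instead of silently replacing the single lwj.75.state entry.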

lipari commented 8 years ago

Is there any chance this problem will go away once https://github.com/flux-framework/flux-core/pull/386 is merged?

trws commented 8 years ago

I've been trying to avoid it so as not to step on any toes, and to get direct milestone features done, but I'm about at the point where I want to just implement initial support for that in the KVS. It's popping up in enough places that it keeps bumping up my priority list.


trws commented 8 years ago

That's a really good point. It might.

Question though, does sched directly modify the state entry at any point, or does it let wreck handle it? If sched does and it's depending on JSC updates this might all make sense.


dongahn commented 8 years ago

Before moving on to a new solution, I would love to understand this problem better with a reproducer. Like @lipari said, reliable jsc might help. And there could be a legitimate race that causes two consecutive submits to get the same jobid.

dongahn commented 8 years ago

Sched no longer modifies the kvs state directly but does it through JSC. flux-submit does update the state directly, but that's synchronized with the rest of the system because this is done in a lock-step fashion: schedsrv won't issue runrequest until the submitted event is delivered to it through JSC.

dongahn commented 8 years ago

@trws: my guess is you don't use flux-submit in cap to submit the jobs, but you have your own command to submit the ats jobs rapidly, right? This is okay; I just want to understand the testing environment as to how you hook cap to sched a bit better.

dongahn commented 8 years ago

I think I found one problem. Since wreckrun is used in cap and it can emit the "pending" state, this can confuse schedsrv. schedsrv uses "pending" (J_PENDING) as an internal job state which it never emits back to the kvs, so observing the same "pending" coming from outside will definitely confuse its finite state machine.

Since flux-sched never used wreckrun to launch a job, this is a new case to be tested and corrected.

I expect lots of problems like this as we do wider integration of software; cap and the dynamic scheduling project seem like a perfect opportunity to harden flux-sched as well.
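
To make the collision concrete, here is a minimal sketch of a string-to-state translation in which an externally written "pending" lands on a state the scheduler treats as internal-only. The enum values and function are illustrative only (state names are taken from this thread), not the actual schedsrv code:

#include <string.h>

typedef enum {
    J_NULL, J_RESERVED, J_SUBMITTED,
    J_PENDING,    /* internal: the scheduler never writes this to the kvs */
    J_SCHEDREQ,   /* internal */
    J_ALLOCATED, J_RUNREQUEST, J_STARTING, J_RUNNING, J_COMPLETE
} job_state_t;

static job_state_t state_from_string (const char *s)
{
    if (strcmp (s, "reserved") == 0)  return J_RESERVED;
    if (strcmp (s, "submitted") == 0) return J_SUBMITTED;
    /* An external writer (e.g. wreckrun) storing "pending" maps onto an
     * internal-only state here, producing a transition the FSM never emits. */
    if (strcmp (s, "pending") == 0)   return J_PENDING;
    if (strcmp (s, "running") == 0)   return J_RUNNING;
    if (strcmp (s, "complete") == 0)  return J_COMPLETE;
    return J_NULL;
}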

trws commented 8 years ago

I'm not quite sure what you mean, Dong. Cap uses the wreck service to launch jobs only if it is using its own scheduler. When using the sched backend, it uses the job module to create the job, then sets it to the submitted state, just like flux-submit does. Well, unless I messed something up, which is also possible. It relies on flux-sched to invoke the job once it finds available resources.



dongahn commented 8 years ago

Oh well, sorry. I need to understand how cap works with the sched backend better then. I do think it is very useful at this point to understand who emits job states. So far, I see these:

  • flux-sched: flux-submit
  • flux-core: flux-wreckrun
  • flux-core: wrexec
  • flux-core: job
  • flux-capacitor

Anything else?

We probably want to make sure we use job state events consistently across these users and these emitters.

dongahn commented 8 years ago

Also, if I'm not mistaken, it seems schedsrv doesn't map an allocated hostname to a specific broker rank, so it will simply use the first k ranks to satisfy -N k... This should also be looked at and get an automatic test case.

I'm still lost as to why the state has been overwritten to submitted (if that is indeed the case). schedsrv doesn't write this particular state for a job... Once the reliable jsc PR gets merged, we will probably have a cleaner slate to look into this.

dongahn commented 8 years ago

Probably related. On the latest branch (reliable jsc), I created a simple stress test case:

#!/bin/bash

for i in {1..5}
do
   /nfs/tmp2/dahn/FLUXDEV/flux-core/src/cmd/flux -M/nfs/tmp2/dahn/FLUXDEV/flux-sched/sched -C"/nfs/tmp2/dahn/FLUXDEV/flux-sched/rdl/?.so" -L"/nfs/tmp2/dahn/FLUXDEV/flux-sched/rdl/?.lua" -x/nfs/tmp2/dahn/FLUXDEV/flux-sched/sched submit -N3 -n3 sleep 1

<CUT>

done

At exactly the 100th job, the scheduler stops running the program:

lt-flux-broker: [1441263018.616261] sched.debug[0] job 100 runrequest
lt-flux-broker: [1441263018.616348] sched.debug[0] attempting job 74 state change from running to complete
lt-flux-broker: [1441263018.616428] sched.debug[0] attempting job 100 state change from allocated to runrequest
lt-flux-broker: [1441263018.618993] lwj.99.debug[2] lwj.99: node2: basis=2
lt-flux-broker: [1441263018.619779] lwj.99.info[2] lwj 99: node2: nprocs=1, nnodes=3, cmdline=[ "sleep", "1" ]
lt-flux-broker: [1441263018.620010] lwj.99.debug[2] reading lua files from /nfs/tmp2/dahn/FLUXDEV/flux-core/src/modules/wreck/lua.d/*.lua
lt-flux-broker: [1441263018.622384] lwj.99.info[1] lwj 99: node1: nprocs=1, nnodes=3, cmdline=[ "sleep", "1" ]
lt-flux-broker: [1441263018.622767] lwj.99.debug[1] reading lua files from /nfs/tmp2/dahn/FLUXDEV/flux-core/src/modules/wreck/lua.d/*.lua
lt-flux-broker: [1441263018.625809] lwj.99.debug[2] task0: pid 200683 (sleep): started
lt-flux-broker: [1441263018.628257] lwj.99.debug[1] task0: pid 200690 (sleep): started
lt-flux-broker: [1441263018.629997] lwj.99.debug[0] updating job state to running
lt-flux-broker: [1441263018.632618] sched.debug[0] attempting job 99 state change from starting to running
lt-flux-broker: [1441263018.633606] job.info[0] got request job.create
lt-flux-broker: [1441263018.634594] job.info[0] Setting job 101 to reserved
lt-flux-broker: [1441263018.638874] sched.debug[0] attempting job 101 state change from null to null
lt-flux-broker: [1441263018.638957] sched.debug[0] attempting job 101 state change from null to reserved
lt-flux-broker: [1441263018.640427] sched.debug[0] attempting job 101 state change from reserved to submitted
lt-flux-broker: [1441263018.640883] job.info[0] got request job.disconnect
lt-flux-broker: [1441263018.641141] sched.debug[0] extract lwj.101.nnodes: 3
lt-flux-broker: [1441263018.641291] sched.debug[0] extract lwj.101.ntasks: 3
lt-flux-broker: [1441263018.642016] lwj.100.debug[0] initializing from CMB: rank=0
lt-flux-broker: [1441263018.642610] lwj.100.debug[0] lwj.100: node0: basis=0
lt-flux-broker: [1441263018.642890] lwj.100.emerg[0] Failed to get resources for this node
lt-flux-broker: Error reading status from rexecd: Success
lt-flux-broker: [1441263018.644250] lwj.100.debug[2] initializing from CMB: rank=2
lt-flux-broker: [1441263018.645013] lwj.100.emerg[2] Failed to get ncores for node0
lt-flux-broker: Error reading status from rexecd: Success
lwj.99.cmdline = [ "sleep", "1" ]
lwj.99.nnodes = 3
lwj.99.cwd = /nfs/tmp2/dahn/FLUXDEV/flux-sched/sched
lwj.99.ntasks = 3
lwj.99.environ = { <CUT>
lwj.99.create-time = 2015-09-02T23:50:18
lwj.99.rdl = { "hype240": { "socket0": { "core0": "core" } }, "hype241": { "socket0": { "core0": "core" } }, "hype242": { "socket0": { "core0": "core" } } }
lwj.99.rank.0.cores = 1
lwj.99.rank.1.cores = 1
lwj.99.rank.2.cores = 1
lwj.99.starting-time = 2015-09-02T23:50:18
lwj.99.running-time = 2015-09-02T23:50:18
lwj.99.0.procdesc = { "command": "sleep", "pid": 200679, "nodeid": 0 }
lwj.99.0.exit_status = 0
lwj.99.0.exit_code = 0
lwj.99.0.stdout.000000 = { "eof": true }
lwj.99.0.stderr.000000 = { "eof": true }
lwj.99.1.procdesc = { "command": "sleep", "pid": 200690, "nodeid": 1 }
lwj.99.1.exit_status = 0
lwj.99.1.exit_code = 0
lwj.99.1.stdout.000000 = { "eof": true }
lwj.99.1.stderr.000000 = { "eof": true }
lwj.99.2.procdesc = { "command": "sleep", "pid": 200683, "nodeid": 2 }
lwj.99.2.exit_status = 0
lwj.99.2.exit_code = 0
lwj.99.2.stdout.000000 = { "eof": true }
lwj.99.2.stderr.000000 = { "eof": true }
lwj.99.state = complete
lwj.99.complete-time = 2015-09-02T23:50:19

lwj.100.cmdline = [ "sleep", "1" ]
lwj.100.nnodes = 3
lwj.100.cwd = /nfs/tmp2/dahn/FLUXDEV/flux-sched/sched
lwj.100.ntasks = 3
lwj.100.environ = {  <CUT> }
lwj.100.create-time = 2015-09-02T23:50:18
lwj.100.rdl = { "hype243": { "socket0": { "core0": "core" } }, "hype244": { "socket0": { "core0": "core" } }, "hype245": { "socket0": { "core0": "core" } } }
lwj.100.rank.0.core = 1
lwj.100.rank.1.core = 1
lwj.100.rank.2.core = 1
lwj.100.state = runrequest

Several interesting observations:

trws commented 8 years ago

To go back a couple of comments: at least as I understand it, this is the full set of who sets which job states:

  • Create job in KVS, initially set pending: flux-core: job

  • Set job state to each stage in the running order: flux-core: wrexec

  • Submit the job to sched, currently done by setting the state to "submitted" in the kvs, because we don't have a proper job submission API: flux-sched: flux-submit; flux-capacitor (with -S sched)

  • In the non-sched case, these do not directly set the state for a job, but they cause it to be set by wreck: flux-core: flux-wreckrun; flux-capacitor (without -S sched)

As to the most recent, you might be seeing one part of the same problem @dongahn. Remember I mentioned it would stop accepting jobs at some point, so that lines up with what I was seeing. If you want to try exactly the case I was looking at, you can use the file in this gist.

dongahn commented 8 years ago

@trws great!

Create job in KVS, initially set pending: flux-core: job

You didn't mean to say the job module sets "pending," right? If it did, this would confuse the FSM of sched.

trws commented 8 years ago

I thought that's what it set it to, but I was wrong; it's set to reserved. That's interesting to me. @lipari, am I reading the code right that when the job is created, sched immediately picks it up rather than waiting for the state to be set to submitted? At least in terms of attempting fill_resource_req, I mean? Actually... there's also no break between those switch cases, so if I'm reading this right, when the job is created it is immediately reserved and scheduled without waiting for the user to ask?

lipari commented 8 years ago

@trws, that's probably wrong. I'd have to search back through the code to discover how fill_resource_req wound up under the "reserved" case. In any case, there really should be no action on the part of the scheduler until the "submitted" state is reached. The "reserved" state should probably be a no-op and have a break, not fall through. Let me test this theory...

dongahn commented 8 years ago

I suspect the fall-throughs are OK. When the scheduler sees "submitted" while the job's current state is "reserved", it will call fill_resource_req as an action and then fall through a set of internal state changes. Until the scheduler plugin ends up selecting matching RDL resources, the FSM will move the state between J_PENDING and J_SCHEDREQ. (These are all internal scheduler job states which are not visible in lwj.<id>.state.)
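
For readers following along, here is a rough, compilable sketch of the intentional fall-through pattern being described. The function names and the exact case layout are illustrative; the real schedsrv action handler is organized differently:

#include <stdio.h>

typedef enum { J_RESERVED, J_SUBMITTED, J_PENDING, J_SCHEDREQ } job_state_t;

/* Stand-ins for the real scheduler actions, for illustration only. */
static void fill_resource_req (int id) { printf ("job %d: build resource request\n", id); }
static void try_schedule (int id)      { printf ("job %d: try to match RDL resources\n", id); }

static void action (int id, job_state_t state)
{
    switch (state) {
    case J_RESERVED:            /* job created; nothing to do until submit */
        break;
    case J_SUBMITTED:
        fill_resource_req (id);
        /* deliberate fall through into scheduling */
    case J_PENDING:
    case J_SCHEDREQ:
        try_schedule (id);      /* job stays in an internal pending state
                                   until resources are matched */
        break;
    }
}

int main (void) { action (75, J_SUBMITTED); return 0; }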

trws commented 8 years ago

Yeah, I think I was reading that wrong; it's going based on oldstate rather than newstate. That still raises the question of how wreck could run the job and yet leave the state set to submitted, however. =/

lipari commented 8 years ago

Apologies to you both. I read the code too quickly. There does appear to be a problem with the latest mods. My tests are failing. I am currently looking into why...

lipari commented 8 years ago

My running theory is that with the new changes, flux-submit is bypassing the jsc and setting the submitted state directly in the kvs. This flies under the radar and the sched never gets the state change to submitted.

dongahn commented 8 years ago

@lipari: Could you please look at the new PR, which should remove all the concerns? It would be nice to do this investigation at that revision level (with reliable jsc). The test results above were done at that level.

dongahn commented 8 years ago

If this PR solves this inconsistent state change issue, then the next thing to look at will be the "Error reading status from rexecd: Success" issue:

lt-flux-broker: [1441263018.641141] sched.debug[0] extract lwj.101.nnodes: 3
lt-flux-broker: [1441263018.641291] sched.debug[0] extract lwj.101.ntasks: 3
lt-flux-broker: [1441263018.642016] lwj.100.debug[0] initializing from CMB: rank=0
lt-flux-broker: [1441263018.642610] lwj.100.debug[0] lwj.100: node0: basis=0
lt-flux-broker: [1441263018.642890] lwj.100.emerg[0] Failed to get resources for this node
lt-flux-broker: Error reading status from rexecd: Success
lipari commented 8 years ago

@dongahn, Looks like my theory was correct and you have a fix already in place. So, now I will review the remaining commits to https://github.com/flux-framework/flux-sched/pull/59 and then return to this issue.

grondo commented 8 years ago

"Error reading status from rexecd" is directly due to no resources being assigned to the rank or ranks, and fatal termination of wrexecd.

dongahn commented 8 years ago

@grondo Great info! Then I understand this problem completely! It is an undersized buffer in jsc: the buffer is too short to hold lwj.100.rank.0.cores. Note the last character (s) missing at job #100:

lwj.100.rank.0.core = 1
lwj.100.rank.1.core = 1
lwj.100.rank.2.core = 1

I will create a patch shortly and run this stress test some more. Assuming this fixes this particular issue, the two potential issues to look at:

  • How was FCFS able to schedule and issue a large number (99) of one-second sleep jobs so quickly before hitting this failure?
  • The absence of a hostname-to-rank mapping.
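
The truncation mode is easy to demonstrate in isolation. The buffer size below is chosen only to reproduce the symptom (it is not the size used in jsc): a key buffer with room for two-digit job ids silently drops the trailing 's' once the id reaches three digits:

#include <stdio.h>

int main (void)
{
    for (int id = 99; id <= 100; id++) {
        char key[20];   /* just enough for "lwj.99.rank.0.cores" plus NUL */
        snprintf (key, sizeof (key), "lwj.%d.rank.0.cores", id);
        printf ("%s\n", key);   /* id 100 prints "lwj.100.rank.0.core": 's' truncated */
    }
    return 0;
}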

dongahn commented 8 years ago

Meant to say sorry.

dongahn commented 8 years ago

BTW, do we have a PATH_MAX-equivalent macro for KVS paths (e.g., KVS_PATH_MAX)?

garlick commented 8 years ago

There are no limits on KVS key or value sizes. Well, not hardwired anyway.

dongahn commented 8 years ago

OK, I issued a PR (#391) to flux-core for this. If @lipari wants to try this patch before it's merged, it's available in my jsc_keylen_fix branch. My brief testing shows this seems to fix the "failure at the 100th job" issue above. (Now I need to move on to a different project, unfortunately.)

lipari commented 8 years ago

Thanks, @dongahn. I'll take a look.

trws commented 8 years ago

@dongahn, so you know, we have the hostname and interface information for every rank, so we can disambiguate them when/if we want, but for now it seemed reasonable to leave it be. At some point we may want to have that be an option, but I could see cases where users would want two brokers, each in a different container, on the same node, in which case they actually should be separate. (this is part of why resource_hwloc leaves them separate).


dongahn commented 8 years ago

@trws thank you for the info, and I think I understand. The issue, however, is that even if you launch 4 brokers and then submit 2-node jobs back to back, sched as it is will always map these 2-node jobs onto the first two ranks, never using the other two ranks.

For testing purposes, I think we may need to generate an RDL where the node name is simply the rank, and then modify sched to construct a correct rdl_contain JCB before updating it via jsc.

trws commented 8 years ago

Ah, I didn't realize it was leaving nodes unused... that's clearly broken. Does this happen even if you specify the RDL as having one "node" for each rank? That's what I was doing, but I didn't think to check if sched was actually using all of the resources.

dongahn commented 8 years ago

I think so. I think the general problem may be that sched doesn't use some of the rdl info for launching/execution, which is understandable because we have developed each sched component separately without many of the test cases needed for integration testing. But I think this is the right time.

What would be ideal is to generate an rdl from resource.hwloc for the bootstrap case and then correct sched to actually use the rdl info fully for scheduling. This should be safe because the generated rdl would be consistent with the instance.

trws commented 8 years ago

I agree it's time, but it feels unfortunate to generate it like that. Generating lua code from kvs values just to load them feels... off. As a temporary fix I suppose it can't hurt, and it wouldn't be that hard to generate, but I would say let's have it be an external script or something we use until we either get a real resource query interface or at least pick a long-term resource/spec file format.

dongahn commented 8 years ago

An external script as a temporary solution sounds great. Much of this is much-needed experience for being able to design and refactor the software for the production versions anyway...

grondo commented 8 years ago

In the current design, at least, when associating resources with a program there must be a bit of rdl config written to the job's kvs, so that the child instance can pick up that bit of config for its own instance. Only the bootstrap-type instances should read config directly from a file.

The "rdl config" could be any form of serialized rdl -- certainly not the RDL.lua config language, but any kind of serialized rdl that could be read by the child scheduler.

I actually haven't looked at the sched internal representation of resources much, but for now would it make sense to have multiple "config readers" -- the existing one for the lua config language, one for walking hwloc info from the kvs in a test bootstrap instance, and one for reading a simplified rdl from the kvs in child programs?

This allows extensions later as well where new resource configuration schemes could be introduced along with a sched "plugin" to read them. However, the internal sched representation and serialized format remain the "common" language that is read by child instances.
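
A minimal sketch of what such a pluggable reader setup could look like. Every name here is hypothetical; flux-sched has no such interface today, and this only illustrates selecting a reader at load time while keeping one internal representation:

#include <stddef.h>
#include <string.h>

struct resources;                       /* sched's internal representation */

struct rdl_reader {
    const char *name;                   /* which reader was requested */
    struct resources *(*load) (const char *source);
};

/* Stub loaders standing in for real implementations. */
static struct resources *load_lua (const char *path)        { (void)path; return NULL; }
static struct resources *load_hwloc_kvs (const char *key)   { (void)key;  return NULL; }
static struct resources *load_serialized (const char *key)  { (void)key;  return NULL; }

static const struct rdl_reader readers[] = {
    { "lua",        load_lua },         /* bootstrap: lua RDL config file */
    { "hwloc",      load_hwloc_kvs },   /* test bootstrap: walk hwloc info in the kvs */
    { "serialized", load_serialized },  /* child instance: serialized rdl written by the parent */
};

static const struct rdl_reader *find_reader (const char *name)
{
    for (size_t i = 0; i < sizeof (readers) / sizeof (readers[0]); i++)
        if (strcmp (readers[i].name, name) == 0)
            return &readers[i];
    return NULL;
}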

Also see the need for this in #60.

(I may be missing the point here though. I don't really understand @trws comment about generating lua code from kvs values..)

trws commented 8 years ago

I agree with all of that, @grondo. Being pretty out of it last night I didn't put it well, but all I meant was that taking something we already had in the KVS and turning it into lua code that the current RDL loader could ingest felt like a step backward rather than forward.

I was going to bring the serialization format up at some point as well. After the discussions on resource/job-specs, I've been finding that I rather like YAML as a human-readable format (I originally grabbed it because it takes less typing, but it seems to help), and it turns out it has built-in support for explicit typing and for back-references that represent pointers to common objects at different points in the tree, as well as local inheritance and specialization. That might be a good way to set up at least the human-readable RDL serialization, since it can directly represent all the information we need, including non-hierarchical links. It's also a proper superset of JSON (well, it is now anyway), so we could keep backwards compatibility everywhere if we just accept it wherever we take JSON...
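
To illustrate the back-reference feature: a core that appears in two resource hierarchies could be defined once and aliased, roughly like this (a made-up sketch with invented resource names, not a proposed RDL schema):

# Hypothetical sketch only; names are invented for illustration.
cab2_socket0_core0: &core0      # anchor: define the resource once
  type: core
  id: 0

by_node:                        # physical containment hierarchy
  cab2:
    socket0:
      core0: *core0             # alias: a reference, not a copy

by_power_domain:                # a second hierarchy referencing the same core
  pdu1:
    core0: *core0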

lipari commented 8 years ago

The need for multiple resource readers was identified and entered as https://github.com/flux-framework/flux-sched/issues/33. The scheduler currently serializes the allocated resources in the json format. It looks like this in the KVS:

lwj.1.rdl = { "ipa1": { "socket0": { "core0": "core" } }, "ipa2": { "socket0": { "core0": "core" } }, "ipa3": { "socket0": { "core0": "core" } } }
grondo commented 8 years ago

@trws, yeah the references feature of yaml would be great for rdl, especially if we serialize multiple hierarchies/graphs of the same resources. (or resources could always be referenced by uuid)

grondo commented 8 years ago

@lipari: That looks pretty good. It could be a starting point for configuration language for a new sched instance? (Seems like parent information up to root is missing, but easily added?)

lipari commented 8 years ago

From the discussion so far, how about three options for the flux module load sched rdl-conf=?

With one of these three options available, I'm not sure about the need for supporting a serialized option to the rdl-conf= option.

lipari commented 8 years ago

Further thought... Is it safe to assume that the contents of the second and third options above should be the same? I.e., will (should?) the resource.hwloc eventually be pared down to just those resources the instance is running in?