flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

simulator: use future-based RPC API #246

Closed garlick closed 6 years ago

garlick commented 6 years ago

This just updates the single RPC in flux-sched for the API changes proposed in flux-framework/flux-core#1089.

Travis should be re-run after that gets merged.

dongahn commented 6 years ago

LGTM. @SteVwonder?

SteVwonder commented 6 years ago

Yep. LGTM too once travis gives the ok.

grondo commented 6 years ago

Restarted build since flux-core updates required here are now merged.

garlick commented 6 years ago

Apologies! Missed one. Pushing update.

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-12.4%) to 61.134% when pulling 85a1b49ef1da645229b97a670d3ad2830a7ae373 on garlick:rpc_future into 2b0f5c56e183fcd52549451d56ab758ffbead5c6 on flux-framework:master.

garlick commented 6 years ago

Sorry for the false starts here! I'm debugging these on my desktop right now but need to head out to lunch so will pick it up when I get back:

PASS: t2000-fcfs.t 1 - sim: started successfully
FAIL: t2000-fcfs.t 2 - sim: scheduled and ran all jobs
PASS: t2000-fcfs.t 3 - jobs scheduled in correct order
ERROR: t2000-fcfs.t - exited with status 1
PASS: t2001-fcfs-aware.t 1 - sim: started successfully
FAIL: t2001-fcfs-aware.t 2 - sim: scheduled and ran all jobs
PASS: t2001-fcfs-aware.t 3 - jobs scheduled in correct order
ERROR: t2001-fcfs-aware.t - exited with status 1
PASS: t2002-easy.t 1 - sim: started successfully
FAIL: t2002-easy.t 2 - sim: scheduled and ran all jobs
FAIL: t2002-easy.t 3 - jobs scheduled in correct order
ERROR: t2002-easy.t - exited with status 1

SteVwonder commented 6 years ago

Yeah, unfortunately, the simulator doesn't have any unit tests, so my apologies for the unhelpful error/test messages. It is very odd that the scheduling/running of jobs failed but, for some of the tests, the jobs were scheduled in the correct order. Not sure how the order diff can be correct when the previous test fails. It is also very odd that the code coverage dropped by 12%. Maybe the RPC is failing and the sim/sched are exiting prematurely?

codecov-io commented 6 years ago

Codecov Report

Merging #246 into master will decrease coverage by 17.3%. The diff coverage is 0%.

@@             Coverage Diff             @@
##           master     #246       +/-   ##
===========================================
- Coverage   69.63%   52.32%   -17.31%     
===========================================
  Files          25       25               
  Lines        5246     5248        +2     
===========================================
- Hits         3653     2746      -907     
- Misses       1593     2502      +909

Impacted Files             Coverage Δ
simulator/simulator.c      1.04% <0%> (-77.2%)
simulator/simsrv.c         0% <0%> (-73.92%)
simulator/submitsrv.c      0% <0%> (-84.31%)
simulator/sim_execsrv.c    0% <0%> (-85.28%)
sched/sched_backfill.c     14.37% <0%> (-76.48%)
sched/rsreader.c           75.18% <0%> (-21.17%)
sched/rs2rank.c            79.13% <0%> (-14.79%)
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2b0f5c5...0b82d07.
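
The percentages in the coverage diff follow directly from the hit and line counts in the report. A small check using only those numbers (the function name is illustrative and not part of Codecov):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return line coverage as a percentage of total lines."""
    return 100.0 * hits / lines

# Hits and Lines as reported for master and for #246.
master = coverage_pct(3653, 5246)
pr = coverage_pct(2746, 5248)

print(f"master {master:.2f}%  #246 {pr:.2f}%  delta {pr - master:+.2f}")
```

Since only two lines were added, nearly all of the drop comes from the 907 lines that were hit on master and are no longer hit, which fits simulator tests that stop running early.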

garlick commented 6 years ago

I pushed a couple more commits: one to work with newer czmq, and one to fix a bug where an incorrect key was used to pull the start time out of the KVS (please review cfa0bc6, @SteVwonder), but didn't get to the bottom of why jobs aren't running. We're hitting the 60s timeout on timed_sync_wait_job and it doesn't look like anything has started.

Will keep looking. I'll revert the RPC change in flux-core if I can't find this by tomorrow.

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-12.4%) to 61.134% when pulling cfa0bc68fc3d23c9b36e8d76ff11f125d417dffb on garlick:rpc_future into 2b0f5c56e183fcd52549451d56ab758ffbead5c6 on flux-framework:master.

SteVwonder commented 6 years ago

@garlick, this key should probably have a better, less ambiguous name, but when running with the simexec module loaded, job start times (w.r.t. simulation time) are stored in "starting_time" and the wallclock time is stored in "starting-time". For this test, I believe we want the simulation time, so "starting_time" should be correct.

Let me grab this PR and give it a whirl before you revert changes in flux-core.
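
To make the two key names concrete, here is a minimal sketch. The job record is a plain dictionary standing in for the KVS entry, and the values and the `start_time` helper are hypothetical; only the two key names come from the comment above.

```python
# Hypothetical job record; only the two key names are taken from the thread.
job = {
    "starting_time": 120.0,         # simulation time (underscore)
    "starting-time": 1497031200.0,  # wallclock time (hyphen)
}

def start_time(record: dict, simulated: bool = True) -> float:
    """Pick the start time that matches the clock the caller cares about."""
    key = "starting_time" if simulated else "starting-time"
    return record[key]

print(start_time(job))                   # simulation time
print(start_time(job, simulated=False))  # wallclock time
```

Routing every lookup through one helper like this keeps the near-identical spellings in a single place, which is one way to avoid the mix-up until the keys get less ambiguous names.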

garlick commented 6 years ago

Ah sorry about that.

garlick commented 6 years ago

On my desktop I rolled flux-core back to just before the RPC changes in flux-framework/flux-core#1089, and the same tests are failing on flux-sched master. Do we have a known good flux-core? If not, I'll bisect.
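
For reference, bisecting is a binary search over an ordered history. A minimal sketch, assuming one known-good commit at the start, one known-bad commit at the end, and that every commit after the first bad one is also bad (the commit names below are placeholders, not real flux-core commits):

```python
from typing import Callable, Sequence

def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and the history
    switches from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]

# Placeholder history for illustration only.
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
bad = {"c3", "c4", "c5"}
print(first_bad(history, lambda c: c in bad))
```

`git bisect` automates the same search, but it needs a known-good starting point, which is why the question above matters.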

SteVwonder commented 6 years ago

No worries. I am sorry I chose such a poor name.

I do not know. Maybe @dongahn or @lipari knows?

garlick commented 6 years ago

Checkpointing some results:

So it seems this goes back a ways. Not sure whether the Apr 4 - May 25 failure is the same root cause, or maybe something that didn't show up in travis since the last change merged to flux-sched (Apr 13) presumably passed with flux-core master at the time.

garlick commented 6 years ago

Presumably the flux-core that built in travis against current flux-sched master would have been at PR 1032 (Apr 13). This fails for me as well using flux-sched master (Apr 13) in the same way as the May 25 failure described above.

flux-start: warning: setting --bootstrap=selfpmi due to --size option
flux-start: 0 (pid 3287) Segmentation fault
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
ok 1 - sim: started successfully
PASS: t2000-fcfs.t 1 - sim: started successfully
139
not ok 2 - sim: scheduled and ran all jobs
FAIL: t2000-fcfs.t 2 - sim: scheduled and ran all jobs
#
#           timed_sync_wait_job 60
#
ok 3 - jobs scheduled in correct order
PASS: t2000-fcfs.t 3 - jobs scheduled in correct order
# failed 1 among 3 test(s)
ERROR: t2000-fcfs.t - missing test plan

I do get a core file

(gdb) bt full
#0  strlen () at ../sysdeps/x86_64/strlen.S:106
No locals.
#1  0x00007fefe84e363a in json_object_set_new ()
   from /usr/lib/x86_64-linux-gnu/libjansson.so.4
No symbol table info available.
#2  0x00007fefb5930adb in Jadd_double (d=<optimized out>, 
    name=0xffffffff00000000 <error: Cannot access memory at address 0xffffffff00000000>, o=0x7fefac00e4e0) at ../src/common/libutil/shortjansson.h:76
        n = <optimized out>
#3  add_timers_to_json (o=0x7fefac00e4e0, event_time=0x7fefac00e700, 
    key=0xffffffff00000000 <error: Cannot access memory at address 0xffffffff00000000>) at simulator.c:63
No locals.
#4  sim_state_to_json (sim_state=0x7fefac00c9b0) at simulator.c:76
        o = 0x7fefac00f7f0
        event_timers = 0x7fefac00e4e0
        item = 0x7fefac00e700
#5  0x00007fefb592a653 in send_trigger (sim_state=0x7fefac00c9b0, 
    mod_name=0x7fefac00d050 "submit", h=0x7fefac001590) at simsrv.c:82
        rc = 0
        msg = 0x0
        o = 0x0
        topic = 0x0
#6  handle_next_event (ctx=0x7fefac00ca20, ctx=0x7fefac00ca20) at simsrv.c:197
        timers = 0x7fefac00db40
        keys = 0x7fefac00e610
        sim_state = 0x7fefac00c9b0
        rc = 0
        min_event_time = 0x7fefac00e4c0
        curr_event_time = <optimized out>
        mod_name = 0x7fefac00d050 "submit"
        curr_name = <optimized out>
#7  0x00007fefb592ab03 in join_cb (h=0x7fefac001590, w=<optimized out>, 
    msg=<optimized out>, arg=0x7fefac00ca20) at simsrv.c:260
        mod_rank = 0
        request = 0x7fefac00f4f0
        mod_name = 0x7fefac00e5b0 "sched"
        json_str = 0x14e5bd8 "{\"next_event\": -1.0, \"rank\": 0, \"mod_name\": \"sched\"}"
        next_event = 0x0
        ctx = 0x7fefac00ca20
        sim_state = 0x7fefac00c9b0
        size = 1
        __FUNCTION__ = "join_cb"
        timers = <optimized out>
        num_modules = 0
#8  0x00007fefe86f7a74 in call_handler (w=w@entry=0x7fefac00cae0, 
    msg=msg@entry=0x7fefac00d100) at dispatch.c:368
        rolemask = 1
        matchtag = 251658240
#9  0x00007fefe86f8245 in dispatch_message (type=1, msg=0x7fefac00d100, 
    d=0x7fefac00bfa0) at dispatch.c:508
        w = 0x7fefac00cae0
        match = <optimized out>
        rc = -1
#10 handle_cb (r=0x7fefac00b760, hw=<optimized out>, revents=<optimized out>, 
    arg=0x7fefac00bfa0) at dispatch.c:617
        d = 0x7fefac00bfa0
        msg = 0x7fefac00d100
        rc = -1
        type = 1
        match = <optimized out>
        topic = 0x7fefac00e420 "sim.join"
        __FUNCTION__ = "handle_cb"
#11 0x00007fefe8714423 in ev_invoke_pending (loop=0x7fefac00b780) at ev.c:3314
        p = <optimized out>
#12 0x00007fefe8717a7e in ev_run (loop=0x7fefac00b780, flags=0) at ev.c:3717
        flags = 0
        loop = 0x7fefac00b780
#13 0x00007fefe86f6c6d in flux_reactor_run (r=0x7fefac00b760, 
    flags=flags@entry=0) at reactor.c:134
        ev_flags = 0
        count = <optimized out>
#14 0x00007fefb592b0f7 in mod_main (h=0x7fefac001590, argc=<optimized out>, 
    argv=<optimized out>) at simsrv.c:477
        args = <optimized out>
        ctx = 0x7fefac00ca20
        eoc_str = <optimized out>
        exit_on_complete = false
        rank = 0
#15 0x000000000040ac8c in module_thread (arg=0x150e2e0) at module.c:158
        p = 0x150e2e0
        __PRETTY_FUNCTION__ = "module_thread"
        signal_set = {__val = {18446744067267100671, 
            18446744073709551615 <repeats 15 times>}}
        errnum = <optimized out>
        uri = 0x7fefac000930 "shmem://F390406E14102E512CE91D9D3FF4B89C"
        av = 0x7fefac00c830
        rankstr = 0x7fefac005810 "0"
        ac = <optimized out>
        mod_main_errno = 0
        msg = <optimized out>
#16 0x00007fefe7bbc6ba in start_thread (arg=0x7fefb52e9700)
    at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fefb52e9700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140667513640704, 
                2600844154381796900, 0, 140722778085487, 140667513641408, 
                22078768, -2610003489399241180, -2609903547941255644}, 
              mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#17 0x00007fefe76ee82d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

SteVwonder commented 6 years ago

Loading the modules by hand, it seems that the sim.start event sent by the sim module is received properly by sched, simexec, and submit. Those modules then each individually send out sim.join request messages, which don't seem to be received by the sim module. After our meetings today, I can step through with GDB.
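
The handshake described above can be sketched roughly as follows. This is a toy model, not the simsrv.c implementation: the class and method names are hypothetical, the payload fields are copied from the `sim.join` JSON visible in the backtrace, and treating a negative `next_event` as "no pending event" is an assumption.

```python
import json

class SimSketch:
    """Toy stand-in for the sim module's join handling."""

    def __init__(self, expected):
        self.expected = set(expected)  # modules that must join before the sim runs
        self.timers = {}               # mod_name -> next event time

    def handle_join(self, payload: str):
        """Record one sim.join request; return the module to trigger, if any."""
        req = json.loads(payload)
        self.timers[req["mod_name"]] = req["next_event"]
        if set(self.timers) != self.expected:
            return None  # still waiting for other modules to join
        # Assumption: a negative next_event means the module has nothing pending.
        pending = {m: t for m, t in self.timers.items() if t >= 0}
        return min(pending, key=pending.get) if pending else None

sim = SimSketch(["sched", "simexec", "submit"])
joins = [
    '{"next_event": 0.0, "rank": 0, "mod_name": "submit"}',
    '{"next_event": -1.0, "rank": 0, "mod_name": "simexec"}',
    '{"next_event": -1.0, "rank": 0, "mod_name": "sched"}',
]
triggered = [sim.handle_join(j) for j in joins]
print(triggered)
```

In this model nothing is triggered until the last expected module has joined, so if the join requests never reach the sim module, no job would ever start, which would match the timeout seen in the tests.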

garlick commented 6 years ago

I'm running out of time, but the stack trace I posted above points to the code I changed for newer czmq.
Possibly I'm chasing something self-inflicted here, and the real problem (that occurs in travis) was introduced between flux-core PR 1079 (May 25) and PR 1082 (May 31).

Nothing jumps out at me in that interval, though there is some reactor work in there.

garlick commented 6 years ago

OK, I dropped the incorrect KVS key name change and added a commit to temporarily disable the failing tests, referencing #249, which I opened to track the root cause of the failure, which is unrelated to this PR.

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-18.06%) to 55.424% when pulling 0b82d071e404d6e84439d027326ad7caad817275 on garlick:rpc_future into 2b0f5c56e183fcd52549451d56ab758ffbead5c6 on flux-framework:master.

garlick commented 6 years ago

CI passed, although coverage took a ding with the tests disabled, hence the red x. I think this is ready to be considered for a merge.

SteVwonder commented 6 years ago

@garlick, as we discussed in group meeting today, I'll merge this now and take a look at the simulator failure asynchronously. Hopefully I'll have a PR up by the end of the week that fixes https://github.com/flux-framework/flux-sched/issues/249.

garlick commented 6 years ago

Thanks! I didn't mean to dump it all on you so please let me know how I can help.