Closed garlick closed 6 years ago
LGTM. @SteVwonder?
Yep. LGTM too once travis gives the ok.
Restarted build since flux-core updates required here are now merged.
Apologies! Missed one. Pushing update.
Sorry for the false starts here! I'm debugging these on my desktop right now but need to head out to lunch so will pick it up when I get back:
PASS: t2000-fcfs.t 1 - sim: started successfully
FAIL: t2000-fcfs.t 2 - sim: scheduled and ran all jobs
PASS: t2000-fcfs.t 3 - jobs scheduled in correct order
ERROR: t2000-fcfs.t - exited with status 1
PASS: t2001-fcfs-aware.t 1 - sim: started successfully
FAIL: t2001-fcfs-aware.t 2 - sim: scheduled and ran all jobs
PASS: t2001-fcfs-aware.t 3 - jobs scheduled in correct order
ERROR: t2001-fcfs-aware.t - exited with status 1
PASS: t2002-easy.t 1 - sim: started successfully
FAIL: t2002-easy.t 2 - sim: scheduled and ran all jobs
FAIL: t2002-easy.t 3 - jobs scheduled in correct order
ERROR: t2002-easy.t - exited with status 1
Yeah, unfortunately, the simulator doesn't have any unit tests, so my apologies for the unhelpful error/testing messages. It is very odd that the scheduling/running of jobs failed but for some of the tests the jobs were scheduled in the correct order. Not sure how the order diff can be correct when the previous test isn't. It is also very odd that the code coverage dropped by 12%. Maybe the RPC is failing and the sim/sched are exiting prematurely?
Merging #246 into master will decrease coverage by 17.3%. The diff coverage is 0%.
@@ Coverage Diff @@
## master #246 +/- ##
===========================================
- Coverage 69.63% 52.32% -17.31%
===========================================
Files 25 25
Lines 5246 5248 +2
===========================================
- Hits 3653 2746 -907
- Misses 1593 2502 +909
Impacted Files | Coverage Δ | |
---|---|---|
simulator/simulator.c | 1.04% <0%> (-77.2%) | :arrow_down: |
simulator/simsrv.c | 0% <0%> (-73.92%) | :arrow_down: |
simulator/submitsrv.c | 0% <0%> (-84.31%) | :arrow_down: |
simulator/sim_execsrv.c | 0% <0%> (-85.28%) | :arrow_down: |
sched/sched_backfill.c | 14.37% <0%> (-76.48%) | :arrow_down: |
sched/rsreader.c | 75.18% <0%> (-21.17%) | :arrow_down: |
sched/rs2rank.c | 79.13% <0%> (-14.79%) | :arrow_down: |
... and 4 more | | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b0f5c5...0b82d07.
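The percentages in the Codecov diff above can be reproduced from the Hits and Lines rows, since Hits plus Misses equals Lines in both columns. The following is a minimal sketch of that arithmetic; `coverage_pct` is a hypothetical helper name, not part of Codecov or flux-sched.

```c
#include <assert.h>

/* Hypothetical helper: line coverage as a percentage (hits / lines * 100). */
double coverage_pct(long hits, long lines)
{
    assert(lines > 0);   /* a report with no lines has no defined coverage */
    return 100.0 * (double)hits / (double)lines;
}
```

With the numbers from the report, `coverage_pct(3653, 5246)` is about 69.63 and `coverage_pct(2746, 5248)` is about 52.32, so the difference is about 17.31 points.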
I pushed a couple more commits: one to work with newer czmq, and one to fix a bug where an incorrect key was used to pull the start time out of the KVS (please review cfa0bc6 @SteVwonder), but didn't get to the bottom of why jobs aren't running. We're hitting the 60s timeout on timed_sync_wait_job and it doesn't look like anything has started.
Will keep looking. I'll revert the RPC change in flux-core if I can't find this by tomorrow.
@garlick, this key should probably have a better, less ambiguous name, but when running with the simexec module loaded, job start times (w.r.t. simulation time) are stored in "starting_time" and the wallclock time is stored in "starting-time". For this test, I believe we want the simulation time, so "starting_time" should be correct.
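To make the distinction between the two keys harder to miss, here is a minimal sketch that names them explicitly. Only the two key strings come from the comment above; the macro names and both helper functions are hypothetical and do not exist in flux-sched.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Key strings as described in the comment above; the macro names are made up. */
#define KEY_START_SIM  "starting_time"   /* start time in simulation time */
#define KEY_START_WALL "starting-time"   /* start time in wallclock time */

/* Hypothetical helper: pick the key for the kind of start time wanted. */
const char *start_time_key(bool want_sim_time)
{
    return want_sim_time ? KEY_START_SIM : KEY_START_WALL;
}

/* Hypothetical helper: true if key names the simulation-time entry. */
bool is_sim_time_key(const char *key)
{
    assert(key != NULL);
    return strcmp(key, KEY_START_SIM) == 0;
}
```

For the simulator tests discussed here, `start_time_key(true)` would yield "starting_time", which matches the key the comment above says is correct.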
Let me grab this PR and give it a whirl before you revert changes in flux-core.
Ah sorry about that.
On my desktop I backed up flux core to just before RPC changes in PR #1089 and the same tests are failing on flux-sched master. Do we have a known good flux-core? If not I'll bisect.
No worries. I am sorry I chose such a poor name.
I do not know. Maybe @dongahn or @lipari knows?
Checkpointing some results:
So it seems this goes back a ways. Not sure whether the Apr 4 - May 25 failure is the same root cause, or maybe something that didn't show up in travis since the last change merged to flux-sched (Apr 13) presumably passed with flux-core master at the time.
Presumably the flux-core that built in travis against current flux-sched master would have been at PR 1032 (Apr 13). This fails for me as well using flux-sched master (Apr 13) in the same way as the May 25 failure described above.
flux-start: warning: setting --bootstrap=selfpmi due to --size option
flux-start: 0 (pid 3287) Segmentation fault
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
flux-kvs: flux_open: Connection refused
ok 1 - sim: started successfully
PASS: t2000-fcfs.t 1 - sim: started successfully
139
not ok 2 - sim: scheduled and ran all jobs
FAIL: t2000-fcfs.t 2 - sim: scheduled and ran all jobs
#
# timed_sync_wait_job 60
#
ok 3 - jobs scheduled in correct order
PASS: t2000-fcfs.t 3 - jobs scheduled in correct order
# failed 1 among 3 test(s)
ERROR: t2000-fcfs.t - missing test plan
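The bare 139 in the output above is consistent with the common shell convention of reporting 128 plus the signal number when a process dies from a signal, which would match the segmentation fault reported by flux-start. This is an interpretation of the log, not something the test output states. A small sketch of the decoding, with a hypothetical helper name:

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical helper: for a shell-style exit status above 128, return the
 * signal number it encodes (status - 128); otherwise return 0. */
int signal_from_shell_status(int status)
{
    assert(status >= 0 && status <= 255);
    return status > 128 ? status - 128 : 0;
}
```

On Linux, `signal_from_shell_status(139)` equals `SIGSEGV`.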
I do get a core file
(gdb) bt full
#0 strlen () at ../sysdeps/x86_64/strlen.S:106
No locals.
#1 0x00007fefe84e363a in json_object_set_new ()
from /usr/lib/x86_64-linux-gnu/libjansson.so.4
No symbol table info available.
#2 0x00007fefb5930adb in Jadd_double (d=<optimized out>,
name=0xffffffff00000000 <error: Cannot access memory at address 0xffffffff00000000>, o=0x7fefac00e4e0) at ../src/common/libutil/shortjansson.h:76
n = <optimized out>
#3 add_timers_to_json (o=0x7fefac00e4e0, event_time=0x7fefac00e700,
key=0xffffffff00000000 <error: Cannot access memory at address 0xffffffff00000000>) at simulator.c:63
No locals.
#4 sim_state_to_json (sim_state=0x7fefac00c9b0) at simulator.c:76
o = 0x7fefac00f7f0
event_timers = 0x7fefac00e4e0
item = 0x7fefac00e700
#5 0x00007fefb592a653 in send_trigger (sim_state=0x7fefac00c9b0,
mod_name=0x7fefac00d050 "submit", h=0x7fefac001590) at simsrv.c:82
rc = 0
msg = 0x0
o = 0x0
topic = 0x0
#6 handle_next_event (ctx=0x7fefac00ca20, ctx=0x7fefac00ca20) at simsrv.c:197
timers = 0x7fefac00db40
keys = 0x7fefac00e610
sim_state = 0x7fefac00c9b0
rc = 0
min_event_time = 0x7fefac00e4c0
curr_event_time = <optimized out>
mod_name = 0x7fefac00d050 "submit"
curr_name = <optimized out>
#7 0x00007fefb592ab03 in join_cb (h=0x7fefac001590, w=<optimized out>,
msg=<optimized out>, arg=0x7fefac00ca20) at simsrv.c:260
mod_rank = 0
request = 0x7fefac00f4f0
mod_name = 0x7fefac00e5b0 "sched"
json_str = 0x14e5bd8 "{\"next_event\": -1.0, \"rank\": 0, \"mod_name\": \"sched\"}"
next_event = 0x0
ctx = 0x7fefac00ca20
sim_state = 0x7fefac00c9b0
size = 1
__FUNCTION__ = "join_cb"
timers = <optimized out>
num_modules = 0
#8 0x00007fefe86f7a74 in call_handler (w=w@entry=0x7fefac00cae0,
msg=msg@entry=0x7fefac00d100) at dispatch.c:368
rolemask = 1
matchtag = 251658240
#9 0x00007fefe86f8245 in dispatch_message (type=1, msg=0x7fefac00d100,
d=0x7fefac00bfa0) at dispatch.c:508
w = 0x7fefac00cae0
match = <optimized out>
rc = -1
#10 handle_cb (r=0x7fefac00b760, hw=<optimized out>, revents=<optimized out>,
arg=0x7fefac00bfa0) at dispatch.c:617
d = 0x7fefac00bfa0
msg = 0x7fefac00d100
rc = -1
type = 1
match = <optimized out>
topic = 0x7fefac00e420 "sim.join"
__FUNCTION__ = "handle_cb"
#11 0x00007fefe8714423 in ev_invoke_pending (loop=0x7fefac00b780) at ev.c:3314
p = <optimized out>
#12 0x00007fefe8717a7e in ev_run (loop=0x7fefac00b780, flags=0) at ev.c:3717
flags = 0
loop = 0x7fefac00b780
#13 0x00007fefe86f6c6d in flux_reactor_run (r=0x7fefac00b760,
flags=flags@entry=0) at reactor.c:134
ev_flags = 0
count = <optimized out>
#14 0x00007fefb592b0f7 in mod_main (h=0x7fefac001590, argc=<optimized out>,
argv=<optimized out>) at simsrv.c:477
args = <optimized out>
ctx = 0x7fefac00ca20
eoc_str = <optimized out>
exit_on_complete = false
rank = 0
#15 0x000000000040ac8c in module_thread (arg=0x150e2e0) at module.c:158
p = 0x150e2e0
__PRETTY_FUNCTION__ = "module_thread"
signal_set = {__val = {18446744067267100671,
18446744073709551615 <repeats 15 times>}}
errnum = <optimized out>
uri = 0x7fefac000930 "shmem://F390406E14102E512CE91D9D3FF4B89C"
av = 0x7fefac00c830
rankstr = 0x7fefac005810 "0"
ac = <optimized out>
mod_main_errno = 0
msg = <optimized out>
#16 0x00007fefe7bbc6ba in start_thread (arg=0x7fefb52e9700)
at pthread_create.c:333
__res = <optimized out>
pd = 0x7fefb52e9700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140667513640704,
2600844154381796900, 0, 140722778085487, 140667513641408,
22078768, -2610003489399241180, -2609903547941255644},
mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#17 0x00007fefe76ee82d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
Loading the modules by hand, it seems that the sim.start event sent by the sim module is received properly by sched, simexec, and submit. Those modules then each individually send out sim.join request messages, which don't seem to be received by the sim module. After our meetings today, I can step through with GDB.
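The handshake described above can be modelled with a toy sketch. It assumes the sim module waits until every module that received sim.start has sent sim.join before it advances; that assumption, the struct, and all function names here are illustrative only and are not taken from simsrv.c or the Flux API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_MODULES 8

/* Toy stand-in for the sim module's bookkeeping (not the real simsrv.c state). */
struct toy_sim {
    const char *expected[MAX_MODULES];  /* modules that should answer sim.start */
    bool joined[MAX_MODULES];           /* whether a sim.join has been seen */
    int n_expected;
};

void toy_sim_init(struct toy_sim *s, const char **names, int n)
{
    assert(n >= 0 && n <= MAX_MODULES);
    s->n_expected = n;
    for (int i = 0; i < n; i++) {
        s->expected[i] = names[i];
        s->joined[i] = false;
    }
}

/* Record a sim.join from mod_name; false for unknown or duplicate joins. */
bool toy_sim_join(struct toy_sim *s, const char *mod_name)
{
    for (int i = 0; i < s->n_expected; i++) {
        if (strcmp(s->expected[i], mod_name) == 0) {
            if (s->joined[i])
                return false;
            s->joined[i] = true;
            return true;
        }
    }
    return false;
}

/* Assumption: the simulation only advances once every module has joined. */
bool toy_sim_ready(const struct toy_sim *s)
{
    for (int i = 0; i < s->n_expected; i++)
        if (!s->joined[i])
            return false;
    return true;
}
```

Under this model, if the sim.join requests from sched, simexec, and submit never reach the sim module, `toy_sim_ready` never becomes true, which would be consistent with nothing starting before the 60s timeout.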
I'm running out of time, but the stack trace I posted above points to the code I changed for newer czmq.
Possibly I'm chasing something self-inflicted here, and the real problem (that occurs in travis) was introduced between flux-core PR 1082 (May 31) and 1079 (May 25).
Nothing jumps out at me in that interval, though there is some reactor work in there.
OK, I dropped the incorrect KVS key name change and added a commit to temporarily disable the failing tests, referencing #249, which I opened to track the root cause of the failure, which is unrelated to this PR.
CI passed, although coverage took a ding with the tests disabled, hence the red x. I think this is ready to be considered for a merge.
@garlick, as we discussed in group meeting today, I'll merge this now and take a look at the simulator failure asynchronously. Hopefully I'll have a PR up by the end of the week that fixes https://github.com/flux-framework/flux-sched/issues/249.
Thanks! I didn't mean to dump it all on you so please let me know how I can help.
This just updates the single RPC in flux-sched for API changes proposed in flux-framework/flux-core#1089
Travis should be re-run after that gets merged.