grondo opened this issue 5 years ago
Hmmm, internally, the `kvs-watch` module watches for the exact key path being written to. I think the issue may be that you are writing to the key `test0` but waiting on the full key path `job.ID.guest.test0`. The `kvs_copy` that is done underneath the covers by `job-exec` copies the namespace to `job.ID.guest`, so the creation of the key `job.ID.guest.test0` is never seen.
I think the only reason it passes sometimes is because the namespace has already been copied over fully before the `--waitcreate` is called (i.e. the job has completely finished and has exited).
I tried doing:

```sh
flux kvs get --waitcreate --watch --full -c1 ${kvsdir}.guest.test0
```

and no longer got any hangs. The `--watch --full -c1` basically gets the first value from `test0` no matter what keys are specified (it does a full lookup on every setroot event that's sent out). The `--waitcreate` simply will wait if the key doesn't exist at first.
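Concretely, the two forms being compared look like this (a minimal sketch; the key path is the one used in the reproducer further down):

```sh
# Exact-path watch: the kvs-watch module only notifies when this specific
# path is the key being written, so a key that only appears via a copy of
# the whole guest directory is never reported.
flux kvs get --waitcreate ${kvsdir}.guest.test0

# Full watch: the lookup is redone on every commit and the first value
# found is returned, so keys that appear via a directory-level copy or
# link are also picked up.
flux kvs get --waitcreate --watch --full -c1 ${kvsdir}.guest.test0
```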
Perhaps this is undesirable (it took me some time to figure out) and we need to think of something that is less confusing to users?
Thinking about this a tad more, perhaps allowing `--waitcreate` and `--full` to be specified together would make things better? Because right now that isn't allowed. Skimming through the code, I think it is supported in the `kvs-watch` module, but `--waitcreate` and `--full` together just aren't allowed in the `flux-kvs` command.
Now that you reminded me how it works, the result makes sense.
> I think the only reason it passes sometimes is because the namespace has already been copied over fully before the `--waitcreate` is called (i.e. the job has completely finished and has exited).
This makes sense, but my original testcase was more like the script below (we wait to synchronize with the job, then cancel it), and it was sometimes successful:
```sh
#!/bin/sh
size=$(flux getattr size)
id=$(flux jobspec srun -N$size \
        sh -c 'flux kvs put test${FLUX_TASK_RANK}=started && sleep 60' \
        | flux job submit)
kvsdir=$(flux job id --to=kvs $id)
flux kvs get --waitcreate ${kvsdir}.guest.test0
flux job cancel $id
flux job wait-event -v ${id} clean
```
```
$ for i in `seq 1 10`; do echo Run $i; ./src/cmd/flux start -s4 sh -x ./test.sh; done
Run 1
flux-start: warning: setting --bootstrap=selfpmi due to --size option
+ flux getattr size
+ size=4
+ flux jobspec srun -N4 sh+ -c flux kvs put test${FLUX_TASK_RANK}=started && sleep 60
flux job submit
+ id=7600078848
+ flux job id --to=kvs 7600078848
+ kvsdir=job.0000.0001.c500.0000
+ flux kvs get --waitcreate job.0000.0001.c500.0000.guest.test0
started
+ flux job cancel 7600078848
+ flux job wait-event -v 7600078848 clean
1563636326.996614 submit userid=1000 priority=16 flags=0
1563636327.010464 depend
1563636327.013440 alloc note="rank[0-3]/core0"
1563636327.028394 start
1563636327.142972 exception type="cancel" severity=0 userid=1000 note=""
1563636327.148500 finish status=36608
1563636327.155994 release ranks="all" final=true
1563636327.156783 free
1563636327.156801 clean
Run 2
flux-start: warning: setting --bootstrap=selfpmi due to --size option
+ flux getattr size
+ size=4
+ flux jobspec srun -N4 sh -c flux kvs put test${FLUX_TASK_RANK}=started && sleep 60
+ flux job submit
+ id=7549747200
+ flux job id --to=kvs 7549747200
+ kvsdir=job.0000.0001.c200.0000
+ flux kvs get --waitcreate job.0000.0001.c200.0000.guest.test0
```
Maybe another subtlety I didn't notice happening here? (Sorry if I'm just being dumb)
> Perhaps this is undesirable (it took me some time to figure out) and we need to think of something that is less confusing to users?
Yeah, I'd say at this point users might stumble across this kind of subtlety all the time.
I think with a flag like `kvs get --waitcreate` it should probably do the safe, slow thing by default, and enable more efficient subtle modes with extra flags. This is just because of what the flag seems to imply to users. Your suggestion of adding "full" support to waitcreate and doing that by default sounds good.
Another observation is that it is very useful (and therefore tempting) to use the kvs as synchronization for part of a job like this. Maybe eventually we need a canned API and tool to do this the "right way", implemented efficiently under the covers. Especially if the KVS isn't the right approach.
One other result: if I add a `flux kvs dir -R ${kvsdir}.guest` before the `flux kvs get --waitcreate ...`, then I can't get the test to fail anymore... or failure becomes much more rare.
(reproducer cleaned up to be less verbose):
```sh
#!/bin/sh
size=$(flux getattr size)
id=$(flux jobspec srun -N$size \
        sh -c 'flux kvs put test${FLUX_TASK_RANK}=started && sleep 60' \
        | flux job submit)
kvsdir=$(flux job id --to=kvs $id)
flux kvs dir -R ${kvsdir}.guest >/dev/null
flux kvs get --waitcreate ${kvsdir}.guest.test0
echo "synchronized with $id"
flux job cancel $id
flux job wait-event ${id} clean
```
```
$ for i in `seq 1 10`; do echo Run $i; ./src/cmd/flux start -s4 --bootstrap=selfpmi ./test.sh; done
Run 1
started
synchronized with 7751073792
1563637453.106305 clean
Run 2
started
synchronized with 7734296576
1563637455.350194 clean
Run 3
started
synchronized with 7767851008
1563637457.616599 clean
Run 4
started
synchronized with 7734296576
1563637459.866584 clean
Run 5
started
synchronized with 7633633280
1563637462.137419 clean
Run 6
started
synchronized with 7700742144
1563637464.386885 clean
Run 7
started
synchronized with 7650410496
1563637466.615579 clean
Run 8
started
synchronized with 7700742144
1563637468.874275 clean
Run 9
started
synchronized with 7784628224
1563637471.141163 clean
Run 10
started
synchronized with 7633633280
1563637473.368491 clean
```
Without the `flux kvs dir -R` the test hangs on the first run.
> Maybe another subtlety I didn't notice happening here? (Sorry if I'm just being dumb)
Hmmm, yeah. I would think with the `sleep 60` it would always hang. Not sure how it could sometimes be successful.
> I think with a flag like `kvs get --waitcreate` it should probably do the safe, slow thing by default, and enable more efficient subtle modes with extra flags. This is just because of what the flag seems to imply to users. Your suggestion of adding "full" support to waitcreate and doing that by default sounds good.
As another idea, perhaps for the `flux-kvs` command we enable `--full` all of the time since it's the common user interface? For users of the API, they have to understand what to do?
Wait a second. I thought after the job completes, the private KVS namespace is moved to the guest key. But in your reproducer you cancel a jobid that already has access to the guest namespace.
Perhaps my assumption about the former point isn't correct. The guest namespace can be created earlier than I thought?
> As another idea, perhaps for the `flux-kvs` command we enable `--full` all of the time since it's the common user interface? For users of the API, they have to understand what to do?
That seems like a good first-order compromise to me, though I have to admit I'm a bit out of my depth at this time and would certainly defer to you and @garlick.
> Wait a second. I thought after the job completes, the private KVS namespace is moved to the guest key. But in your reproducer you cancel a jobid that already has access to the guest namespace.
The job-exec module creates the guest namespace and creates a link to it from `job.<id>.guest`. This is guaranteed to have completed before the `start` event is emitted for the job. Once the job is complete, the link is replaced with a copy of the guest namespace, making it read-only.
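One way to observe that transition from the primary namespace is to look at the treeobj for the guest key before and after the job finishes. This is just a sketch; I'm assuming the link shows up as a `symlink` treeobj and the final copy as a `dirref`:

```sh
kvsdir=$(flux job id --to=kvs $id)

# While the job is running, job.<id>.guest should resolve to a link
# into the per-job guest namespace:
flux kvs get --treeobj ${kvsdir}.guest

# After the clean event, the link has been replaced with a read-only
# copy of the namespace contents:
flux job wait-event ${id} clean
flux kvs get --treeobj ${kvsdir}.guest
```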
> The job-exec module creates the guest namespace and creates a link to it from `job.<id>.guest`. This is guaranteed to have completed before the `start` event is emitted for the job.
Ahhh, I was thinking it was only created with the move at the end.
If the initial check fails on `flux kvs get --waitcreate ${kvsdir}.guest.test0`, we know we'll never see this key. So there's probably just a bit of raciness with the initial create of that link. Perhaps the `flux kvs dir -R` slows things down enough to make things work every time. Last night, I was unable to reproduce by just sticking a few `flux kvs version` calls into your reproducer.
I added a `flux job wait-event ${id} start` to the test (I thought I had tried this before), and now I haven't seen a failure. This ensures that the guest namespace is created and linked before doing the `flux kvs get --waitcreate`. Perhaps that is also sufficient?
```sh
#!/bin/sh
size=$(flux getattr size)
id=$(flux jobspec srun -N$size \
        sh -c 'flux kvs put test${FLUX_TASK_RANK}=started && sleep 60' \
        | flux job submit)
kvsdir=$(flux job id --to=kvs $id)
#flux kvs dir -R ${kvsdir}.guest >/dev/null
flux job wait-event ${id} start
flux kvs get --waitcreate ${kvsdir}.guest.test0
echo "synchronized with $id"
flux job cancel $id
flux job wait-event ${id} clean
```
> If the initial check fails on `flux kvs get --waitcreate ${kvsdir}.guest.test0`, we know we'll never see this key.
I'm probably missing something, but isn't that the purpose of the `--waitcreate` flag, to wait for a key that may not exist at the first check?
> I'm probably missing something, but isn't that the purpose of the `--waitcreate` flag, to wait for a key that may not exist at the first check?
Yes, but `kvs-watch` only checks exact matches on paths. We know that the specific path `job.id.guest.test0` will never be created; only `test0` and `job.id.guest` are created.
> We know that the specific path `job.id.guest.test0` will never be created
I need a tutorial :laughing: How can we know `job.id.guest.test0` will never be created? What if I delete the link and then run `flux kvs put job.id.guest.test0=1`?
> I need a tutorial :laughing: How can we know `job.id.guest.test0` will never be created?
:-) I'm probably not explaining it well.
The `kvs-watch` module (without the `--full` option) matches changes on the exact paths specified. So for example:
```sh
flux kvs get --watch a.b
flux kvs put a.b=1
```
the change to `a.b` would get detected, because we are specifically changing the key `a.b`.
```sh
flux kvs get --watch a.b &
flux kvs put c.b=2
TREEOBJ=$(flux kvs get --treeobj c)
flux kvs put --treeobj a=$TREEOBJ
```
In this latter example, we've changed the value of `a.b`, but it's through an overwrite of the entire directory `a`. To the kvs-watch module, only `a` has been changed, not `a.b`. Because the specific key `a.b` is not specified in this change, the change will be missed by the watch.
`--full` would detect this change. `--full` basically makes the watch heavyweight: anytime any change occurs in the KVS, a lookup is done to see if `a.b` has changed. This was the way watch was implemented in the past, but it was far too heavyweight.
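Continuing the example above, the directory-level overwrite that the plain watch misses should be reported when `--full` is used (a sketch with the same hypothetical keys `a` and `a.b`):

```sh
# With --full, every commit to the namespace triggers a fresh lookup of
# a.b, and the new value is printed whenever it changes, regardless of
# which key was actually written.  (Runs until interrupted.)
flux kvs get --watch --full a.b &

flux kvs put c.b=2
TREEOBJ=$(flux kvs get --treeobj c)
flux kvs put --treeobj a=$TREEOBJ   # changes a.b via an overwrite of "a"
```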
> What if I delete the link and then run `flux kvs put job.id.guest.test0=1`?
So this should work.
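For instance, something like the following should be noticed by an exact-path watcher, since the put names the full path explicitly (just a sketch with the same illustrative path; `flux kvs unlink` removes the link first):

```sh
# Watcher waits on the exact path.
flux kvs get --watch --waitcreate job.id.guest.test0 &

# Remove the link and write the exact key path; because the specific
# key being watched is now the one that changes, the watch sees it.
flux kvs unlink job.id.guest
flux kvs put job.id.guest.test0=1
```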
> Thinking about this a tad more, perhaps allowing `--waitcreate` and `--full` to be specified together would make things better? Because right now that isn't allowed. Skimming through the code, I think it is supported in the `kvs-watch` module, but `--waitcreate` and `--full` together just aren't allowed in the `flux-kvs` command.
I tried to add this support this morning, but it seems the `kvs-watch` module doesn't grok `WAITCREATE` and `FULL` flags at the moment. I'm going to go ahead and create an issue for it.
Hm, I'm still getting a failure related to this issue and I don't quite understand what I'm doing wrong (probably something dumb!)
I have the following testcase, which fails to synchronize with the "job" when using `flux kvs get --watch --waitcreate --full -c1 ...`, but succeeds if I replace this command with a call to `flux kvs get` in a loop (please excuse the excess of debug lines):
```sh
kvs_get_retry() {
    id=$1
    key=$(job_kvsdir $id).$2
    i=0
    flux kvs get $key
    while [ $? -ne 0 ] && [ $i -lt 25 ]; do
        i=$((i+1))
        echo >&2 "try $i"
        flux dmesg | grep $id >&2
        flux kvs get $key
    done
    if [ "$i" -eq "25" ]; then return 1; fi
    return 0
}
```
```sh
test_expect_success 'job-exec: job exception uses SIGKILL after kill-timeout' '
flux setattr job-exec.kill_timeout 0.2 &&
cat <<-EOF >trap-sigterm.sh &&
#!/bin/dash
trap "echo trap-sigterm got SIGTERM" 15
flux kvs put trap-ready=1
sleep 60 &
pid=\$!
wait \$pid
EOF
chmod +x trap-sigterm.sh &&
id=$(TRAP=15 flux jobspec srun -N4 ./trap-sigterm.sh \
    | flux job submit) &&
flux job wait-event -vt 1 $id start &&
test_debug "echo waiting for trap-ready in kvs" &&
#kvs_get_retry $id guest.trap-ready &&
run_timeout 5 flux kvs get -c1 --watch --full --waitcreate \
    $(job_kvsdir $id).guest.trap-ready &&
flux job cancel $id &&
test_debug "echo waiting for $id clean" &&
flux job wait-event -vt 1 $id clean &&
flux dmesg | grep $id > trap.out &&
test_debug "cat trap.out" &&
grep "trap-sigterm got SIGTERM" trap.out &&
flux setattr --expunge job-exec.kill_timeout
'
```
The version above hangs in `flux kvs get --watch ...`:
```
expecting success:
flux setattr job-exec.kill_timeout 0.2 &&
cat <<-EOF >trap-sigterm.sh &&
#!/bin/dash
trap "echo trap-sigterm got SIGTERM" 15
flux kvs put trap-ready=1
sleep 60 &
pid=\$!
wait \$pid
EOF
chmod +x trap-sigterm.sh &&
id=$(TRAP=15 flux jobspec srun -N4 ./trap-sigterm.sh \
| flux job submit) &&
flux job wait-event -vt 1 $id start &&
test_debug "echo waiting for trap-ready in kvs" &&
#kvs_get_retry $id guest.trap-ready &&
run_timeout 5 flux kvs get -c1 --watch --full --waitcreate \
$(job_kvsdir $id).guest.trap-ready &&
flux job cancel $id &&
test_debug "echo waiting for $id clean" &&
flux job wait-event -vt 1 $id clean &&
flux dmesg | grep $id > trap.out &&
test_debug "cat trap.out" &&
grep "trap-sigterm got SIGTERM" trap.out &&
flux setattr --expunge job-exec.kill_timeout
1563916681.164301 submit userid=1000 priority=16 flags=0
1563916681.177734 depend
1563916681.180455 alloc note="rank[0-3]/core0"
1563916681.192677 start
waiting for trap-ready in kvs
Alarm clock
not ok 8 - job-exec: job exception uses SIGKILL after kill-timeout
```
If I switch to the kvs-get-in-a-loop version, the test succeeds:
```
expecting success:
flux setattr job-exec.kill_timeout 0.2 &&
cat <<-EOF >trap-sigterm.sh &&
#!/bin/dash
trap "echo trap-sigterm got SIGTERM" 15
flux kvs put trap-ready=1
sleep 60 &
pid=\$!
wait \$pid
EOF
chmod +x trap-sigterm.sh &&
id=$(TRAP=15 flux jobspec srun -N4 ./trap-sigterm.sh \
| flux job submit) &&
flux job wait-event -vt 1 $id start &&
test_debug "echo waiting for trap-ready in kvs" &&
kvs_get_retry $id guest.trap-ready &&
#run_timeout 5 flux kvs get -c1 --watch --full --waitcreate \
# $(job_kvsdir $id).guest.trap-ready &&
flux job cancel $id &&
test_debug "echo waiting for $id clean" &&
flux job wait-event -vt 1 $id clean &&
flux dmesg | grep $id > trap.out &&
test_debug "cat trap.out" &&
grep "trap-sigterm got SIGTERM" trap.out &&
flux setattr --expunge job-exec.kill_timeout
1563916765.502804 submit userid=1000 priority=16 flags=0
1563916765.516373 depend
1563916765.519073 alloc note="rank[0-3]/core0"
1563916765.532530 start
waiting for trap-ready in kvs
flux-kvs: job.0000.0009.a100.0000.guest.trap-ready: No such file or directory
try 1
2019-07-23T21:19:25.517319Z sched-simple.debug[0]: req: 41355837440: spec={4,4,1}
2019-07-23T21:19:25.517430Z sched-simple.debug[0]: alloc: 41355837440: rank[0-3]/core0
2019-07-23T21:19:25.660029Z job-exec.info[0]: 41355837440: 0: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.671528Z job-exec.info[0]: 41355837440: 1: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.673476Z job-exec.info[0]: 41355837440: 3: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.695626Z job-exec.info[0]: 41355837440: 2: STDOUT: Running ./trap-sigterm.sh
1
waiting for 41355837440 clean
1563916765.502804 submit userid=1000 priority=16 flags=0
1563916765.516373 depend
1563916765.519073 alloc note="rank[0-3]/core0"
1563916765.532530 start
1563916765.774688 exception type="cancel" severity=0 userid=1000 note=""
1563916765.780810 finish status=36608
1563916765.786882 release ranks="all" final=true
1563916765.787584 free
1563916765.787603 clean
2019-07-23T21:19:25.517319Z sched-simple.debug[0]: req: 41355837440: spec={4,4,1}
2019-07-23T21:19:25.517430Z sched-simple.debug[0]: alloc: 41355837440: rank[0-3]/core0
2019-07-23T21:19:25.660029Z job-exec.info[0]: 41355837440: 0: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.671528Z job-exec.info[0]: 41355837440: 1: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.673476Z job-exec.info[0]: 41355837440: 3: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.695626Z job-exec.info[0]: 41355837440: 2: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.774887Z job-exec.debug[0]: exec aborted: id=41355837440
2019-07-23T21:19:25.774969Z job-exec.debug[0]: exec_kill: 41355837440: signal 15
2019-07-23T21:19:25.775366Z job-exec.info[0]: 41355837440: 0: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.777350Z job-exec.info[0]: 41355837440: 0: STDOUT: got signal 15
2019-07-23T21:19:25.777473Z job-exec.info[0]: 41355837440: 2: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779042Z job-exec.info[0]: 41355837440: 2: STDOUT: got signal 15
2019-07-23T21:19:25.779217Z job-exec.info[0]: 41355837440: 1: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779246Z job-exec.info[0]: 41355837440: 1: STDOUT: got signal 15
2019-07-23T21:19:25.779270Z job-exec.info[0]: 41355837440: 3: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779495Z job-exec.info[0]: 41355837440: 3: STDOUT: got signal 15
2019-07-23T21:19:25.775366Z job-exec.info[0]: 41355837440: 0: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.777473Z job-exec.info[0]: 41355837440: 2: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779217Z job-exec.info[0]: 41355837440: 1: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779270Z job-exec.info[0]: 41355837440: 3: STDOUT: trap-sigterm got SIGTERM
ok 8 - job-exec: job exception uses SIGKILL after kill-timeout
```
Though I haven't been able to make a simple reproducer for the failure above. What am I doing wrong?
I think for now I'll just switch to the loop so I can move the PR forward...
Can't see anything obvious. Perhaps there's something racy internally I just can't see. Are you running this out of master?
No, this is on my `job-kill` branch.
Based on a suggestion from @garlick, there is no way this will work. The working case may only be accidentally working because of other commits happening on the kvs main namespace at the time.
When I thought about this last night, my gut feeling was that `--full` doesn't actually do anything with `--waitcreate`, or there is some corner case there that is missed.
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 14 days. Thank you for your contributions.
In a test I was perhaps foolishly trying to synchronize with a test job via a kvs entry and getting occasional hangs. The following initial program is a simple reproducer:
This script will hang about once every 10 runs with `start -s1` and once every 4 or so runs with `-s4`. It may be that my usage is inherently racy with creation of the namespace link, but in talking with @garlick it sounded like perhaps that was supposed to be handled in `WAITCREATE`. Either way, I thought I'd report it.