grondo opened this issue 5 years ago
Hmmm, internally, the `kvs-watch` module watches for the exact key path being written to. I think the issue may be that you are writing to the key `test0` but waiting on the full key path `job.ID.guest.test0`. The `kvs_copy` that is done underneath the covers by `job-exec` copies the namespace to `job.ID.guest`, so the creation of the key `job.ID.guest.test0` is never seen.
I think the only reason it passes sometimes is because the namespace has already been copied over fully before the `--waitcreate` is called (i.e. the job has completely finished and has exited).
I tried doing:

```sh
flux kvs get --waitcreate --watch --full -c1 ${kvsdir}.guest.test0
```

and no longer got any hangs. The `--watch --full -c1` basically gets the first value from `test0` no matter what keys are specified (it does a full lookup on every setroot event that's sent out). The `--waitcreate` simply will wait if the key doesn't exist at first.
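Concretely, the two forms being compared look like this (a minimal sketch; the key path is the one used in the reproducer further down):

```sh
# Exact-path watch: the kvs-watch module only notifies when this specific
# path is the key being written, so a key that only appears via a copy of
# the whole guest directory is never reported.
flux kvs get --waitcreate ${kvsdir}.guest.test0

# Full watch: the lookup is redone on every commit and the first value
# found is returned, so keys that appear via a directory-level copy or
# link are also picked up.
flux kvs get --waitcreate --watch --full -c1 ${kvsdir}.guest.test0
```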
Perhaps this is undesirable (it took me some time to figure out) and we need to think of something that is less confusing to users?
Thinking about this a tad more, perhaps allowing `--waitcreate` and `--full` to be specified together would make things better? Because right now that isn't allowed. Skimming through the code, I think it is supported in the `kvs-watch` module, but `--waitcreate` and `--full` together just aren't allowed in the `flux-kvs` command.
Now that you reminded me how it works, the result makes sense.
> I think the only reason it passes sometimes is because the namespace has already been copied over fully before the `--waitcreate` is called (i.e. the job has completely finished and has exited).
This makes sense, but my original testcase was more like the script below (we wait to synchronize with the job, then cancel it), and it was sometimes successful:
```sh
#!/bin/sh
size=$(flux getattr size)
id=$(flux jobspec srun -N$size \
        sh -c 'flux kvs put test${FLUX_TASK_RANK}=started && sleep 60' \
        | flux job submit)
kvsdir=$(flux job id --to=kvs $id)
flux kvs get --waitcreate ${kvsdir}.guest.test0
flux job cancel $id
flux job wait-event -v ${id} clean
```
```
$ for i in `seq 1 10`; do echo Run $i; ./src/cmd/flux start -s4 sh -x ./test.sh; done
Run 1
flux-start: warning: setting --bootstrap=selfpmi due to --size option
+ flux getattr size
+ size=4
+ flux jobspec srun -N4 sh+ -c flux kvs put test${FLUX_TASK_RANK}=started && sleep 60
flux job submit
+ id=7600078848
+ flux job id --to=kvs 7600078848
+ kvsdir=job.0000.0001.c500.0000
+ flux kvs get --waitcreate job.0000.0001.c500.0000.guest.test0
started
+ flux job cancel 7600078848
+ flux job wait-event -v 7600078848 clean
1563636326.996614 submit userid=1000 priority=16 flags=0
1563636327.010464 depend
1563636327.013440 alloc note="rank[0-3]/core0"
1563636327.028394 start
1563636327.142972 exception type="cancel" severity=0 userid=1000 note=""
1563636327.148500 finish status=36608
1563636327.155994 release ranks="all" final=true
1563636327.156783 free
1563636327.156801 clean
Run 2
flux-start: warning: setting --bootstrap=selfpmi due to --size option
+ flux getattr size
+ size=4
+ flux jobspec srun -N4 sh -c flux kvs put test${FLUX_TASK_RANK}=started && sleep 60
+ flux job submit
+ id=7549747200
+ flux job id --to=kvs 7549747200
+ kvsdir=job.0000.0001.c200.0000
+ flux kvs get --waitcreate job.0000.0001.c200.0000.guest.test0
```
Maybe another subtlety I didn't notice happening here? (Sorry if I'm just being dumb)
> Perhaps this is undesirable (it took me some time to figure out) and we need to think of something that is less confusing to users?
Yeah, I'd say at this point users might stumble across this kind of subtlety all the time.
I think with a flag like `kvs get --waitcreate` it should probably do the safe, slow thing by default, and enable more efficient subtle modes with extra flags. This is just because of what the flag seems to imply to users. Your suggestion of adding "full" support to waitcreate and doing that by default sounds good.
Another observation is that it is very useful (and therefore tempting) to use the kvs as synchronization for part of a job like this. Maybe eventually we need a canned API and tool to do this the "right way", implemented efficiently under the covers. Especially if the KVS isn't the right approach.
One other result: if I add a `flux kvs dir -R ${kvsdir}.guest` before the `flux kvs get --waitcreate ...`, then I can't get the test to fail anymore... or failure becomes much more rare.
(reproducer cleaned up to be less verbose):
```sh
#!/bin/sh
size=$(flux getattr size)
id=$(flux jobspec srun -N$size \
        sh -c 'flux kvs put test${FLUX_TASK_RANK}=started && sleep 60' \
        | flux job submit)
kvsdir=$(flux job id --to=kvs $id)
flux kvs dir -R ${kvsdir}.guest >/dev/null
flux kvs get --waitcreate ${kvsdir}.guest.test0
echo "synchronized with $id"
flux job cancel $id
flux job wait-event ${id} clean
```
```
$ for i in `seq 1 10`; do echo Run $i; ./src/cmd/flux start -s4 --bootstrap=selfpmi ./test.sh; done
Run 1
started
synchronized with 7751073792
1563637453.106305 clean
Run 2
started
synchronized with 7734296576
1563637455.350194 clean
Run 3
started
synchronized with 7767851008
1563637457.616599 clean
Run 4
started
synchronized with 7734296576
1563637459.866584 clean
Run 5
started
synchronized with 7633633280
1563637462.137419 clean
Run 6
started
synchronized with 7700742144
1563637464.386885 clean
Run 7
started
synchronized with 7650410496
1563637466.615579 clean
Run 8
started
synchronized with 7700742144
1563637468.874275 clean
Run 9
started
synchronized with 7784628224
1563637471.141163 clean
Run 10
started
synchronized with 7633633280
1563637473.368491 clean
```
Without the `flux kvs dir -R` the test hangs on the first run.
> Maybe another subtlety I didn't notice happening here? (Sorry if I'm just being dumb)
Hmmm, yeah. I would think with the `sleep 60` it would always hang. Not sure how it could sometimes be successful.
> I think with a flag like `kvs get --waitcreate` it should probably do the safe, slow thing by default, and enable more efficient subtle modes with extra flags. This is just because of what the flag seems to imply to users. Your suggestion of adding "full" support to waitcreate and doing that by default sounds good.
As another idea, perhaps for the `flux-kvs` command we enable `--full` all of the time since it's the common user interface? For users of the API, they have to understand what to do?
Wait a second. I thought after the job completes, the private KVS namespace is moved to the guest key. But in your reproducer you cancel a jobid that already has access to the guest namespace.
Perhaps my assumption about the former point isn't correct. The guest namespace can be created earlier than I thought?
> As another idea, perhaps for the `flux-kvs` command we enable `--full` all of the time since it's the common user interface? For users of the API, they have to understand what to do?
That seems like a good first-order compromise to me, though I have to admit I'm a bit out of my depth at this time and would certainly defer to you and @garlick.
> Wait a second. I thought after the job completes, the private KVS namespace is moved to the guest key. But in your reproducer you cancel a jobid that already has access to the guest namespace.
The job-exec module creates the guest namespace and creates a link to it from `job.<id>.guest`. This is guaranteed to have completed before the `start` event is emitted for the job. Once the job is complete, the link is replaced with a copy of the guest namespace, making it read-only.
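One way to observe that transition from the primary namespace is to look at the treeobj for the guest key before and after the job finishes. This is just a sketch; I'm assuming the link shows up as a `symlink` treeobj and the final copy as a `dirref`:

```sh
kvsdir=$(flux job id --to=kvs $id)

# While the job is running, job.<id>.guest should resolve to a link
# into the per-job guest namespace:
flux kvs get --treeobj ${kvsdir}.guest

# After the clean event, the link has been replaced with a read-only
# copy of the namespace contents:
flux job wait-event ${id} clean
flux kvs get --treeobj ${kvsdir}.guest
```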
> The job-exec module creates the guest namespace and creates a link to it from `job.<id>.guest`. This is guaranteed to have completed before the `start` event is emitted for the job.
Ahhh, I was thinking it was only created with the move at the end.
If the initial check fails on `flux kvs get --waitcreate ${kvsdir}.guest.test0`, we know we'll never see this key. So there's probably just a bit of raciness with the initial create of that link. Perhaps the `flux kvs dir -R` slows things down enough to make things work every time. Last night, I was unable to reproduce by just sticking a few `flux kvs version` calls into your reproducer.
I added a `flux job wait-event ${id} start` to the test (I thought I had tried this before), and now I haven't seen a failure. This ensures that the guest namespace is created and linked before doing the `flux kvs get --waitcreate`. Perhaps that is also sufficient?
```sh
#!/bin/sh
size=$(flux getattr size)
id=$(flux jobspec srun -N$size \
        sh -c 'flux kvs put test${FLUX_TASK_RANK}=started && sleep 60' \
        | flux job submit)
kvsdir=$(flux job id --to=kvs $id)
#flux kvs dir -R ${kvsdir}.guest >/dev/null
flux job wait-event ${id} start
flux kvs get --waitcreate ${kvsdir}.guest.test0
echo "synchronized with $id"
flux job cancel $id
flux job wait-event ${id} clean
```
> If the initial check fails on `flux kvs get --waitcreate ${kvsdir}.guest.test0`, we know we'll never see this key.
I'm probably missing something, but isn't that the purpose of the `--waitcreate` flag, to wait for a key that may not exist at the first check?
> I'm probably missing something, but isn't that the purpose of the `--waitcreate` flag, to wait for a key that may not exist at the first check?
Yes, but `kvs-watch` only checks exact matches on paths. We know that the specific path `job.id.guest.test0` will never be created; only `test0` and `job.id.guest` are created.
> We know that the specific path `job.id.guest.test0` will never be created
I need a tutorial :laughing: How can we know `job.id.guest.test0` will never be created? What if I delete the link and then run `flux kvs put job.id.guest.test0=1`?
> I need a tutorial :laughing: How can we know `job.id.guest.test0` will never be created?
:-) I'm probably not explaining it well.
The `kvs-watch` module (without the `--full` option) matches changes on the exact paths specified. So for example:
```sh
flux kvs get --watch a.b
flux kvs put a.b=1
```
the change to `a.b` would get detected, because we are specifically changing the key `a.b`.
```sh
flux kvs get --watch a.b &
flux kvs put c.b=2
TREEOBJ=$(flux kvs get --treeobj c)
flux kvs put --treeobj a=$TREEOBJ
```
In this latter example, we've changed the value of `a.b`, but it's through an overwrite of the entire directory `a`. To the kvs-watch module, only `a` has been changed, not `a.b`. Because the specific key `a.b` is not specified in this change, the change will be missed by the watch.
`--full` would detect this change. `--full` basically makes the watch heavyweight: anytime any change occurs in the KVS, a lookup is done to see if `a.b` has changed. This was the way watch was implemented in the past, but it was far too heavyweight.
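Continuing the example above, the directory-level overwrite that the plain watch misses should be reported when `--full` is used (a sketch with the same hypothetical keys `a` and `a.b`):

```sh
# With --full, every commit to the namespace triggers a fresh lookup of
# a.b, and the new value is printed whenever it changes, regardless of
# which key was actually written.  (Runs until interrupted.)
flux kvs get --watch --full a.b &

flux kvs put c.b=2
TREEOBJ=$(flux kvs get --treeobj c)
flux kvs put --treeobj a=$TREEOBJ   # changes a.b via an overwrite of "a"
```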
> What if I delete the link and then run `flux kvs put job.id.guest.test0=1`?
So this should work.
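For instance, something like the following should be noticed by an exact-path watcher, since the put names the full path explicitly (just a sketch with the same illustrative path; `flux kvs unlink` removes the link first):

```sh
# Watcher waits on the exact path.
flux kvs get --watch --waitcreate job.id.guest.test0 &

# Remove the link and write the exact key path; because the specific
# key being watched is now the one that changes, the watch sees it.
flux kvs unlink job.id.guest
flux kvs put job.id.guest.test0=1
```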
> Thinking about this a tad more, perhaps allowing `--waitcreate` and `--full` to be specified together would make things better? Because right now that isn't allowed. Skimming through the code, I think it is supported in the `kvs-watch` module, but `--waitcreate` and `--full` together just aren't allowed in the `flux-kvs` command.
I tried to add this support this morning, but it seems the `kvs-watch` module doesn't grok `WAITCREATE` and `FULL` flags at the moment. I'm going to go ahead and create an issue for it.
Hm, I'm still getting a failure related to this issue and I don't quite understand what I'm doing wrong (probably something dumb!)
I have the following testcase, which fails to synchronize with the "job" when using `flux kvs get --watch --waitcreate --full -c1 ...`, but succeeds if I replace this command with a call to `flux kvs get` in a loop (please excuse the excess of debug lines):
```sh
kvs_get_retry() {
    id=$1
    key=$(job_kvsdir $id).$2
    i=0
    flux kvs get $key
    while [ $? -ne 0 ] && [ $i -lt 25 ]; do
        i=$((i+1))
        echo >&2 "try $i"
        flux dmesg | grep $id >&2
        flux kvs get $key
    done
    if [ "$i" -eq "25" ]; then return 1; fi
    return 0
}
```
```sh
test_expect_success 'job-exec: job exception uses SIGKILL after kill-timeout' '
flux setattr job-exec.kill_timeout 0.2 &&
cat <<-EOF >trap-sigterm.sh &&
#!/bin/dash
trap "echo trap-sigterm got SIGTERM" 15
flux kvs put trap-ready=1
sleep 60 &
pid=\$!
wait \$pid
EOF
chmod +x trap-sigterm.sh &&
id=$(TRAP=15 flux jobspec srun -N4 ./trap-sigterm.sh \
    | flux job submit) &&
flux job wait-event -vt 1 $id start &&
test_debug "echo waiting for trap-ready in kvs" &&
#kvs_get_retry $id guest.trap-ready &&
run_timeout 5 flux kvs get -c1 --watch --full --waitcreate \
    $(job_kvsdir $id).guest.trap-ready &&
flux job cancel $id &&
test_debug "echo waiting for $id clean" &&
flux job wait-event -vt 1 $id clean &&
flux dmesg | grep $id > trap.out &&
test_debug "cat trap.out" &&
grep "trap-sigterm got SIGTERM" trap.out &&
flux setattr --expunge job-exec.kill_timeout
'
```
The version above hangs in `flux kvs get --watch ...`:
```
expecting success:
flux setattr job-exec.kill_timeout 0.2 &&
cat <<-EOF >trap-sigterm.sh &&
#!/bin/dash
trap "echo trap-sigterm got SIGTERM" 15
flux kvs put trap-ready=1
sleep 60 &
pid=\$!
wait \$pid
EOF
chmod +x trap-sigterm.sh &&
id=$(TRAP=15 flux jobspec srun -N4 ./trap-sigterm.sh \
| flux job submit) &&
flux job wait-event -vt 1 $id start &&
test_debug "echo waiting for trap-ready in kvs" &&
#kvs_get_retry $id guest.trap-ready &&
run_timeout 5 flux kvs get -c1 --watch --full --waitcreate \
$(job_kvsdir $id).guest.trap-ready &&
flux job cancel $id &&
test_debug "echo waiting for $id clean" &&
flux job wait-event -vt 1 $id clean &&
flux dmesg | grep $id > trap.out &&
test_debug "cat trap.out" &&
grep "trap-sigterm got SIGTERM" trap.out &&
flux setattr --expunge job-exec.kill_timeout
1563916681.164301 submit userid=1000 priority=16 flags=0
1563916681.177734 depend
1563916681.180455 alloc note="rank[0-3]/core0"
1563916681.192677 start
waiting for trap-ready in kvs
Alarm clock
not ok 8 - job-exec: job exception uses SIGKILL after kill-timeout
```
If I switch to the kvs-get-in-a-loop version, the test succeeds:
```
expecting success:
flux setattr job-exec.kill_timeout 0.2 &&
cat <<-EOF >trap-sigterm.sh &&
#!/bin/dash
trap "echo trap-sigterm got SIGTERM" 15
flux kvs put trap-ready=1
sleep 60 &
pid=\$!
wait \$pid
EOF
chmod +x trap-sigterm.sh &&
id=$(TRAP=15 flux jobspec srun -N4 ./trap-sigterm.sh \
| flux job submit) &&
flux job wait-event -vt 1 $id start &&
test_debug "echo waiting for trap-ready in kvs" &&
kvs_get_retry $id guest.trap-ready &&
#run_timeout 5 flux kvs get -c1 --watch --full --waitcreate \
# $(job_kvsdir $id).guest.trap-ready &&
flux job cancel $id &&
test_debug "echo waiting for $id clean" &&
flux job wait-event -vt 1 $id clean &&
flux dmesg | grep $id > trap.out &&
test_debug "cat trap.out" &&
grep "trap-sigterm got SIGTERM" trap.out &&
flux setattr --expunge job-exec.kill_timeout
1563916765.502804 submit userid=1000 priority=16 flags=0
1563916765.516373 depend
1563916765.519073 alloc note="rank[0-3]/core0"
1563916765.532530 start
waiting for trap-ready in kvs
flux-kvs: job.0000.0009.a100.0000.guest.trap-ready: No such file or directory
try 1
2019-07-23T21:19:25.517319Z sched-simple.debug[0]: req: 41355837440: spec={4,4,1}
2019-07-23T21:19:25.517430Z sched-simple.debug[0]: alloc: 41355837440: rank[0-3]/core0
2019-07-23T21:19:25.660029Z job-exec.info[0]: 41355837440: 0: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.671528Z job-exec.info[0]: 41355837440: 1: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.673476Z job-exec.info[0]: 41355837440: 3: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.695626Z job-exec.info[0]: 41355837440: 2: STDOUT: Running ./trap-sigterm.sh
1
waiting for 41355837440 clean
1563916765.502804 submit userid=1000 priority=16 flags=0
1563916765.516373 depend
1563916765.519073 alloc note="rank[0-3]/core0"
1563916765.532530 start
1563916765.774688 exception type="cancel" severity=0 userid=1000 note=""
1563916765.780810 finish status=36608
1563916765.786882 release ranks="all" final=true
1563916765.787584 free
1563916765.787603 clean
2019-07-23T21:19:25.517319Z sched-simple.debug[0]: req: 41355837440: spec={4,4,1}
2019-07-23T21:19:25.517430Z sched-simple.debug[0]: alloc: 41355837440: rank[0-3]/core0
2019-07-23T21:19:25.660029Z job-exec.info[0]: 41355837440: 0: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.671528Z job-exec.info[0]: 41355837440: 1: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.673476Z job-exec.info[0]: 41355837440: 3: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.695626Z job-exec.info[0]: 41355837440: 2: STDOUT: Running ./trap-sigterm.sh
2019-07-23T21:19:25.774887Z job-exec.debug[0]: exec aborted: id=41355837440
2019-07-23T21:19:25.774969Z job-exec.debug[0]: exec_kill: 41355837440: signal 15
2019-07-23T21:19:25.775366Z job-exec.info[0]: 41355837440: 0: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.777350Z job-exec.info[0]: 41355837440: 0: STDOUT: got signal 15
2019-07-23T21:19:25.777473Z job-exec.info[0]: 41355837440: 2: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779042Z job-exec.info[0]: 41355837440: 2: STDOUT: got signal 15
2019-07-23T21:19:25.779217Z job-exec.info[0]: 41355837440: 1: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779246Z job-exec.info[0]: 41355837440: 1: STDOUT: got signal 15
2019-07-23T21:19:25.779270Z job-exec.info[0]: 41355837440: 3: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779495Z job-exec.info[0]: 41355837440: 3: STDOUT: got signal 15
2019-07-23T21:19:25.775366Z job-exec.info[0]: 41355837440: 0: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.777473Z job-exec.info[0]: 41355837440: 2: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779217Z job-exec.info[0]: 41355837440: 1: STDOUT: trap-sigterm got SIGTERM
2019-07-23T21:19:25.779270Z job-exec.info[0]: 41355837440: 3: STDOUT: trap-sigterm got SIGTERM
ok 8 - job-exec: job exception uses SIGKILL after kill-timeout
```
Though I haven't been able to make a simple reproducer for the failure above. What am I doing wrong?
I think for now I'll just switch to the loop so I can move the PR forward...
Can't see anything obvious. Perhaps there's something racy internally I just can't see. Are you running this out of master?
No, this is on my `job-kill` branch.
Based on a suggestion from @garlick, there is no way this will work. The working case may only be accidentally working because of other commits happening on the kvs main namespace at the time.
When I thought about this last night, my gut feeling was that `--full` doesn't actually do anything with `--waitcreate`, or there is some corner case there that is missed.
This issue has been automatically marked as stale because it has not had activity for 365 days. It will be closed if no further activity occurs within 14 days. Thank you for your contributions.
In a test I was perhaps foolishly trying to synchronize with a test job via a kvs entry and getting occasional hangs. The following initial program is a simple reproducer:
This script will hang about once every 10 runs with `start -s1` and once every 4 or so runs with `-s4`. It may be that my usage is inherently racy with creation of the namespace link, but in talking with @garlick it sounded like perhaps that was supposed to be handled in `WAITCREATE`. Either way, I thought I'd report it.