Open garlick opened 3 months ago
I've been trying to reproduce this via the setup described in #6010 but haven't been able to. Wondering if
A) there's another bug going on at the same time
B) the setup in #6010 only makes the statedir run out of space, whereas in this issue the entire disk appears to have been full, with other possible fallout leading to an unhappy broker.
Edit: I lightly investigated whether it would be possible to test/reproduce case 'B' in docker, but to no avail.
> the entire disk was full, leading to other possible fallout leading to an unhappy broker.
The other directory the broker may write to is rundir, but I think just at startup if statedir is defined.
> The other directory the broker may write to is rundir, but I think just at startup if statedir is defined.
I had tried that as well and didn't get a similar failure
flux start --test-size=4 -o,-Sstatedir=/test/tmpfs-5m/mydir -o,-Srundir=/test/tmpfs-5m/mydir
While working on #6100 I came up with a theory for why this can happen. If some KVS transaction is doing a KVS_SYNC, that KVS commit can never complete on a full disk, so it can "block" the KVS commits that follow it. There may be other similar things that can happen in the KVS, although that's the only one I've identified at the moment. It should be fairly easy to reproduce.
If proven, solution for #6124 is perhaps in order ... or we need to error out on ENOSPC.
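To make the theory concrete, here is a toy model of the suspected behavior. It is not flux-core code: `ToyCommitQueue`, `disk_full`, and `fail_on_enospc` are hypothetical names, and the model simply assumes that commits are processed strictly in order, that only a sync commit has to reach the disk before it completes, and that non-sync commits complete from memory.

```python
import errno
from collections import deque


class ToyCommitQueue:
    """Toy model (not flux-core): commits complete strictly in submission order."""

    def __init__(self, disk_full=False, fail_on_enospc=False):
        self.disk_full = disk_full            # assumption: backing store is out of space
        self.fail_on_enospc = fail_on_enospc  # hypothetical "error out on ENOSPC" switch
        self.pending = deque()                # (name, sync) tuples not yet completed
        self.results = {}                     # name -> "ok" or an errno name

    def submit(self, name, sync=False):
        self.pending.append((name, sync))
        self.process()

    def process(self):
        while self.pending:
            name, sync = self.pending[0]
            if sync and self.disk_full:
                if not self.fail_on_enospc:
                    # The head of the queue can never complete, so every
                    # commit queued behind it waits forever.
                    return
                self.results[name] = errno.errorcode[errno.ENOSPC]
            else:
                self.results[name] = "ok"
            self.pending.popleft()


# Mirrors the sequence "put a=1", "put --sync a=2", "put b=1" on a full disk.
hanging = ToyCommitQueue(disk_full=True)
hanging.submit("a=1")
hanging.submit("a=2", sync=True)
hanging.submit("b=1")

failing = ToyCommitQueue(disk_full=True, fail_on_enospc=True)
failing.submit("a=1")
failing.submit("a=2", sync=True)
failing.submit("b=1")
```

In the first queue, `a=2` and `b=1` stay in `pending` indefinitely. In the second, the sync commit is failed with ENOSPC and the commit behind it goes through, which is the "error out on ENOSPC" option in miniature.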
OK, I think I've proven that the FLUX_KVS_SYNC flag (or some variant of this) is the cause of the hang. Using the /test/tmpfs-1m from #6127:
rm -rf /test/tmpfs-1m/statedir/*
src/cmd/flux start -o,-Sstatedir=/test/tmpfs-1m/statedir -o,-Sbroker.rc3_path=
<inside the instance>
flux submit --cc=1-1000 --wait echo 0123456789   # to fill up disk
flux kvs put a=1           # works
flux kvs put --sync a=2 &  # hangs
flux kvs put b=1           # hangs
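If this reproducer were scripted, the "hangs" steps would need a timeout so the script itself does not block. A minimal sketch in Python follows; `completes_within` is a hypothetical helper, and the timeout values are arbitrary assumptions. It only reports whether the command exited in time, not whether it succeeded.

```python
import subprocess
import sys


def completes_within(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it timed out.

    The exit status is ignored: a command that fails quickly still counts
    as "completed". subprocess.run kills the child when the timeout expires.
    """
    try:
        subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except subprocess.TimeoutExpired:
        return False


# Stand-in commands so the sketch runs anywhere; inside the instance one
# would pass e.g. ["flux", "kvs", "put", "--sync", "a=2"] instead.
quick = completes_within([sys.executable, "-c", "pass"], 30)
stuck = completes_within([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)
```

Note that this runs the commands one at a time, so it does not reproduce the backgrounded `--sync` put followed by a second put. That would need `subprocess.Popen` for the first command.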
The most likely source of a synced KVS write is probably the KVS checkpoint? Dunno if we have a checkpoint period set up on elcap.
Edit: confirmed there is a checkpoint period of 30m on elcap.
Problem: after the root filesystem ran out of space on elcap, kvs commits would hang
We observed that kvs, content, and content-sqlite were all responsive to pings and we could even load/store blobs, but flux kvs put a=b and flux content flush would hang.
Reloading the kvs module caused it to become responsive again (but of course that lost all the namespaces for running jobs).
Reloading content-sqlite seemed to have no effect on stuck cache entries, and flux content flush still hangs.
These dirty entries seem like they are going to stay dirty forever.
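As a rough illustration of why dirty entries could stay dirty, here is a toy write-back cache. It is not the content module's implementation: `ToyWriteBackCache`, `full_disk_store`, and `working_store` are hypothetical names, and unlike the real `flux content flush`, which hangs, this `flush` returns the number of entries it could not write.

```python
import errno


class ToyWriteBackCache:
    """Toy model: entries stay dirty until the backing store accepts them."""

    def __init__(self, store):
        self.store = store  # callable(key, data); may raise OSError
        self.dirty = {}     # key -> data not yet written to the backing store

    def put(self, key, data):
        self.dirty[key] = data

    def flush(self):
        """Try to write every dirty entry; return how many are still dirty."""
        for key in list(self.dirty):
            try:
                self.store(key, self.dirty[key])
            except OSError as exc:
                if exc.errno != errno.ENOSPC:
                    raise
                continue  # out of space: leave the entry dirty
            del self.dirty[key]
        return len(self.dirty)


def full_disk_store(key, data):
    """Backing store on a full disk: every write fails with ENOSPC."""
    raise OSError(errno.ENOSPC, "No space left on device")


saved = {}


def working_store(key, data):
    """Backing store with free space again."""
    saved[key] = data


cache = ToyWriteBackCache(full_disk_store)
cache.put("blob1", b"0123456789")
cache.put("blob2", b"abcdef")
first_flush = cache.flush()     # nothing can be written
second_flush = cache.flush()    # retrying does not help while the disk is full
cache.store = working_store     # space becomes available
final_flush = cache.flush()
```

In this model the dirty count only drops once the store can accept writes again, no matter how many times the flush is retried in the meantime.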