content: pending stores are stuck after ENOSPC from backing store

garlick commented 3 months ago

Problem: after the root filesystem ran out of space on elcap, kvs commits would hang.

We observed that kvs, content, and content-sqlite were all responsive to pings and we could even load/store blobs, but flux kvs put a=b and flux content flush would hang.

Reloading the kvs module caused it to become responsive again (but of course that lost all the namespaces for running jobs)

Reloading content-sqlite seemed to have no effect on the stuck cache entries, and flux content flush still hung:

[  +5.494872] broker[0]: rmmod content-sqlite
[  +5.495008] content[0]: content backing store: disabled
[  +5.495076] content[0]: 167 unflushables
[  +5.546204] broker[0]: module content-sqlite exited
[  +6.111085] broker[0]: insmod content-sqlite
[ +14.287310] content-sqlite[0]: /var/lib/flux/content.sqlite (397027 objects) journal_mode=WAL synchronous=NORMAL
[ +14.287613] content[0]: content backing store: enabled content-sqlite

These dirty entries seem like they are going to stay dirty forever.

{
 "count": 1108,
 "valid": 1108,
 "dirty": 167,
 "size": 2363238,
 "flush-batch-count": 0,
 "mmap": {
  "tags": {},
  "blobs": 0
 }
}
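To check whether the dirty count ever drops, the "dirty" field can be pulled out of stats output like the above and sampled repeatedly. This is a minimal sketch: dirty_count is a hypothetical helper name, it assumes the JSON arrives on stdin with one field per line as shown, and the stats command in the usage comment is an assumption since the issue does not say which command produced the output.

```shell
# Hypothetical helper: print the value of the "dirty" field from
# content stats JSON read on stdin (one field per line, as above).
dirty_count() {
    sed -n 's/.*"dirty"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Assumed usage (the stats command is not named in this issue):
#   flux module stats content | dirty_count
```

If the number printed stays the same across repeated runs after the backing store is reloaded, the entries are not being flushed.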

chu11 commented 2 months ago

I've been trying to reproduce this via the setup described in #6010 but haven't been able to. Wondering if

A) there's another bug going on at the same time

B) the setup in #6010 only makes the statedir run out of space, whereas in this issue it appears the entire disk was full, leading to other possible fallout leading to an unhappy broker.

Edit: I lightly investigated whether it would be possible to test/reproduce case 'B' in docker, but to no avail.

garlick commented 2 months ago

the entire disk was full, leading to other possible fallout leading to an unhappy broker.

The other directory the broker may write to is rundir, but I think just at startup if statedir is defined.

chu11 commented 2 months ago

The other directory the broker may write to is rundir, but I think just at startup if statedir is defined.

I had tried that as well and didn't get a similar failure

flux start --test-size=4 -o,-Sstatedir=/test/tmpfs-5m/mydir -o,-Srundir=/test/tmpfs-5m/mydir

chu11 commented 1 month ago

While working on #6100 I came up with a theory for why this can happen. If some KVS transaction is doing a KVS_SYNC, that kvs commit can never complete, and therefore it can "block" the KVS commits that follow it. There may be other similar things that can happen in the KVS, although that's the only one I've identified at the moment. Should be fairly easy to reproduce.

If proven, solution for #6124 is perhaps in order ... or we need to error out on ENOSPC.

chu11 commented 1 month ago

ok, I think I've proven that the FLUX_KVS_SYNC flag (or some variant of this) is the cause of the hang. Using the /test/tmpfs-1m from #6127:

rm -rf /test/tmpfs-1m/statedir/*
src/cmd/flux start -o,-Sstatedir=/test/tmpfs-1m/statedir -o,-Sbroker.rc3_path=
# inside the instance:
flux submit --cc=1-1000 --wait echo 0123456789  # to fill up disk
flux kvs put a=1            # works
flux kvs put --sync a=2 &   # hangs
flux kvs put b=1            # hangs
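When scripting a reproducer like the steps above, a hang can be told apart from an ordinary failure by running the command under a time limit. This is a minimal sketch: check_hang is a hypothetical helper name, the 10 second limit in the usage comment is an arbitrary choice, and it relies on coreutils timeout exiting with status 124 when the limit expires.

```shell
# Hypothetical helper: run a command under a time limit and report
# whether it finished, failed, or was killed for exceeding the limit.
check_hang() {
    limit=$1
    shift
    if timeout "$limit" "$@" >/dev/null 2>&1; then
        echo "ok: $*"
    else
        rc=$?
        if [ "$rc" -eq 124 ]; then
            # coreutils timeout exits 124 when the time limit expires
            echo "hang: $*"
        else
            echo "error($rc): $*"
        fi
    fi
}

# Assumed usage inside the instance, once the disk is full and the
# --sync put is already running in the background:
#   check_hang 10 flux kvs put b=1
```

The --sync put is left in the background as in the original steps rather than wrapped in the helper, since killing that client on timeout might change how the KVS treats the pending commit.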

The most likely source of a synced KVS write is probably the KVS checkpoint? Dunno if we have a checkpoint period set up on elcap.

Edit: confirmed there is a checkpoint period of 30m on elcap.

flux-framework / flux-core

content: pending stores are stuck after ENOSPC from backing store #5978