Closed: MathieuBordere closed this issue 1 year ago
What's going on immediately before the crash: n1 is a standby and starts to finalize an open segment ending with index 674. Then it's converted to being a voter, drops from FOLLOWER to UNAVAILABLE, and decides to truncate its log because it previously saw a mismatch at index 674. There is a rash of I/O failures that are ignored because they originated with the finalization attempt that started when n1 was still a follower/non-voter. The last log line before the assertion failure is
src/uv_finalize.c:39 finalize open-7 into 0000000000000674-0000000000000000
The fact that the end index is 0 looks ominous to me. Is that supposed to happen?
Overall, I think what's happening is some kind of contention between an incomplete finalization, a truncation, and a second finalization.
Not quite sure about the rest, since it looks like a tricky scenario, but no, this is not supposed to happen.
Taking another look at this, it seems that last_index will be zero whenever we finalize a segment that hasn't actually been written to -- which is why the used > 0 check is there. It could still have something to do with what's going wrong, but I bet there's more to the story.
Revised sequence of events, after spending some more time with the code:
It's possible to see from the log that the truncate went through, because the last closed segment left after the crash ends with index 673.
Going to try to write an integration test that reproduces this error, otherwise it will be pretty hard to debug.
This happened again here (thank you, Jepsen!). There's a backtrace this time but it doesn't add much information. Relevant bit of the app log:
LIBRAFT 1670437954304014556 src/replication.c:1005 log mismatch -> truncate (3195)
LIBRAFT 1670437954304043356 src/uv_prepare.c:164 create open segment open-4
[snip]
LIBRAFT 1670437954304130957 src/uv_finalize.c:39 finalize open-1 into 0000000000003145-0000000000003195
[snip]
LIBDQLITE 1670437954307156585 dqlite_node_stop:753 dqlite node stop
[snip]
LIBRAFT 1670437954334910938 src/uv_prepare.c:230 completed creation of open-4
LIBDQLITE 1670437954334921938 conn__stop:330 conn stop
LIBDQLITE 1670437954334924938 gateway__close:80 gateway close
LIBRAFT 1670437954334977038 src/convert.c:38 clear follower state
LIBRAFT 1670437954334981638 src/convert.c:23 old_state:1 new_state:0
LIBRAFT 1670437954335139040 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335142840 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335146140 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335148240 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335150640 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335152640 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335155040 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335157040 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335159640 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335161740 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335164140 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335170040 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335172640 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335174640 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335176940 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335179040 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT 1670437954335181340 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT 1670437954335183340 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBDQLITE 1670437954335196240 impl_close:208 impl close
LIBRAFT 1670437954335225640 src/uv_finalize.c:39 finalize open-2 into 0000000000003195-0000000000000000
LIBRAFT 1670437954335159040 src/uv_list.c:92 segment 0000000000002019-0000000000002128
LIBRAFT 1670437954335631244 src/uv_list.c:92 segment 0000000000002129-0000000000002256
LIBRAFT 1670437954335637044 src/uv_list.c:92 segment 0000000000002257-0000000000002384
LIBRAFT 1670437954335640344 src/uv_list.c:92 segment 0000000000002385-0000000000002488
LIBRAFT 1670437954335643244 src/uv_list.c:92 segment 0000000000002489-0000000000002511
LIBRAFT 1670437954335646044 src/uv_list.c:92 segment 0000000000002512-0000000000002638
LIBRAFT 1670437954335648844 src/uv_list.c:92 segment 0000000000002639-0000000000002708
LIBRAFT 1670437954335651744 src/uv_list.c:92 segment 0000000000002709-0000000000002836
LIBRAFT 1670437954335654644 src/uv_list.c:92 segment 0000000000002837-0000000000002964
LIBRAFT 1670437954335657544 src/uv_list.c:92 segment 0000000000002965-0000000000002982
LIBRAFT 1670437954335660244 src/uv_list.c:92 segment 0000000000002983-0000000000003020
LIBRAFT 1670437954335663144 src/uv_list.c:92 segment 0000000000003021-0000000000003092
LIBRAFT 1670437954335666044 src/uv_list.c:92 segment 0000000000003093-0000000000003144
LIBRAFT 1670437954335677145 src/uv_list.c:92 segment 0000000000003145-0000000000003195
LIBRAFT 1670437954335680145 src/uv_list.c:97 ignore cluster.yaml
LIBRAFT 1670437954335682745 src/uv_list.c:97 ignore info.yaml
LIBRAFT 1670437954335685245 src/uv_list.c:71 ignore metadata1
LIBRAFT 1670437954335687545 src/uv_list.c:71 ignore metadata2
LIBRAFT 1670437954335690545 src/uv_list.c:92 segment open-2
LIBRAFT 1670437954335693445 src/uv_list.c:92 segment open-4
LIBRAFT 1670437954335696545 src/uv_list.c:97 ignore snapshot-23-2964-752890
LIBRAFT 1670437954335708245 src/uv_list.c:82 snapshot snapshot-23-2964-752890.meta
LIBRAFT 1670437954335711345 src/uv_list.c:97 ignore snapshot-25-3092-764473
LIBRAFT 1670437954335717945 src/uv_list.c:82 snapshot snapshot-25-3092-764473.meta
LIBRAFT 1670437954335726545 src/uv_segment.c:1084 truncate 3145-3195 at 3195
0x7f43b53514d9 uvMaybeFireCloseCb
src/uv.c:197
0x7f43b53514d9 uvMaybeFireCloseCb
src/uv.c:157
0x7f43b53ba9e4 uv__work_done
src/threadpool.c:313
0x7f43b53bde07 uv__async_io
src/unix/async.c:147
0x7f43b53ceaaf uv__io_poll
src/unix/linux-core.c:431
0x7f43b53be7ab uv_run
src/unix/core.c:375
0x7f43b574838f taskRun
src/server.c:632
0x7f43b574838f taskStart
src/server.c:658
0x7f43b58e1608 ???
???:0
0x7f43b5655132 ???
???:0
0xffffffffffffffff ???
???:0
app: src/uv.c:197: uvMaybeFireCloseCb: Assertion `uv->truncate_work.data == NULL' failed.
It's a similar sequence of events: we start truncating an entry that belongs to an open segment, then concurrently we start finalizing that segment, and then we start shutting down the raft node on top of that.
Having trouble reproducing this (Jepsen is not cooperating), but my working hypothesis is that uv->barrier is getting cleared by something else -- maybe the finalization job? -- other than uvTruncateAfterWorkCb, so that the inference "barrier cleared => truncate work has finished" doesn't hold. See the commit that introduced this assertion, https://github.com/canonical/raft/commit/81239d869910b6f29dbc8b029d30fd142607cb1c.
Okay, I finally reproduced this in a local Jepsen run with some additional tracing enabled, and I think I can see what's going on:
dqlite_node_stop is called, which eventually leads to calling uvBarrierClose and this code:
The barrier from the truncate task is still set, so we run uvTruncateBarrierCb again. This time it notices that the node is shutting down and does nothing. uvBarrierClose clears uv->barrier, but uv->truncate_work.data is still non-NULL, because uvTruncateAfterWorkCb hasn't run yet.
Still thinking about the best way to fix this, I want to understand better all the different ways that uv->barrier is used.
This might be a fundamental issue with uvBarrierClose. The logic of that function seems to be that we can cancel any outstanding barriers by invoking their callbacks and then clearing the struct fields that refer to those barriers; this would work if uv->barrier etc. were always cleared when we run the callback passed to UvBarrier. But in practice, the truncate operation doesn't call UvUnblock until much later, when the "after work" callback is invoked -- and the snapshot operation works the same way. So there's a potentially big window where the barrier callback has already been invoked once (so we shouldn't invoke it again), but the barrier is still "live" and there's work waiting on libuv's queue that will clear it when it's finished. @freeekanayaka @MathieuBordere does that analysis make sense to you? Any potential solutions come to mind?
Closing since I believe this was addressed by #334
Observed during Jepsen tests: Run Artifact