canonical / raft

Unmaintained C implementation of the Raft consensus protocol

Assertion Failure: src/uv.c:197: uvMaybeFireCloseCb: Assertion `uv->truncate_work.data == NULL' failed. #324

Closed MathieuBordere closed 1 year ago

MathieuBordere commented 1 year ago

Observed during Jepsen tests: Run Artifact

cole-miller commented 1 year ago

What's going on immediately before the crash: n1 is a standby and starts to finalize an open segment ending with index 674. Then it's converted to being a voter, drops from FOLLOWER to UNAVAILABLE, and decides to truncate its log because it previously saw a mismatch at index 674. There is a rash of I/O failures that are ignored because they originated with the finalization attempt that started when n1 was still a follower/non-voter. The last log line before the assertion failure is

src/uv_finalize.c:39 finalize open-7 into 0000000000000674-0000000000000000

The fact that the end index is 0 looks ominous to me. Is that supposed to happen?

Overall, I think what's happening is some kind of contention between an incomplete finalization, a truncation, and a second finalization.

freeekanayaka commented 1 year ago

> What's going on immediately before the crash: n1 is a standby and starts to finalize an open segment ending with index 674. Then it's converted to being a voter, drops from FOLLOWER to UNAVAILABLE, and decides to truncate its log because it previously saw a mismatch at index 674. There is a rash of I/O failures that are ignored because they originated with the finalization attempt that started when n1 was still a follower/non-voter. The last log line before the assertion failure is
>
> src/uv_finalize.c:39 finalize open-7 into 0000000000000674-0000000000000000
>
> The fact that the end index is 0 looks ominous to me. Is that supposed to happen?

Not quite sure about the rest, since it looks like a tricky scenario, but no, this is not supposed to happen.

cole-miller commented 1 year ago

Taking another look at this, it seems that last_index will be zero whenever we finalize a segment that hasn't actually been written to -- which is why the used > 0 check is there. Could still have something to do with what's going wrong, but I bet there's more to the story.
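
For context, a hedged illustration of why the end index comes out as zero for an untouched segment: the closed-segment name encodes the first and last entry index as 16-digit zero-padded numbers, and a segment that was prepared but never written to has no last entry to report. This is a hypothetical sketch, not the library's own formatting code:

```c
/* Minimal sketch, assuming closed-segment names are
 * "<first_index>-<last_index>" with 16-digit zero padding, as seen in the
 * traces in this issue. format_closed_segment_name is a hypothetical
 * helper for illustration, not the actual uv_finalize.c code. */
#include <stdio.h>

static void format_closed_segment_name(char *buf, size_t len,
                                       unsigned long long first_index,
                                       unsigned long long last_index)
{
    /* A segment that was never written to still has last_index == 0,
     * which is why the caller is expected to guard with `used > 0`. */
    snprintf(buf, len, "%016llu-%016llu", first_index, last_index);
}

int main(void)
{
    char name[64];
    format_closed_segment_name(name, sizeof name, 674, 0);
    printf("%s\n", name); /* prints 0000000000000674-0000000000000000 */
    return 0;
}
```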

cole-miller commented 1 year ago

Revised sequence of events, after spending some more time with the code:

  1. an APPEND_ENTRIES request prompts us to truncate away log entry 674
  2. we start finalizing the open segment containing entry 674
  3. raft_close is called
  4. the truncation request kicks off (uvSegmentTruncate)
  5. some outstanding append requests get cancelled due to the raft_close
  6. an open segment that hadn't yet been written to gets "finalized" (= we just delete it) -- note, the starting index of this segment is also 674, but I'm pretty sure it corresponds to one of those cancelled append requests and not to the entry that was truncated away from index 674 previously

It's possible to see from the log that the truncate went through, because the last closed segment left after the crash ends with index 673.

Going to try to write an integration test that reproduces this error, otherwise it will be pretty hard to debug.

cole-miller commented 1 year ago

This happened again here (thank you, Jepsen!). There's a backtrace this time but it doesn't add much information. Relevant bit of the app log:

LIBRAFT   1670437954304014556 src/replication.c:1005 log mismatch -> truncate (3195)
LIBRAFT   1670437954304043356 src/uv_prepare.c:164 create open segment open-4
[snip]
LIBRAFT   1670437954304130957 src/uv_finalize.c:39 finalize open-1 into 0000000000003145-0000000000003195
[snip]
LIBDQLITE 1670437954307156585 dqlite_node_stop:753 dqlite node stop
[snip]
LIBRAFT   1670437954334910938 src/uv_prepare.c:230 completed creation of open-4
LIBDQLITE 1670437954334921938 conn__stop:330 conn stop
LIBDQLITE 1670437954334924938 gateway__close:80 gateway close
LIBRAFT   1670437954334977038 src/convert.c:38 clear follower state
LIBRAFT   1670437954334981638 src/convert.c:23 old_state:1 new_state:0
LIBRAFT   1670437954335139040 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335142840 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335146140 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335148240 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335150640 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335152640 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335155040 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335157040 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335159640 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335161740 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335164140 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335170040 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335172640 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335174640 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335176940 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335179040 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBRAFT   1670437954335181340 src/replication.c:842 I/O completed on follower: status 13
LIBRAFT   1670437954335183340 src/replication.c:850 local server is not follower -> ignore I/O failure
LIBDQLITE 1670437954335196240 impl_close:208 impl close
LIBRAFT   1670437954335225640 src/uv_finalize.c:39 finalize open-2 into 0000000000003195-0000000000000000
LIBRAFT   1670437954335159040 src/uv_list.c:92 segment 0000000000002019-0000000000002128
LIBRAFT   1670437954335631244 src/uv_list.c:92 segment 0000000000002129-0000000000002256
LIBRAFT   1670437954335637044 src/uv_list.c:92 segment 0000000000002257-0000000000002384
LIBRAFT   1670437954335640344 src/uv_list.c:92 segment 0000000000002385-0000000000002488
LIBRAFT   1670437954335643244 src/uv_list.c:92 segment 0000000000002489-0000000000002511
LIBRAFT   1670437954335646044 src/uv_list.c:92 segment 0000000000002512-0000000000002638
LIBRAFT   1670437954335648844 src/uv_list.c:92 segment 0000000000002639-0000000000002708
LIBRAFT   1670437954335651744 src/uv_list.c:92 segment 0000000000002709-0000000000002836
LIBRAFT   1670437954335654644 src/uv_list.c:92 segment 0000000000002837-0000000000002964
LIBRAFT   1670437954335657544 src/uv_list.c:92 segment 0000000000002965-0000000000002982
LIBRAFT   1670437954335660244 src/uv_list.c:92 segment 0000000000002983-0000000000003020
LIBRAFT   1670437954335663144 src/uv_list.c:92 segment 0000000000003021-0000000000003092
LIBRAFT   1670437954335666044 src/uv_list.c:92 segment 0000000000003093-0000000000003144
LIBRAFT   1670437954335677145 src/uv_list.c:92 segment 0000000000003145-0000000000003195
LIBRAFT   1670437954335680145 src/uv_list.c:97 ignore cluster.yaml
LIBRAFT   1670437954335682745 src/uv_list.c:97 ignore info.yaml
LIBRAFT   1670437954335685245 src/uv_list.c:71 ignore metadata1
LIBRAFT   1670437954335687545 src/uv_list.c:71 ignore metadata2
LIBRAFT   1670437954335690545 src/uv_list.c:92 segment open-2
LIBRAFT   1670437954335693445 src/uv_list.c:92 segment open-4
LIBRAFT   1670437954335696545 src/uv_list.c:97 ignore snapshot-23-2964-752890
LIBRAFT   1670437954335708245 src/uv_list.c:82 snapshot snapshot-23-2964-752890.meta
LIBRAFT   1670437954335711345 src/uv_list.c:97 ignore snapshot-25-3092-764473
LIBRAFT   1670437954335717945 src/uv_list.c:82 snapshot snapshot-25-3092-764473.meta
LIBRAFT   1670437954335726545 src/uv_segment.c:1084 truncate 3145-3195 at 3195
0x7f43b53514d9 uvMaybeFireCloseCb
    src/uv.c:197
0x7f43b53514d9 uvMaybeFireCloseCb
    src/uv.c:157
0x7f43b53ba9e4 uv__work_done
    src/threadpool.c:313
0x7f43b53bde07 uv__async_io
    src/unix/async.c:147
0x7f43b53ceaaf uv__io_poll
    src/unix/linux-core.c:431
0x7f43b53be7ab uv_run
    src/unix/core.c:375
0x7f43b574838f taskRun
    src/server.c:632
0x7f43b574838f taskStart
    src/server.c:658
0x7f43b58e1608 ???
    ???:0
0x7f43b5655132 ???
    ???:0
0xffffffffffffffff ???
    ???:0
app: src/uv.c:197: uvMaybeFireCloseCb: Assertion `uv->truncate_work.data == NULL' failed.

It's a similar sequence of events: we start truncating an entry that belongs to an open segment, then concurrently we start finalizing that segment, and then we start shutting down the raft node on top of that.

cole-miller commented 1 year ago

Having trouble reproducing this (Jepsen is not cooperating), but my working hypothesis is that uv->barrier is getting cleared by something other than uvTruncateAfterWorkCb (maybe the finalization job?), so that the inference "barrier cleared => truncate work has finished" doesn't hold. See the commit that introduced this assertion, https://github.com/canonical/raft/commit/81239d869910b6f29dbc8b029d30fd142607cb1c.
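
To make that inference concrete, here is a hedged sketch using hypothetical struct and field names (not the actual src/uv.c code) of the invariant the assertion encodes and the interleaving that would break it; running it aborts with the same kind of assertion failure:

```c
/* Hedged sketch, hypothetical names: the assertion encodes the inference
 * "no pending barrier => no in-flight truncate work". If anything other
 * than the truncate after-work callback can clear the barrier, the close
 * path can observe a cleared barrier while the truncate job is still
 * pending, and the assertion fires. */
#include <assert.h>
#include <stddef.h>

struct uv_like {
    void *barrier;                        /* non-NULL while a barrier is pending */
    struct { void *data; } truncate_work; /* non-NULL while a truncate is queued */
};

static void maybe_fire_close_cb(struct uv_like *uv)
{
    if (uv->barrier != NULL) {
        return; /* still blocked, not safe to fire the close callback yet */
    }
    /* The inference under test: barrier cleared => truncate work finished. */
    assert(uv->truncate_work.data == NULL);
    /* ... fire the close callback ... */
}

int main(void)
{
    struct uv_like uv = {NULL, {NULL}};

    /* The bad interleaving: the barrier has already been cleared,
     * but the truncate job it was guarding has not completed yet. */
    uv.truncate_work.data = &uv; /* truncate still in flight */
    uv.barrier = NULL;           /* barrier already cleared   */
    maybe_fire_close_cb(&uv);    /* assertion failure here    */
    return 0;
}
```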

cole-miller commented 1 year ago

Okay, I finally reproduced this in a local Jepsen run with some additional tracing enabled, and I think I can see what's going on:

Still thinking about the best way to fix this; I want to understand better all the different ways that uv->barrier is used.

cole-miller commented 1 year ago

This might be a fundamental issue with uvBarrierClose. The logic of that function seems to be that we can cancel any outstanding barriers by invoking their callbacks and then clearing the struct fields that refer to those barriers; this would work if uv->barrier etc. were always cleared when we run the callback passed to UvBarrier. But in practice, the truncate operation doesn't call UvUnblock until much later, when the "after work" callback is invoked -- and the snapshot operation works the same way. So there's a potentially big window where the barrier callback has already been invoked once (so we shouldn't invoke it again), but the barrier is still "live" and there's work waiting on libuv's queue that will clear it when it's finished. @freeekanayaka @MathieuBordere does that analysis make sense to you? Any potential solutions come to mind?
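
For reference, the window is easy to see with plain libuv: a callback that queues work with uv_queue_work returns immediately, and the after-work callback that would mark the job finished only runs on a later loop iteration, so "the barrier callback has run" does not imply "the work it queued has completed". A minimal sketch with illustrative names (these are analogues, not the raft code's structures), assuming libuv is installed (cc demo.c -luv):

```c
/* Minimal, self-contained illustration of the window described above.
 * barrier_cb and truncate_work are illustrative analogues, NOT the raft
 * code: the point is only that queuing work from a callback and having
 * that work marked done are separated by at least one loop iteration. */
#include <stdio.h>
#include <uv.h>

static uv_work_t truncate_work; /* analogue of uv->truncate_work */

static void truncate_work_cb(uv_work_t *req)
{
    (void)req; /* threadpool: the actual truncation I/O would happen here */
}

static void truncate_after_work_cb(uv_work_t *req, int status)
{
    (void)status;
    req->data = NULL; /* only here is the work marked as finished */
    printf("after-work callback ran: truncate marked done\n");
}

/* Analogue of the callback passed to UvBarrier: once the barrier fires,
 * the truncation is queued on the threadpool. */
static void barrier_cb(void)
{
    truncate_work.data = (void *)1; /* mark truncation as in flight */
    uv_queue_work(uv_default_loop(), &truncate_work, truncate_work_cb,
                  truncate_after_work_cb);
}

int main(void)
{
    barrier_cb(); /* the barrier fires and queues the truncate work */

    /* A close path that treats the barrier as done right after its callback
     * ran would observe truncate_work.data != NULL here: "barrier callback
     * invoked" does not imply "truncate work finished". */
    printf("right after barrier cb: data = %p (still in flight)\n",
           truncate_work.data);

    uv_run(uv_default_loop(), UV_RUN_DEFAULT); /* drains the after-work cb */
    printf("after uv_run: data = %p (now done)\n", truncate_work.data);
    return 0;
}
```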

cole-miller commented 1 year ago

Closing since I believe this was addressed by #334