Pictures!
Leader's view of executing a write transaction in a one-node cluster (`./integration-test cluster/restart`):

From top to bottom, we have the `leader__exec` request, the `raft_apply` request, the new log entry, and the `UvAppend` request. You can see how long it took to execute the transaction initially, how long it took to write it to disk, and how long it took to apply it once committed.
Follower's view of handling an AppendEntries (`./integration-test cluster/dataOnNewNode`):

At the top is the main `appendFollower` request. Below that are 10 log entries, and at the bottom is the disk write for the entries.
Finally, a different view of AppendEntries handling on the follower's side. This one uses the fixture (`./raft-core-integration-test replication/recvRollbackConfigurationToInitial`), so the I/O-related state machines are simpler.

Here, the follower truncated an entry from its log because of the AppendEntries. You see the `appendFollower` request at the top, then the truncated entry, then the disk truncate, then the two entries that were included in the AppendEntries, then the disk append. You can see that the truncate and append operations are concurrent: `replicationApply` doesn't wait for the former to finish before kicking off the latter (something I hadn't noticed before I put together this PR). If we used the real I/O backend and instrumented a bit more, we'd see that `UvTruncate` sets a `UvBarrier` that blocks the `UvAppend` from running until the truncate has finished.
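To make that serialization easier to picture, here's a toy model of the barrier idea: a pending truncate raises a barrier, appends submitted while it is up wait in a queue, and completing the truncate drains the queue. The names and structure below are invented for illustration and don't mirror the real uv backend.

```c
/* Toy model (not the real uv backend) of how a barrier serializes a truncate
 * before later appends. */
#include <stdbool.h>
#include <stdio.h>

struct io_queue {
	bool barrier_up;    /* raised by a pending truncate */
	int queued_appends; /* appends waiting behind the barrier */
};

static void submit_truncate(struct io_queue *q)
{
	q->barrier_up = true;
	printf("truncate submitted, barrier raised\n");
}

static void submit_append(struct io_queue *q)
{
	if (q->barrier_up) {
		q->queued_appends++;
		printf("append queued behind barrier\n");
	} else {
		printf("append running\n");
	}
}

static void truncate_done(struct io_queue *q)
{
	q->barrier_up = false;
	printf("truncate finished, barrier lowered\n");
	while (q->queued_appends > 0) {
		q->queued_appends--;
		printf("queued append now running\n");
	}
}

int main(void)
{
	struct io_queue q = {false, 0};
	submit_truncate(&q);
	submit_append(&q); /* waits: the truncate is still in flight */
	truncate_done(&q); /* lets the queued append run */
	return 0;
}
```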
This PR introduces several state machines to track and instrument the behavior of existing dqlite code. Specifically, we instrument:
- executing a write transaction on the leader (`leader__exec`)
- appending a COMMAND entry to the raft log (`raft_apply`)
- handling AppendEntries on the follower (`replicationAppend`)
- in-memory log entries (`log.c`)
- disk appends in the uv I/O backend (`UvAppend`)
- disk truncations in the uv I/O backend (`UvTruncate`)

There are lots more things we could instrument, but I think this is a good starting point: it gives us enough visibility to follow a write transaction over its whole lifecycle and down to the disk I/O level, on both the follower and the leader. (A big missing piece is linking the histories across nodes; that's nontrivial because our raft messages don't include any kind of ID or room for extensibility, although we could fake something in the I/O fixture.)
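As a rough illustration of the pattern (not the actual sm API added in this PR), each instrumented request can embed a small state machine that gets stepped forward as the request makes progress, which is what lets a trace viewer show how long each stage took. The `struct sm`, `sm_init`, and `sm_move` below are stand-ins invented for this sketch.

```c
/* Hypothetical sketch of the instrumentation pattern: a tracked request
 * embeds a state machine that is moved through states as work completes. */
#include <assert.h>
#include <stdio.h>

enum exec_state { EXEC_START, EXEC_APPENDED, EXEC_COMMITTED, EXEC_APPLIED };

struct sm {
	enum exec_state state;
	const char *name;
};

static void sm_init(struct sm *sm, const char *name)
{
	sm->state = EXEC_START;
	sm->name = name;
	printf("%s: start\n", name);
}

static void sm_move(struct sm *sm, enum exec_state next)
{
	assert(next > sm->state); /* transitions only move forward */
	sm->state = next;
	printf("%s: -> %d\n", sm->name, (int)next);
}

/* A leader-side exec request carries its own sm. */
struct exec_req {
	struct sm sm;
	/* ... statement, callback, etc. ... */
};

int main(void)
{
	struct exec_req req;
	sm_init(&req.sm, "exec");
	sm_move(&req.sm, EXEC_APPENDED);  /* entry written to the local log */
	sm_move(&req.sm, EXEC_COMMITTED); /* quorum reached */
	sm_move(&req.sm, EXEC_APPLIED);   /* applied once committed */
	return 0;
}
```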
The tracking of in-memory log entries is the trickiest part of this PR. I was initially uncertain whether to attach SMs to individual log entries at all, but I found this served as a convenient "hub" to connect other state machines together (e.g. bridging the `raft_apply` and `UvAppend` state machines), plus it gives a foothold for tracking how long it takes to apply each entry.

I also added state machines for the append and truncate requests in the raft I/O fixture (`raft/fixture.c`). This was necessary because the code in `raft/replication.c` now assumes that the I/O backend makes an SM available to call `sm_relate` with. The SMs in the fixture are not copies of the ones in the uv I/O backend, but simpler ones.

Finally, I made a few fixes and additions to the sm code.
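To give a feel for the "hub" idea described above, here's a minimal sketch in which the log entry's state machine is related to both the `raft_apply` request and the disk append that persists the entry, so their timelines can be stitched together. The `sm_relate` below only prints the link; it is not the real function from the sm code, and the names are illustrative.

```c
/* Hypothetical sketch of the "hub" idea: the entry's sm bridges the request
 * that created it and the disk write that persists it. */
#include <stdio.h>

struct sm {
	const char *name;
};

static void sm_relate(const struct sm *parent, const struct sm *child)
{
	printf("relate %s -> %s\n", parent->name, child->name);
}

int main(void)
{
	struct sm apply_req = {"raft_apply"};
	struct sm entry = {"log entry"};
	struct sm disk_append = {"UvAppend"}; /* or the fixture's simpler append sm */

	sm_relate(&apply_req, &entry);
	sm_relate(&entry, &disk_append);
	return 0;
}
```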