Our Jepsen tests for ClickHouse Keeper caught an issue with installing a snapshot on a follower. Link to the report: https://s3.amazonaws.com/clickhouse-test-reports/0/d3a514d22175c3f756a969d99523b7472011d04a/clickhouse_keeper_jepsen.html

One of the followers ended up in an invalid state, and we can see from the following logs what caused it:
2023.03.14 06:09:10.682934 [ 2934 ] {} <Debug> RaftInstance: Receive a install_snapshot_request message from 1 with LastLogIndex=1700, LastLogTerm=1, EntriesLength=1, CommitIndex=1715 and Term=1
2023.03.14 06:09:10.682943 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1688
2023.03.14 06:09:10.682944 [ 2934 ] {} <Trace> RaftInstance: cancel existing timer
2023.03.14 06:09:10.682959 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1689
2023.03.14 06:09:10.682964 [ 2934 ] {} <Trace> RaftInstance: re-schedule election timer
2023.03.14 06:09:10.682975 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1690
2023.03.14 06:09:10.682988 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1691
2023.03.14 06:09:10.682996 [ 2934 ] {} <Information> RaftInstance: save snapshot (idx 1700, term 1) offset 0x0, first obj last obj
2023.03.14 06:09:10.683000 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1692
2023.03.14 06:09:10.683006 [ 2934 ] {} <Debug> KeeperStateMachine: Saving snapshot 1700 obj_id 0
2023.03.14 06:09:10.683016 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1693
2023.03.14 06:09:10.683028 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1694
2023.03.14 06:09:10.683057 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1695
2023.03.14 06:09:10.683069 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1696
2023.03.14 06:09:10.683080 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1697
2023.03.14 06:09:10.683094 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1698
2023.03.14 06:09:10.683120 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1699
2023.03.14 06:09:10.683136 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1700
2023.03.14 06:09:10.683148 [ 2936 ] {} <Information> RaftInstance: creating a snapshot for index 1700
2023.03.14 06:09:10.683153 [ 2936 ] {} <Information> RaftInstance: create snapshot idx 1700 log_term 1
2023.03.14 06:09:10.683158 [ 2936 ] {} <Debug> KeeperStateMachine: Creating snapshot 1700
2023.03.14 06:09:10.683184 [ 2936 ] {} <Debug> KeeperStateMachine: In memory snapshot 1700 created, queueing task to flush to disk
2023.03.14 06:09:10.683194 [ 2936 ] {} <Information> RaftInstance: create snapshot idx 1700 log_term 1 done: 36 us elapsed
2023.03.14 06:09:10.683200 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1701
2023.03.14 06:09:10.683213 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1702
2023.03.14 06:09:10.683231 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1703
2023.03.14 06:09:10.683242 [ 2936 ] {} <Trace> RaftInstance: commit upto 1704, current idx 1704
2023.03.14 06:09:10.683254 [ 2936 ] {} <Debug> RaftInstance: DONE: commit upto 1704, current idx 1704
2023.03.14 06:09:10.683259 [ 2936 ] {} <Trace> RaftInstance: commit_cv_ sleep
2023.03.14 06:09:10.683427 [ 2932 ] {} <Trace> RaftInstance: release connection 0x7fafaba55018, idle 1, in-use 0, max 10
2023.03.14 06:09:10.683445 [ 2925 ] {} <Trace> RaftInstance: auto forwarding pkg for leader 1 exists 0x7fafaba42018
2023.03.14 06:09:10.683452 [ 2925 ] {} <Trace> RaftInstance: idle connection 0x7fafaba55018
2023.03.14 06:09:10.684925 [ 2935 ] {} <Trace> RaftInstance: release connection 0x7fafaba55018, idle 1, in-use 0, max 10
2023.03.14 06:09:10.684940 [ 2925 ] {} <Trace> RaftInstance: auto forwarding pkg for leader 1 exists 0x7fafaba42018
2023.03.14 06:09:10.685430 [ 2925 ] {} <Trace> RaftInstance: idle connection 0x7fafaba55018
2023.03.14 06:09:10.685979 [ 2934 ] {} <Debug> KeeperStateMachine: Saved snapshot 1700 to path /home/robot-clickhouse/db/coordination/snapshots/snapshot_1700.bin.zstd
2023.03.14 06:09:10.685992 [ 2934 ] {} <Information> RaftInstance: pause state machine execution, previously ACTIVE, state machine SLEEPING, timeout 0 ms
2023.03.14 06:09:10.685997 [ 2934 ] {} <Information> RaftInstance: successfully receive a snapshot (idx 1700 term 1) from leader
2023.03.14 06:09:10.686001 [ 2934 ] {} <Information> KeeperLogStore: Compact logs up to log index 1700, our max log id is 1705
2023.03.14 06:09:10.686040 [ 2934 ] {} <Information> KeeperLogStore: Compaction up to 1700 finished new min index 1701, new max index 1705
2023.03.14 06:09:10.686050 [ 2934 ] {} <Information> RaftInstance: successfully compact the log store, will now ask the statemachine to apply the snapshot
2023.03.14 06:09:10.686056 [ 2934 ] {} <Debug> KeeperStateMachine: Applying snapshot 1700
While the commit from a previous append was still in progress, we hit an edge case: the follower started receiving an install_snapshot request at the same time.
- We commit index 1691 from the appended logs.
- We receive an install_snapshot request with index 1700.
- We check sm_commit_idx (1691), see that it is smaller than 1700, and accept the snapshot.
- While receiving the snapshot, the commit thread keeps going and commits up to 1704.
- We apply the snapshot, whose data only goes up to 1700.
As a result, we lose the preprocessing of 4 logs (1701 through 1704).
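To make the interleaving concrete, here is a minimal, self-contained C++ sketch that replays the sequence with the indexes from the logs above. The variable names are illustrative stand-ins, not the actual NuRaft members:

```cpp
#include <cstdint>
#include <iostream>

// Illustrative replay of the race using the indexes from the logs above.
// The names only mirror the roles of the real indexes (applied index vs.
// target commit index vs. snapshot index).
int main() {
    uint64_t applied_idx = 1691;        // entries the state machine has applied so far
    uint64_t target_commit_idx = 1704;  // index the commit thread is working towards
    uint64_t snapshot_idx = 1700;       // LastLogIndex of the incoming snapshot

    // Snapshot handler: 1700 > 1691, so the snapshot looks useful and is accepted.
    bool snapshot_accepted = snapshot_idx > applied_idx;

    // Meanwhile the commit thread keeps running and applies entries 1692..1704.
    applied_idx = target_commit_idx;

    // The accepted snapshot is then applied, so the state machine's data only
    // reflects index 1700 even though 1701..1704 were already committed --
    // their preprocessing is lost.
    if (snapshot_accepted) {
        applied_idx = snapshot_idx;
    }

    std::cout << "state machine data reflects index " << applied_idx
              << ", but entries up to " << target_commit_idx
              << " were already committed\n";
    return 0;
}
```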
I think simply changing the check to use quick_commit_index_ is enough, because the intent to commit up to that index should already be enough.
An additional option would be to delete the logs from the snapshot index onward, because if we received a snapshot at that index, the data is already durable on enough nodes (otherwise it wouldn't have been committed on the leader). But that seems like too much complication if it can be fixed with this simple change.
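For clarity, here is a minimal sketch of what the proposed change amounts to, assuming the accept/reject decision for an incoming snapshot boils down to comparing its last log index against one of the follower's commit indexes. Only sm_commit_idx and quick_commit_index_ are names taken from the report; the surrounding struct and function names are hypothetical:

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>

// Hypothetical condensation of the snapshot-accept decision; not the actual
// NuRaft code, just the comparison described above.
struct FollowerIndexes {
    std::atomic<uint64_t> sm_commit_idx{1691};        // applied by the state machine
    std::atomic<uint64_t> quick_commit_index_{1704};  // index we already intend to commit
};

// Current behaviour: only the applied index is consulted, so a snapshot with
// LastLogIndex=1700 looks newer than the follower's state and gets installed.
bool snapshot_is_useful_current(const FollowerIndexes& f, uint64_t snapshot_idx) {
    return snapshot_idx > f.sm_commit_idx.load();       // 1700 > 1691 -> install
}

// Proposed behaviour: treat the intent to commit as sufficient, so the same
// snapshot is skipped because the follower already agreed to commit up to 1704.
bool snapshot_is_useful_proposed(const FollowerIndexes& f, uint64_t snapshot_idx) {
    return snapshot_idx > f.quick_commit_index_.load(); // 1700 > 1704 -> skip
}

int main() {
    FollowerIndexes f;
    std::cout << snapshot_is_useful_current(f, 1700) << ' '
              << snapshot_is_useful_proposed(f, 1700) << '\n';  // prints "1 0"
    return 0;
}
```

With the values from the logs, the current check returns true (install) and the proposed one returns false (skip), which matches the intent described above.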
@greensky00 WDYT?