In one of our tests, NuRaft ended up creating 2 snapshots with a distance of 1 log.
I didn't find anything strange in the logs:
2024.09.28 03:34:06.424741 [ 606 ] {} <Debug> RaftInstance: commit upto 50, current idx 49
2024.09.28 03:34:06.424773 [ 606 ] {} <Trace> RaftInstance: commit upto 50, current idx 50
2024.09.28 03:34:06.424871 [ 606 ] {} <Information> RaftInstance: creating a snapshot for index 50
2024.09.28 03:34:06.424891 [ 606 ] {} <Information> RaftInstance: create snapshot idx 50 log_term 1
2024.09.28 03:34:06.424897 [ 605 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 3 with Result=1, Term=1, NextIndex=51
2024.09.28 03:34:06.425021 [ 596 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 2 with Result=1, Term=1, NextIndex=51
2024.09.28 03:34:06.425072 [ 596 ] {} <Trace> RaftInstance: peer 2, prev matched idx: 49, new matched idx: 50
2024.09.28 03:34:06.425104 [ 596 ] {} <Debug> RaftInstance: reqeust append entries need to catchup, p 2
2024.09.28 03:34:06.425130 [ 596 ] {} <Debug> RaftInstance: append_entries for 2 with LastLogIndex=50, LastLogTerm=1, EntriesLength=0, CommitIndex=50, Term=1, peer_last_sent_idx 50
2024.09.28 03:34:06.425140 [ 606 ] {} <Information> RaftInstance: create snapshot idx 50 log_term 1 done: 221 us elapsed
2024.09.28 03:34:06.425165 [ 606 ] {} <Debug> RaftInstance: DONE: commit upto 50, current idx 50
2024.09.28 03:34:06.425520 [ 592 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 2 with Result=1, Term=1, NextIndex=51
2024.09.28 03:34:06.425991 [ 583 ] {} <Debug> RaftInstance: Receive a client_request message from 0 with LastLogIndex=0, LastLogTerm 0, EntriesLength=1, CommitIndex=0 and Term=0
2024.09.28 03:34:06.426080 [ 583 ] {} <Debug> RaftInstance: append at log_idx 51, timestamp 1727494446426011
2024.09.28 03:34:06.426120 [ 583 ] {} <Debug> RaftInstance: asynchronously replicated 51, return value 0x503000827260
2024.09.28 03:34:06.426310 [ 607 ] {} <Debug> RaftInstance: append_entries for 3 with LastLogIndex=50, LastLogTerm=1, EntriesLength=1, CommitIndex=50, Term=1, peer_last_sent_idx 51
2024.09.28 03:34:06.426435 [ 607 ] {} <Debug> RaftInstance: append_entries for 2 with LastLogIndex=50, LastLogTerm=1, EntriesLength=1, CommitIndex=50, Term=1, peer_last_sent_idx 51
2024.09.28 03:34:06.429853 [ 601 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 2 with Result=1, Term=1, NextIndex=52
2024.09.28 03:34:06.429906 [ 601 ] {} <Trace> RaftInstance: peer 2, prev matched idx: 50, new matched idx: 51
2024.09.28 03:34:06.429984 [ 601 ] {} <Debug> RaftInstance: append_entries for 3 with LastLogIndex=50, LastLogTerm=1, EntriesLength=1, CommitIndex=51, Term=1, peer_last_sent_idx 51
2024.09.28 03:34:06.430093 [ 601 ] {} <Debug> RaftInstance: append_entries for 2 with LastLogIndex=51, LastLogTerm=1, EntriesLength=0, CommitIndex=51, Term=1, peer_last_sent_idx 51
2024.09.28 03:34:06.430190 [ 606 ] {} <Debug> RaftInstance: commit upto 51, current idx 50
2024.09.28 03:34:06.430207 [ 604 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 3 with Result=1, Term=1, NextIndex=52
2024.09.28 03:34:06.430225 [ 606 ] {} <Trace> RaftInstance: commit upto 51, current idx 51
2024.09.28 03:34:06.430262 [ 604 ] {} <Trace> RaftInstance: peer 3, prev matched idx: 50, new matched idx: 51
2024.09.28 03:34:06.430324 [ 582 ] {} <Information> RaftInstance: snapshot idx 50 log_term 1 created, compact the log store if needed
2024.09.28 03:34:06.430391 [ 591 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 2 with Result=1, Term=1, NextIndex=52
2024.09.28 03:34:06.430444 [ 606 ] {} <Information> RaftInstance: creating a snapshot for index 51
2024.09.28 03:34:06.430465 [ 606 ] {} <Information> RaftInstance: create snapshot idx 51 log_term 1
2024.09.28 03:34:06.430502 [ 590 ] {} <Debug> RaftInstance: Receive a append_entries_response message from peer 3 with Result=1, Term=1, NextIndex=52
2024.09.28 03:34:06.430576 [ 606 ] {} <Information> RaftInstance: create snapshot idx 51 log_term 1 done: 86 us elapsed
2024.09.28 03:34:06.430596 [ 606 ] {} <Debug> RaftInstance: DONE: commit upto 51, current idx 51
2024.09.28 03:34:06.431433 [ 583 ] {} <Debug> RaftInstance: Receive a client_request message from 0 with LastLogIndex=0, LastLogTerm 0, EntriesLength=1, CommitIndex=0 and Term=0
2024.09.28 03:34:06.431547 [ 583 ] {} <Debug> RaftInstance: append at log_idx 52, timestamp 1727494446431453
2024.09.28 03:34:06.431598 [ 583 ] {} <Debug> RaftInstance: asynchronously replicated 52, return value 0x5030008273e0
2024.09.28 03:34:06.431686 [ 607 ] {} <Debug> RaftInstance: append_entries for 3 with LastLogIndex=51, LastLogTerm=1, EntriesLength=1, CommitIndex=51, Term=1, peer_last_sent_idx 51
2024.09.28 03:34:06.431817 [ 607 ] {} <Debug> RaftInstance: append_entries for 2 with LastLogIndex=51, LastLogTerm=1, EntriesLength=1, CommitIndex=51, Term=1, peer_last_sent_idx 52
2024.09.28 03:34:06.435829 [ 582 ] {} <Information> RaftInstance: snapshot idx 51 log_term 1 created, compact the log store if needed
The only explanation I see is a small race. If snp_in_progress_ is true, local_snp can change while we run checks before starting snapshot creation.
We need to fetch local_snp again after we are sure we are the only possible thread that is creating a snapshot.
In one of our tests, NuRaft ended up creating 2 snapshots with a distance of 1 log.
I didn't find anything strange in the logs:
The only explanation I see is a small race. If
snp_in_progress_
istrue
,local_snp
can change while we run checks before starting snapshot creation. We need to fetchlocal_snp
again after we are sure we are the only possible thread that is creating a snapshot.