dotnet / dotNext

Next generation API for .NET
https://dotnet.github.io/dotNext/
MIT License
1.62k stars 122 forks source link

Method DotNext.Net.Cluster.Consensus.Raft.LeaderState<TMember>.Replicator.ReadAsync is incorrectly returned "false" despite on the replication is successful #147

Closed alex-urm closed 1 year ago

alex-urm commented 1 year ago

Brief description: DoHeartbeats send empty list to replicate. The list is successfully replicated but the replication is considered failed because no items were actually replicated. As the result the incorrect "Unable to commit local changes" message is logged and no auditTrail.CommitAsync is called.

Details: During heartbeat an empty entries list is replicating: The ReplicateAsync method (in DotNext.Net.Cluster.Consensus.Raft.LeaderState.Replicator):

            var result = currentIndex >= startIndex ?
                auditTrail.ReadAsync(this, startIndex, token) :
                ReadAsync<EmptyLogEntry, EmptyLogEntry[]>(Array.Empty<EmptyLogEntry>(), null, token); // The empty entries list is passed here

When the replication is successfully completed (with result "true") the Complete() method is called:

                if (result.Value)
                {
                    // Member http://localhost:32640/cluster/raft is replicated successfully starting from index 3
                    logger.ReplicationSuccessful(Member.EndPoint, Member.NextIndex);
                    Member.NextIndex = replicationIndex + 1L;
                    Member.ConfigurationFingerprint = fingerprint;
                    result = result with { Value = replicatedWithCurrentTerm };
                }

Note that the returned value (which is "true" is substituted with the replicatedWithCurrentTerm variable which is "false" because of the ReplicateEntries<TEntry, TList> method:

            for (var i = 0; i < entries.Count; i++) // the entries.Count == 0
            {
                if (entries[i].Term == term)
                {
                    replicatedWithCurrentTerm = true;
                    return;
                }
            }
            replicatedWithCurrentTerm = false; // So we execute this code

As the result the value returned from the Complete() method becomes false (because the replicatedWithCurrentTerm == "false"). The "false" value is interpreted as a failure in the ProcessMemberResponse method (DotNext.Net.Cluster.Consensus.Raft.LeaderState) and lead to decreasing the commitQuorum variable.

And finally in the DoHeartbeats method the zero value of the commitQuorum variable is interpreted as inability to replicate state and this is logged as "Unable to commit local changes. Quorum is {Quorum}, last committed entry is {CommitIndex}"

How to fix: If this is an intentional behavior - replace the log entry to less scary text when there is nothing to replicate. If this is not the intentional behavior - make sure that the local changes are committed

alex-urm commented 1 year ago

Research is not completed although the issue is real

alex-urm commented 1 year ago

Research completed. DoHeartbeats send empty list to replicate. The list is successfully replicated but the replication is considered failed because no items were actually replicated. As the result the incorrect "Unable to commit local changes" message is logged and no auditTrail.CommitAsync is called.

sakno commented 1 year ago

This is intended behavior, because the leader has nothing to commit. It doesn't prevent cluster leading. The message is only visible if Debug level enabled. Yes, I can change the text to avoid misunderstanding.