eBay / NuRaft

C++ implementation of Raft core logic as a replication library
Apache License 2.0
1.01k stars 237 forks source link

Registering multiple callback functions for node state change #216

Closed kishorekrd closed 2 years ago

kishorekrd commented 3 years ago

How can I register multiple callback functions for node state change. For example callback function for node becoming leader and another for node becoming follower. At what stage I should register callback function? Can I register after the init_raft()?

greensky00 commented 3 years ago

Hi @kishorekrd

Why do you want to register multiple callbacks? One callback function can handle multiple events for both being leader https://github.com/eBay/NuRaft/blob/1d66df6abe37ec4c754fb012d19d038860031a7b/include/libnuraft/callback.hxx#L66 and follower https://github.com/eBay/NuRaft/blob/1d66df6abe37ec4c754fb012d19d038860031a7b/include/libnuraft/callback.hxx#L97

Registering callback is the part of initialization, and you can't change it after that.

kishorekrd commented 3 years ago

Thanks for the response. Can you provide some sample code for registering callback during the init_raft() function? I see that

class raft_server : public std::enable_shared_from_this<raft_server> {
    friend class nuraft_global_mgr;
public:
    struct init_options {
        init_options()
            : skip_initial_election_timeout_(false)
            {}

        /**
         * If `true`, the election timer will not be initiated
         * automatically, so that this node will never trigger
         * leader election until it gets the first heartbeat
         * from any valid leader.
         *
         * Purpose: to avoid becoming leader when there is only one
         *          node in the cluster.
         */
        bool skip_initial_election_timeout_;

        /**
         * Callback function for hooking the operation.
         */
        cb_func::func_type raft_callback_;
    };
----------
    /**
     * (Read-only, but its contents will change)
     * Server context.
     */
    std::unique_ptr<context> ctx_;
--------------
struct context {
   /**
     * Register an event callback function.
     *
     * @param func Callback function to register.
     */
    void set_cb_func(cb_func::func_type func) {
        cb_func_ = cb_func(func);
    }
------------------------
// Initialize Raft server.
    stuff.raft_instance_ = stuff.launcher_.init(stuff.sm_,
                                                stuff.smgr_,
                                                stuff.raft_logger_,
                                                stuff.port_,
                                                asio_opt,
                                                params);
------------
Looks like I have to provide one more option "const raft_server::init_options& opt" here to init. 
opt.raft_callback_ = user_callback;

Is it correct?
greensky00 commented 3 years ago

@kishorekrd yes, that's correct.

kishorekrd commented 3 years ago

Thanks for the response. One more question. I see that there is a single "nuraft_commit" thread. Is that means all commits will be serialized. do I still need to worry about multi thread synchronization in state machine commit call

ptr<buffer> commit(const ulong log_idx, buffer& data)
greensky00 commented 3 years ago

Yes, commit is invoked by a single thread at a time, in the log_idx order.

kishorekrd commented 3 years ago

Thanks for your fast responses. Couple of more questions regarding saving the state and snapshot. Calculator example has one single variable as state

calc_state_machine()
        : cur_value_(0)
        , last_committed_idx_(0)
        {}

What if state machine is a C++ map object, that stores key/value mappings, how the following operations change?

save_state(const srv_state& state): ptr buf = state.serialize(); Write buf to disk

Does the state includes the entire map object? How often this is saved? I see read_state reads the disk but doesn't deserialize the data, is that means deserialized data is written to disk?

Snapshot:

// Put into snapshot map.
        ptr<snapshot_ctx> ctx = cs_new<snapshot_ctx>(ss, cur_value_);

Here instead of curvalue, map object will be used. Is it going to take deep copy of entire map object as sanpshot?

Persistence: For every commited operation like adding and deleting keys to this map, how can I make it persistent ? Is there any way to save delta of map object and flush to disk? or save the append_log entries to disk and reapply to recreate the map for recovery. Any other alternative methods?

greensky00 commented 3 years ago

Hi @kishorekrd

srv_state is not related to the user-defined state machine, it just stores the term and voting information which are independent of the user-defined contents https://github.com/eBay/NuRaft/blob/8715f27ac3fb3cd2fd6df946b7628123171e8340/include/libnuraft/srv_state.hxx#L156

What you need to do for load_state and save_state is read/write the given srv_state from/to disk. The implementation detail is up to you: you can use a raw file or a database to store it.

Regarding snapshot -- it should be logically an entire instance of the map at the given moment. But it doesn't need to be a physical deep clone of the state machine. As long as you can implement read_logical_snp_obj and save_logical_snp_obj correctly, the implementation detail is up to you. In eBay, we are using an MVCC storage engine so that we don't maintain a whole clone for each snapshot.

Is there any way to save delta of map object and flush to disk?

Again, it is up to you; it depends on how you implement the commit function of the state machine. You can invoke fsync of the entire map for each commit, which will be super expensive. Or you can make append and write_at functions in log_store call fsync every time, and then rely on the Raft logs (i.e., the delta of the map) and replay them to the state machine as a recovery.

kishorekrd commented 3 years ago

Thanks for your response. I see that there are followers (who participate in the leader election) and learners (Who doesn't participate in the leader election). Both of them gets leader heartbeat and also participate in the state machine replication. Is there any node type that participate in the leader election but does not participate in the state machine replication?

I am trying to increase the cluster size by restricting the state machine replication only for certain number of nodes.

For example if there are 10 nodes in the cluster, I want to replicate the state machine only on 5 nodes. Remaining nodes only gets leader heartbeat and also participate in the leader election.

greensky00 commented 3 years ago

Is there any node type that participate in the leader election but does not participate in the state machine replication?

No, there can't be such a member because replicated log information is needed for the correct vote. Otherwise, a wrong leader can be elected by votes from such members.

If what you want is to replicate logs, but not to apply them to the state machine, then you can check the type of member in the state_machine::commit(), and simply do nothing.

kishorekrd commented 3 years ago

OK, that make sense. I am trying to reduce the traffic between the leader and other nodes so that we can increase the cluster size. In your previous responses, you mentioned that biggest cluster size that you are using is 10. Is there any way to increase the cluster size?

greensky00 commented 3 years ago

10 is not the hard limit, and you can increase the cluster size with more burden to the leader. You can find your optimal size.

But if you want to maintain many replicas more than that, maybe you can think about tiered replication. There is a tier-1 Raft cluster with N replicas and each replica in tier-1 is a leader of tier-2 Raft cluster with also N replicas. You can serve N * N replicas by having N + 1 Raft clusters, while each Raft cluster manages N replicas only.

kishorekrd commented 3 years ago

Tiered replication idea looks good. That means Tier-1 servers will have 2 instances of Raft server. Is there any special configuration I have to do for this to work other than using different network port for each raft instances?

greensky00 commented 3 years ago

There is no such support by NuRaft, you need to set it up by yourself.