async snapshot install support

calvin2021y commented 10 months ago

Last time I checked canonical/raft, there was no async snapshot install yet (there is async snapshot, but not async snapshot install).

I am planning to use this library to replace canonical/raft. Does it already support async snapshot install, or is there a plan to support it?

freeekanayaka commented 10 months ago

Yes, my plan is to revisit the API more or less following the design outlined here:

https://github.com/canonical/raft/issues/430

The core part of that design has already been implemented in this library (see the new raft_step() API in include/raft.h), but there is still quite a bit of work left. Roughly:

update the libuv-based backend to use the new API and model
implement an alternative io_uring-based backend which should be faster than the current libuv one and leverage the new design
write documentation for the new API and concepts

Backward compatibility with the current v0 of the API will be retained throughout the development of v1, and even after v1 is complete I'll keep the backward compatibility code around for a while for people depending on it.

In the v1 API you'll be able to choose everything about async/sync behavior, no need to implement special/additional interfaces via new struct raft_io methods.

Basically your code will drive struct raft instead of struct raft driving your code.

For example, taking a snapshot should look something like this:

/* Your application takes a snapshot of your state machine. You
 * decide whether to take the snapshot synchronously or
 * not. Once the snapshot operation has completed you inform
 * the `struct raft` object in this way: */
struct raft_event event;
struct raft_update update;

event.type = RAFT_SNAPSHOT;
event.snapshot.metadata.index = <last index applied to your fsm when the snapshot was taken>;
event.snapshot.metadata.term = <term of last index>;

raft_step(r, &event, &update);

And that's it. The struct raft_update object might contain further actions that need to be taken by user code (such as sending messages, persisting entries). User code will be able to do that synchronously or asynchronously, no strict requirement. There will be default backends/helpers based on libuv and io_uring.
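To make the "your code drives struct raft" idea concrete, here is a minimal sketch of the event-in/update-out pattern. Everything in it is hypothetical: the `demo_*` types, the two flags and the toy rule "persist first, then send" are stand-ins invented for illustration and are not the fields or semantics of the real struct raft_event and struct raft_update.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical; they only mimic the shape of the
 * event-in/update-out design and are not the library's API. */
enum demo_event_type { DEMO_SUBMIT, DEMO_PERSISTED };

struct demo_event
{
    enum demo_event_type type;
    uint64_t index; /* Log index the event refers to. */
};

#define DEMO_UPDATE_PERSIST 0x1u /* Caller should write entries to disk. */
#define DEMO_UPDATE_SEND 0x2u    /* Caller should send messages to peers. */

struct demo_update
{
    unsigned flags;
    uint64_t index;
};

struct demo_machine
{
    uint64_t last_submitted;
    uint64_t last_persisted;
};

/* Pure step: mutates only `m` and `update`, performs no I/O and invokes no
 * callback. */
static void demo_step(struct demo_machine *m,
                      const struct demo_event *event,
                      struct demo_update *update)
{
    update->flags = 0;
    update->index = event->index;
    switch (event->type) {
        case DEMO_SUBMIT:
            assert(event->index > m->last_submitted);
            m->last_submitted = event->index;
            update->flags |= DEMO_UPDATE_PERSIST;
            break;
        case DEMO_PERSISTED:
            assert(event->index <= m->last_submitted);
            m->last_persisted = event->index;
            update->flags |= DEMO_UPDATE_SEND;
            break;
    }
}

/* Counters standing in for real disk and network I/O. */
struct demo_io
{
    unsigned writes;
    unsigned sends;
};

/* Driver: user code feeds an event, then carries out whatever the update asks
 * for. Here the "disk write" completes synchronously, so the driver feeds the
 * DEMO_PERSISTED event right away; an asynchronous backend would feed it from
 * its completion handler instead. */
static void demo_drive(struct demo_machine *m,
                       struct demo_io *io,
                       struct demo_event event)
{
    struct demo_update update;
    demo_step(m, &event, &update);
    if (update.flags & DEMO_UPDATE_PERSIST) {
        io->writes++;
        struct demo_event done = {DEMO_PERSISTED, update.index};
        demo_step(m, &done, &update);
    }
    if (update.flags & DEMO_UPDATE_SEND) {
        io->sends++;
    }
}
```

The point of the sketch is that the choice between synchronous and asynchronous I/O lives entirely in `demo_drive()`, while `demo_step()` stays unchanged.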

Hope that helps, I'll provide more details down the road.

calvin2021y commented 10 months ago

This is very helpful. The design is clear and I like it very much.

1) Will the libuv-based backend be implemented first? (I guess io_uring could be faster, but libuv is more stable and universal.)

2) Is raft_step() meant for a single thread?

freeekanayaka commented 10 months ago

This is very helpful. The design is clear and I like it very much.

Will the libuv-based backend be implemented first? (I guess io_uring could be faster, but libuv is more stable and universal.)

Yes, the libuv-based backend will be implemented first.

Is raft_step() meant for a single thread?

The struct raft object will be a pure state machine. When you call raft_step() you will just advance the state of the state machine. No "external" function or callback will be called. No raft_io interface, no system call, no user callback: raft_step() just takes a struct raft_event as input that describes what happened (e.g. the libuv backend has received a message, or your user code has taken a snapshot, etc) and puts some information in the output struct raft_update parameter (e.g. new messages that should be sent, new entries that should be persisted, etc).

So raft_step() is just a pure function call with nothing else happening except modifying struct raft fields, exactly like sprintf() just modifies the buffer you pass it.

You can call raft_step() from multiple threads if you want, but you'll have to use a mutex. The struct raft object just provides the pure logic of the Raft algorithm, it will make no decision about the backend implementation: you can have a single-threaded backend, a multi-threaded one, a synchronous or asynchronous one.
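A minimal sketch of the mutex approach, assuming POSIX threads. `struct toy_raft` and `toy_step()` are hypothetical stand-ins for struct raft and raft_step(); a real program would call raft_step() inside the critical section instead.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for `struct raft`: a pure state machine whose only
 * state is a counter of processed events. */
struct toy_raft
{
    unsigned long steps;
};

/* Hypothetical stand-in for raft_step(): it only mutates the struct it is
 * given, so concurrent callers must serialize access themselves. */
static void toy_step(struct toy_raft *r)
{
    r->steps++;
}

/* The state machine paired with the mutex that protects it. */
struct shared_raft
{
    pthread_mutex_t mutex;
    struct toy_raft raft;
};

static void shared_raft_init(struct shared_raft *s)
{
    int rv = pthread_mutex_init(&s->mutex, NULL);
    assert(rv == 0);
    (void)rv;
    s->raft.steps = 0;
}

/* Every thread goes through this wrapper, never through toy_step() directly. */
static void shared_raft_step(struct shared_raft *s)
{
    pthread_mutex_lock(&s->mutex);
    toy_step(&s->raft);
    pthread_mutex_unlock(&s->mutex);
}

#define STEPS_PER_WORKER 1000

/* Thread entry point with the signature expected by pthread_create(). */
static void *shared_raft_worker(void *arg)
{
    struct shared_raft *s = arg;
    for (int i = 0; i < STEPS_PER_WORKER; i++) {
        shared_raft_step(s);
    }
    return NULL;
}
```

Each thread started with pthread_create() can run `shared_raft_worker()` on the same `struct shared_raft`; because the step itself does no I/O, the lock is only held for the duration of the state transition.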

freeekanayaka commented 8 months ago

@calvin2021y the low-level core part of the v1 API/design is now basically in place and already available in the main branch. You can read its initial documentation here:

https://raft.readthedocs.io/core.html

The documentation is still incomplete, but should be enough to get a sense of the actual new low-level API, which is supposed to be very general and flexible (at the price of being a bit complex).

End users are not expected to consume this low-level core API directly; they will instead consume a simpler API that also offers out-of-the-box support for network and disk I/O. As I mentioned, the first higher-level API that I'm going to implement will be based on libuv, and from the user's point of view it will feel something like this:

https://raft.readthedocs.io/quick-start.html

(not yet implemented).

The complete high-level libuv API that I have in mind is:

/* Hold metadata associated with a snapshot. */
struct raft_snapshot_metadata
{
    /* Index and term of last entry included in the snapshot. */
    raft_index index;
    raft_term term;

    /* Last committed configuration included in the snapshot, along with the
     * index it was committed at. */
    struct raft_configuration configuration;
    raft_index configuration_index;
};

struct uv_raft_s; /* libuv Raft handle */

/* init/close */
int uv_raft_init(struct uv_loop_s *loop, struct uv_raft_s *handle);
int uv_raft_close(struct uv_raft_s *handle, uv_close_cb close_cb);

/* Callback invoked every time a new log entry gets committed. The user can
 * process the entry data either synchronously or asynchronously. If data is
 * processed asynchronously, and a new entry is committed while the previous one
 * is still being processed, the user should queue up the new entry and process
 * it when possible. */
typedef void (*uv_raft_commit_cb)(struct uv_raft_s *,
                                  raft_index index,
                                  int type,
                                  uv_buf_t *data);

/* Callback invoked when a snapshot should be installed, replacing all of the
 * user's FSM state. Similarly to the commit callback, the user can process
 * the snapshot data either synchronously or asynchronously. */
typedef void (*uv_raft_install_cb)(struct uv_raft_s *,
                                   struct raft_snapshot_metadata *metadata,
                                   uv_buf_t *data);

/* Callback invoked whenever the raft state changes. For example this can be
 * used to fail pending requests when a leader steps down. */
typedef void (*uv_raft_state_cb)(struct uv_raft_s *,
                                 int old_state,
                                 int new_state);

/* Start a raft handle. The commit, install and state callbacks will be invoked
 * as long as the libuv loop runs. */
int uv_raft_start(struct uv_raft_s *handle,
                  const char *dir,
                  uv_raft_commit_cb commit_cb,
                  uv_raft_install_cb install_cb,
                  uv_raft_state_cb state_cb);

/* Submit a new entry to append to the log. The commit callback will be invoked
 * if and when the entry is successfully committed. */
int uv_raft_submit(struct uv_raft_s *handle, int type, uv_buf_t *data);

/* Users can start taking a snapshot at any time, either synchronously or
 * asynchronously. Once the process of taking a snapshot has completed they only
 * have to invoke this function. */
int uv_raft_snapshot(struct uv_raft_s *handle,
                     struct raft_snapshot_metadata *metadata,
                     uv_buf_t *data);

I think this API should be able to support both synchronous and asynchronous FSMs and snapshots.
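The queueing that the commit callback comment asks for could be sketched as a small FIFO. This is only an illustration: the `pending_*` names are hypothetical, `uint64_t` stands in for raft_index, and a plain pointer plus length stands in for uv_buf_t so that the example does not depend on libuv. The sketch copies the payload, because the proposal does not say who owns the buffer once the callback returns.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One committed entry waiting to be processed. */
struct pending_entry
{
    uint64_t index;
    int type;
    void *data; /* Private copy owned by the queue. */
    size_t len;
    struct pending_entry *next;
};

struct pending_queue
{
    struct pending_entry *head; /* Oldest entry, processed first. */
    struct pending_entry *tail; /* Newest entry. */
};

/* Meant to be called from the commit callback while an earlier entry is still
 * being processed. Returns 0 on success, -1 if out of memory. */
static int pending_push(struct pending_queue *q,
                        uint64_t index,
                        int type,
                        const void *data,
                        size_t len)
{
    struct pending_entry *e = malloc(sizeof *e);
    if (e == NULL) {
        return -1;
    }
    e->data = malloc(len > 0 ? len : 1);
    if (e->data == NULL) {
        free(e);
        return -1;
    }
    if (len > 0) {
        memcpy(e->data, data, len);
    }
    e->index = index;
    e->type = type;
    e->len = len;
    e->next = NULL;
    if (q->tail == NULL) {
        q->head = e;
    } else {
        q->tail->next = e;
    }
    q->tail = e;
    return 0;
}

/* Detach the oldest entry, or return NULL if the queue is empty. The caller
 * processes it and then releases it with pending_free(). */
static struct pending_entry *pending_pop(struct pending_queue *q)
{
    struct pending_entry *e = q->head;
    if (e == NULL) {
        return NULL;
    }
    q->head = e->next;
    if (q->head == NULL) {
        q->tail = NULL;
    }
    e->next = NULL;
    return e;
}

static void pending_free(struct pending_entry *e)
{
    assert(e != NULL);
    free(e->data);
    free(e);
}
```

An asynchronous FSM would call `pending_pop()` each time it finishes processing the previous entry, which preserves commit order.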

However, it's still designed for FSMs that can live in memory and whose snapshots are small enough to also live in memory and fit in a uv_buf_t buffer without causing problems.

The low-level core API does not have this limitation, and can also support chunked snapshots. I've tried to come up with a higher-level API that also supports chunked snapshots, but it needs to be more complex than the simple one above. So perhaps it could be left for a later iteration.

What do you think? Is this simple version of the libuv integration enough for your use cases?

calvin2021y commented 8 months ago

Thanks for the great work.

1) If I start an async snapshot install, do I need to tell raft anything when it has finished? During this period, could uv_raft_commit_cb get fired? (I think I need to queue up the log entries and apply them after the install has finished.)

2) Please consider adding encrypt/decrypt callbacks to allow the user to handle the RPC request/response messages.

calvin2021y commented 8 months ago

typedef void (*uv_raft_commit_cb)(struct uv_raft_s *, raft_index index, int type, uv_buf_t *data);

The data should always be freed by the callback, so that it is easy to queue up. Maybe you can put this into the documentation.

freeekanayaka commented 8 months ago

Thanks for the great work.

Thanks for the feedback!

If I start an async snapshot install, do I need to tell raft anything when it has finished?

I'm not entirely sure about this yet, strictly speaking it's not needed.

But to make things a bit more robust, I think it'd be good to add a function like:

void uv_raft_applied(struct uv_raft_s *handle, raft_index index);

To inform the engine that a certain commit index was processed. You would need to call that function also after installing a snapshot, passing the index of the snapshot.

This might be needed to avoid considering a configuration change committed before all other committed indexes preceding it are processed by the user. It's probably not relevant for most uses, so I'm not sure.

During this period, could uv_raft_commit_cb get fired?

This is up for discussion. If we want that, then I think we'd need the uv_raft_applied() API, otherwise there's no way to know that the snapshot has been installed.

(I think I need to queue up the log entries and apply them after the install has finished.)

This is surely an option too, but I'm open to suggestions about this.
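As a rough sketch of that queue-and-apply-later option: the code below defers commits that arrive while a snapshot install is in progress and replays them once the install completes. All `toy_*` names are hypothetical, only indexes are tracked (no entry data), and `uint64_t` stands in for raft_index. Whether commits can actually be delivered during an install is still open in this thread, so treat this as one possible user-side policy rather than required behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEFERRED 64

struct toy_fsm
{
    bool installing;       /* True while a snapshot install is in progress. */
    uint64_t last_applied; /* Index of the last entry reflected in the FSM. */
    uint64_t deferred[MAX_DEFERRED];
    size_t n_deferred;
};

static void toy_apply(struct toy_fsm *f, uint64_t index)
{
    assert(index > f->last_applied);
    f->last_applied = index;
    /* A real FSM would apply the entry data here, and could then report
     * progress, e.g. through the uv_raft_applied() function proposed above. */
}

/* Commit handler. Returns 0 on success, -1 if the deferral buffer is full. */
static int toy_on_commit(struct toy_fsm *f, uint64_t index)
{
    if (f->installing) {
        if (f->n_deferred == MAX_DEFERRED) {
            return -1;
        }
        f->deferred[f->n_deferred++] = index;
        return 0;
    }
    toy_apply(f, index);
    return 0;
}

static void toy_install_begin(struct toy_fsm *f)
{
    f->installing = true;
}

/* Called when the asynchronous install has finished. Deferred entries already
 * covered by the snapshot are skipped, the rest are applied in order. */
static void toy_install_done(struct toy_fsm *f, uint64_t snapshot_index)
{
    f->last_applied = snapshot_index;
    f->installing = false;
    for (size_t i = 0; i < f->n_deferred; i++) {
        if (f->deferred[i] > f->last_applied) {
            toy_apply(f, f->deferred[i]);
        }
    }
    f->n_deferred = 0;
}
```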

Please consider adding encrypt/decrypt callbacks to allow the user to handle the RPC request/response messages.

You mean a decode/encode hook? (That's a bit more generic, not necessarily encrypt/decrypt.)

What's your use case? TLS?

I think it'd be nice to support TLS natively.

Anyway, some way to customize the wire format of RPC request/response messages might be a good idea indeed.
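One possible shape for such a hook, purely as a sketch: a pair of function pointers that a transport would call before writing and after reading a message. The `struct wire_codec` type and all function names are hypothetical and not part of any existing API, and the toy codec below only prepends a marker byte; an encrypting or compressing codec would plug into the same two slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encode/decode hook pair. Each function transforms `in` into
 * `out` (capacity `cap`) and returns the number of bytes written, or -1 on
 * error. */
struct wire_codec
{
    long (*encode)(const uint8_t *in, size_t len, uint8_t *out, size_t cap);
    long (*decode)(const uint8_t *in, size_t len, uint8_t *out, size_t cap);
};

#define FRAME_MAGIC 0xA5

/* Toy encoder: prepend a single marker byte to the message. */
static long frame_encode(const uint8_t *in,
                         size_t len,
                         uint8_t *out,
                         size_t cap)
{
    if (cap < 1 || cap - 1 < len) {
        return -1;
    }
    out[0] = FRAME_MAGIC;
    memcpy(out + 1, in, len);
    return (long)(len + 1);
}

/* Toy decoder: check and strip the marker byte. */
static long frame_decode(const uint8_t *in,
                         size_t len,
                         uint8_t *out,
                         size_t cap)
{
    if (len < 1 || in[0] != FRAME_MAGIC || cap < len - 1) {
        return -1;
    }
    memcpy(out, in + 1, len - 1);
    return (long)(len - 1);
}

static const struct wire_codec frame_codec = {frame_encode, frame_decode};

/* What a transport might do just before writing a message to the socket. */
static long send_with_codec(const struct wire_codec *c,
                            const uint8_t *msg,
                            size_t len,
                            uint8_t *wire,
                            size_t cap)
{
    assert(c != NULL && c->encode != NULL);
    return c->encode(msg, len, wire, cap);
}
```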

calvin2021y commented 8 months ago

Using an haproxy or nginx proxy can do the TLS work without changing the raft code (this requires adding, on each node's host, a non-TLS entry point for every other node).

I am considering using a fast AEAD cipher without TLS, to avoid the slow TLS handshake and its overhead, and maybe adding lz4 compression before encryption for non-entry messages. This method could provide almost the same latency as raw TCP.

cowsql / raft

async snapshot install support #123