Functionallity for snapshot suggestion

atlesn commented 7 months ago

To test my implementation of snapshoting, I added a function to suggest that the server performs a snapshot without needing to change the threshold to trigger snapshot to be taken.

I don't know if this is actually useful or not and I also don't know if we're supposed to add stuff to the legacy-interface. I haven't written any tests for this either.

Please comment on this whether or not this is something to continue working on.

freeekanayaka commented 7 months ago

Do you need this only for testing?

In the v1.x version of this Raft library, this functionality is not needed (in particular, the RAFT_SUGGEST_SNAPSHOT event is redundant, since you can just fire the RAFT_SNAPSHOT event).

So I'm slightly hesitant to add this functionality.

See https://raft.readthedocs.io/core.html for details of the v1.x API, whose core part is already in place, but unfortunately still lacks the libuv integration part.

All that being said, if you just need this functionality for testing, perhaps you could for example manually set the threshold to 0 and then submit a barrier entry, to force triggering a snapshot. Or something along those lines.

atlesn commented 7 months ago

Usage of snapshot suggestion function in RRR

Functional test in RRR test framework

atlesn commented 7 months ago

Do you need this only for testing?

In the v1.x version of this Raft library, this functionality is not needed (in particular, the RAFT_SUGGEST_SNAPSHOT event is redundant, since you can just fire the RAFT_SNAPSHOT event).

So I'm slightly hesitant to add this functionality.

See https://raft.readthedocs.io/core.html for details of the v1.x API, whose core part is already in place, but unfortunately still lacks the libuv integration part.

All that being said, if you just need this functionality for testing, perhaps you could for example manually set the threshold to 0 and then submit a barrier entry, to force triggering a snapshot. Or something along those lines.

Yes, it's just for testing to verify that the state produced by the state machine is correct after snapshoting. I'll see if I can implement it like you suggest, without changing my high level test, although I'm a little confused about these ABI versions :)

freeekanayaka commented 7 months ago

Do you need this only for testing? In the v1.x version of this Raft library, this functionality is not needed (in particular, the RAFT_SUGGEST_SNAPSHOT event is redundant, since you can just fire the RAFT_SNAPSHOT event). So I'm slightly hesitant to add this functionality. See https://raft.readthedocs.io/core.html for details of the v1.x API, whose core part is already in place, but unfortunately still lacks the libuv integration part. All that being said, if you just need this functionality for testing, perhaps you could for example manually set the threshold to 0 and then submit a barrier entry, to force triggering a snapshot. Or something along those lines.

Yes, it's just for testing to verify that the state produced by the state machine is correct after snapshoting. I'll see if I can implement it like you suggest, without changing my high level test, although I'm a little confused about these ABI versions :)

Yeah, that's unfortunate indeed. In a nutshell:

v0.x is the API that we had so far and that is used by cowsql and dqlite. It contains functions like raft_start() and raft_submit(), and it's mainly callback-based. The v0.x API is complete and has an I/O backend based on libuv. It's very easy to consume the v0.x API since it has everything you need to get a project running.
v1.x is a new API with a more flexible design. The core part of the v1.x API is already completed, and in fact the v0.x API is now just a shim around the v1.x API (the shim code is contained in src/legacy.c). However, the v1.x API does not yet have an I/O backend, so you can't actually use it for a project yet.

You can find some of the motivation of the new v1.x design here:

https://github.com/canonical/raft/issues/430

atlesn commented 7 months ago

Hi

I find there's a lot of resistance when using the UV implementation for multiple reasons especially as the application I'm integrating raft into already has IO and event stuff in place. I will instead try using the v1.x ABI so this PR could be closed. I now nevertheless have a working test suite which is helpful when implementing the new ABI which would require a complete rewrite.

I can see already now that there are some documentation issues with the ABI v1.x, like incorrect information in comments above for instance for raft_step, do you have any plans for this?

freeekanayaka commented 7 months ago

Hi

I find there's a lot of resistance when using the UV implementation for multiple reasons especially as the application I'm integrating raft into already has IO and event stuff in place. I will instead try using the v1.x ABI so this PR could be closed. I now nevertheless have a working test suite which is helpful when implementing the new ABI which would require a complete rewrite.

What event loop are you using? My short-term goal would be to use libuv to create an I/O backend for the v1.x API, but later on I'd like to build a io_uring-based backend.

I can see already now that there are some documentation issues with the ABI v1.x, like incorrect information in comments above for instance for raft_step, do you have any plans for this?

If you could find issues or open PRs for that, it'd be great, thanks. There is definitely work to do in order to better document the v1.x API. If you could point to me the parts where you are having troubles, that would be a good way for me to know where more information is needed.

atlesn commented 7 months ago

What event loop are you using? My short-term goal would be to use libuv to create an I/O backend for the v1.x API, but later on I'd like to build a io_uring-based backend.

I am using libevent but also an abstraction layer on top of that.

A solution without UV could have been to have a protocol layer on top of v1.x with which you can exchange network data to/from other nodes and also disk I/O data. You would then only implement the sockets, disk I/O and event parts yourself. There could be functions like bytes = raft_write(char *buf, size_t bufsize, ...), bytes = raft_read(const char *buf, size_t bufsize), raft_disk_write(...) etc. to move data in and out of the protocol. The stepping would happen as needed within the feeder functions and then you would need to get the next timer back somehow, maybe have a raft_need_step function.

As a start maybe the network packet (en/de)coding could be moved out of the uv framework and afterwards the raft logic. It would then be pretty straight forward to make implementations with any event loop or even without event loop.

What are your thoughts on that?

If you could find issues or open PRs for that, it'd be great, thanks. There is definitely work to do in order to better document the v1.x API. If you could point to me the parts where you are having troubles, that would be a good way for me to know where more information is needed.

I will certainly do that.

freeekanayaka commented 7 months ago

What event loop are you using? My short-term goal would be to use libuv to create an I/O backend for the v1.x API, but later on I'd like to build a io_uring-based backend.

I am using libevent but also an abstraction layer on top of that.

A solution without UV could have been to have a protocol layer on top of v1.x with which you can exchange network data to/from other nodes and also disk I/O data. You would then only implement the sockets, disk I/O and event parts yourself. There could be functions like bytes = raft_write(char *buf, size_t bufsize, ...), bytes = raft_read(const char *buf, size_t bufsize), raft_disk_write(...) etc. to move data in and out of the protocol. The stepping would happen as needed within the feeder functions and then you would need to get the next timer back somehow, maybe have a raft_need_step function.

As a start maybe the network packet (en/de)coding could be moved out of the uv framework and afterwards the raft logic. It would then be pretty straight forward to make implementations with any event loop or even without event loop.

What are your thoughts on that?

I'm don't entirely understand what you mean, but it seems I was thinking something similar. My plan was roughly to:

Implement a libuv-based I/O backend on top of the current v1.x core API (i.e. raft_step())
After that is in place, come up with some kind of abstractions that extract the protocol/data-flow logic from the libuv-backend, making that generic and event-loop agnostic
Implement a io_uring-based I/O backend using the abstractions from 2.

I don't think it's entirely as straightforward as it seems, at least if we want to make everything pull-driven (i.e. no callbacks) and non-blocking.

I haven't thought about this hard enough yet though, so if you want to elaborate more on your ideas, I'm happy to hear them and perhaps get some inspiration.

If you could find issues or open PRs for that, it'd be great, thanks. There is definitely work to do in order to better document the v1.x API. If you could point to me the parts where you are having troubles, that would be a good way for me to know where more information is needed.

I will certainly do that.

Thanks!

atlesn commented 7 months ago

I was thinking about the "pull" method as well and how this could be done, I've thought a bit more on how to implement that. I think the pulling you suggest is very straight forward! It could also be quite efficient as it allows for tight loops when performing misc tasks without long call trees. Let's hear your thoughts on this :)

Let's say we had the following layers:

Input/Output including any event loop and FSM (implementer fully responsible for this layer). Receives tasks from the Bridge layer and exchanges data blobs.
Bridge (new layer) will handle encoding/decoding and calls into existing ABI v1.x as needed. It can also be considered as an extension to ABI v1.x, although the new I/O layer need only use the new functions.
Raft handles the basic raft logic

Whenever Bridge is called from IO, the structure raft_tasks is passed. The Bridge may populate this structure with IO and FSM tasks. The IO layer must perform the tasks immediately before acknowledging that the tasks are complete.

The new Bridge layer exposes these functions:

raft_net_read delivers incoming network data from IO into Bridge.
raft_file_read gives any read file contents from IO to Bridge if that was requested in a task.
raft_acknowledge tells the Bridge that tasks has been performed.
raft_stash is to store a new message. Automatic forwarding to the leader could be implemented here.

The IO layer knows nothing about the contents of network packets or file data. It simply exchanges data blobs with the Bridge layer and will create any network connections and open files as needed. On the other hand, the Brige layer does not know anything about network connections or file system.

I don't think the raft_file_read can produce new tasks as it may be called multiple times before an acknowledgement if there are many file read tasks at once. In that case it is not passed the tasks structure or it can be passed as const.

Receiving/sending network data from/to other servers

As a thought flow, whenever data is received on a network connection, the function raft_net_read from the Bridge layer is called. The implementer gets feedback from raft_net_read on how many bytes were consumed e.g. when a whole frame or packet was processed.

raft_net_read, after receiving a full packet, might call into the raft code using raft_step or other functions and then checks for update flags. If any update requires IO or FSM application, tasks are created for this. raft_net_read may be called with zero byte buffer in case of timeout or when initially starting up. raft_acknowledge and raft_stash my also call raft_step.

The Bridge layer need not worry about any network connections. It simply informs the IO layer about which server ID data in the tasks is destined for. Incoming data is also distinguished using server ID. The IO layer hence needs to know which connections it has for which server ID. This also does not need to be actual connections if another means of transport is used.

If there are problems with sending outgoing network data, the IO layer just calls acknowledge like if the data was sent. The situation will be detected by other means by the Bridge layer. As an optimization, there could be a helper function in the bridge to check for which servers the IO layer needs to communicate with (e.g. which server is part of the cluster or is the leader).

Getting IO tasks from the Bridge layer

The structure raft_tasks is used to inform the IO layer about any IO tasks needing to be performed. The structure is passed as a pointer into raft_net_read and other functions, and after each call, the IO layer must check if it was populated with any tasks. The structure has three arrays of types raft_net_task, raft_file_task and raft_fsm_task. The IO layer iterates these lists and does any of the required IO or FSM operations described within.

Memory for the tasks is owned and handled by the Bridge layer and it may be re-used or arena-allocated. The IO layer calls raft_acknowledge after all IO tasks has been performed, still passing the same unchanged raft_tasks structure. IO must not call other functions before all tasks are complete. The Bridge layer will reset the task lists after acknowledgement, but can also choose to add more tasks to the structure, and if so, the IO layer must perform the tasks and call acknowledge again.

Reading files

The Bridge layer provides the file names and data blobs to write to or read from in the raft_file_task list. Files (or ENOENT situation) are delivered to Bridge using raft_file_read.

Reading file data

If the Bridge wants to read metadata or other files from disk, it will make tasks for this describing the file names. The function raft_file_read will be used to pass the file data.

If there are concerns about copying a lot of data in memory, data exchange could be implemented as fetch/put callbacks set in the tasks structure. Both methods could be used as appropriate. Data could also be composed of chunks and point to different locations in memory to avoid duplication.

----

I think this could be a very simple interface to use and, but of course it requires to move a lot of logic around (which would be required regardless of solution). When it comes to UV, this would be a built-in IO layer. It would then contain only low level stuff so that the existing raft logic can be re-used by other implementations.

freeekanayaka commented 7 months ago

@atlesn thanks for the detailed write-up. Right now I don't have enough time to really think deep about everything you wrote, I'll probably do it in the next few days. But it seems we're on the same page on several design ideas, and I believe what I'd like to do implement is in some way close to what you're suggesting.

I will look at this again as time permits.

cowsql / raft