Open atlesn opened 7 months ago
Do you need this only for testing?
In the v1.x version of this Raft library, this functionality is not needed (in particular, the RAFT_SUGGEST_SNAPSHOT
event is redundant, since you can just fire the RAFT_SNAPSHOT
event).
So I'm slightly hesitant to add this functionality.
See https://raft.readthedocs.io/core.html for details of the v1.x API, whose core part is already in place, but unfortunately still lacks the libuv integration part.
All that being said, if you just need this functionality for testing, perhaps you could for example manually set the threshold to 0 and then submit a barrier entry, to force triggering a snapshot. Or something along those lines.
Do you need this only for testing?
In the v1.x version of this Raft library, this functionality is not needed (in particular, the
RAFT_SUGGEST_SNAPSHOT
event is redundant, since you can just fire theRAFT_SNAPSHOT
event).So I'm slightly hesitant to add this functionality.
See https://raft.readthedocs.io/core.html for details of the v1.x API, whose core part is already in place, but unfortunately still lacks the libuv integration part.
All that being said, if you just need this functionality for testing, perhaps you could for example manually set the threshold to 0 and then submit a barrier entry, to force triggering a snapshot. Or something along those lines.
Yes, it's just for testing to verify that the state produced by the state machine is correct after snapshoting. I'll see if I can implement it like you suggest, without changing my high level test, although I'm a little confused about these ABI versions :)
Do you need this only for testing? In the v1.x version of this Raft library, this functionality is not needed (in particular, the
RAFT_SUGGEST_SNAPSHOT
event is redundant, since you can just fire theRAFT_SNAPSHOT
event). So I'm slightly hesitant to add this functionality. See https://raft.readthedocs.io/core.html for details of the v1.x API, whose core part is already in place, but unfortunately still lacks the libuv integration part. All that being said, if you just need this functionality for testing, perhaps you could for example manually set the threshold to 0 and then submit a barrier entry, to force triggering a snapshot. Or something along those lines.Yes, it's just for testing to verify that the state produced by the state machine is correct after snapshoting. I'll see if I can implement it like you suggest, without changing my high level test, although I'm a little confused about these ABI versions :)
Yeah, that's unfortunate indeed. In a nutshell:
v0.x is the API that we had so far and that is used by cowsql and dqlite. It contains functions like raft_start()
and raft_submit()
, and it's mainly callback-based. The v0.x API is complete and has an I/O backend based on libuv. It's very easy to consume the v0.x API since it has everything you need to get a project running.
v1.x is a new API with a more flexible design. The core part of the v1.x API is already completed, and in fact the v0.x API is now just a shim around the v1.x API (the shim code is contained in src/legacy.c
). However, the v1.x API does not yet have an I/O backend, so you can't actually use it for a project yet.
You can find some of the motivation of the new v1.x design here:
Hi
I find there's a lot of resistance when using the UV implementation for multiple reasons especially as the application I'm integrating raft into already has IO and event stuff in place. I will instead try using the v1.x ABI so this PR could be closed. I now nevertheless have a working test suite which is helpful when implementing the new ABI which would require a complete rewrite.
I can see already now that there are some documentation issues with the ABI v1.x, like incorrect information in comments above for instance for raft_step
, do you have any plans for this?
Hi
I find there's a lot of resistance when using the UV implementation for multiple reasons especially as the application I'm integrating raft into already has IO and event stuff in place. I will instead try using the v1.x ABI so this PR could be closed. I now nevertheless have a working test suite which is helpful when implementing the new ABI which would require a complete rewrite.
What event loop are you using? My short-term goal would be to use libuv to create an I/O backend for the v1.x API, but later on I'd like to build a io_uring-based backend.
I can see already now that there are some documentation issues with the ABI v1.x, like incorrect information in comments above for instance for
raft_step
, do you have any plans for this?
If you could find issues or open PRs for that, it'd be great, thanks. There is definitely work to do in order to better document the v1.x API. If you could point to me the parts where you are having troubles, that would be a good way for me to know where more information is needed.
What event loop are you using? My short-term goal would be to use libuv to create an I/O backend for the v1.x API, but later on I'd like to build a io_uring-based backend.
I am using libevent but also an abstraction layer on top of that.
A solution without UV could have been to have a protocol layer on top of v1.x with which you can exchange network data to/from other nodes and also disk I/O data. You would then only implement the sockets, disk I/O and event parts yourself. There could be functions like bytes = raft_write(char *buf, size_t bufsize, ...)
, bytes = raft_read(const char *buf, size_t bufsize)
, raft_disk_write(...)
etc. to move data in and out of the protocol. The stepping would happen as needed within the feeder functions and then you would need to get the next timer back somehow, maybe have a raft_need_step
function.
As a start maybe the network packet (en/de)coding could be moved out of the uv framework and afterwards the raft logic. It would then be pretty straight forward to make implementations with any event loop or even without event loop.
What are your thoughts on that?
If you could find issues or open PRs for that, it'd be great, thanks. There is definitely work to do in order to better document the v1.x API. If you could point to me the parts where you are having troubles, that would be a good way for me to know where more information is needed.
I will certainly do that.
What event loop are you using? My short-term goal would be to use libuv to create an I/O backend for the v1.x API, but later on I'd like to build a io_uring-based backend.
I am using libevent but also an abstraction layer on top of that.
A solution without UV could have been to have a protocol layer on top of v1.x with which you can exchange network data to/from other nodes and also disk I/O data. You would then only implement the sockets, disk I/O and event parts yourself. There could be functions like
bytes = raft_write(char *buf, size_t bufsize, ...)
,bytes = raft_read(const char *buf, size_t bufsize)
,raft_disk_write(...)
etc. to move data in and out of the protocol. The stepping would happen as needed within the feeder functions and then you would need to get the next timer back somehow, maybe have araft_need_step
function.As a start maybe the network packet (en/de)coding could be moved out of the uv framework and afterwards the raft logic. It would then be pretty straight forward to make implementations with any event loop or even without event loop.
What are your thoughts on that?
I'm don't entirely understand what you mean, but it seems I was thinking something similar. My plan was roughly to:
raft_step()
)I don't think it's entirely as straightforward as it seems, at least if we want to make everything pull-driven (i.e. no callbacks) and non-blocking.
I haven't thought about this hard enough yet though, so if you want to elaborate more on your ideas, I'm happy to hear them and perhaps get some inspiration.
If you could find issues or open PRs for that, it'd be great, thanks. There is definitely work to do in order to better document the v1.x API. If you could point to me the parts where you are having troubles, that would be a good way for me to know where more information is needed.
I will certainly do that.
Thanks!
I was thinking about the "pull" method as well and how this could be done, I've thought a bit more on how to implement that. I think the pulling you suggest is very straight forward! It could also be quite efficient as it allows for tight loops when performing misc tasks without long call trees. Let's hear your thoughts on this :)
Let's say we had the following layers:
Whenever Bridge is called from IO, the structure raft_tasks
is passed. The Bridge may populate this structure with IO and FSM tasks. The IO layer must perform the tasks immediately before acknowledging that the tasks are complete.
The new Bridge layer exposes these functions:
raft_net_read
delivers incoming network data from IO into Bridge. raft_file_read
gives any read file contents from IO to Bridge if that was requested in a task.raft_acknowledge
tells the Bridge that tasks has been performed.raft_stash
is to store a new message. Automatic forwarding to the leader could be implemented here.The IO layer knows nothing about the contents of network packets or file data. It simply exchanges data blobs with the Bridge layer and will create any network connections and open files as needed. On the other hand, the Brige layer does not know anything about network connections or file system.
I don't think the raft_file_read
can produce new tasks as it may be called multiple times before an acknowledgement if there are many file read tasks at once. In that case it is not passed the tasks structure or it can be passed as const.
As a thought flow, whenever data is received on a network connection, the function raft_net_read
from the Bridge layer is called. The implementer gets feedback from raft_net_read
on how many bytes were consumed e.g. when a whole frame or packet was processed.
raft_net_read
, after receiving a full packet, might call into the raft code using raft_step
or other functions and then checks for update flags. If any update requires IO or FSM application, tasks are created for this. raft_net_read
may be called with zero byte buffer in case of timeout or when initially starting up. raft_acknowledge
and raft_stash
my also call raft_step
.
The Bridge layer need not worry about any network connections. It simply informs the IO layer about which server ID data in the tasks is destined for. Incoming data is also distinguished using server ID. The IO layer hence needs to know which connections it has for which server ID. This also does not need to be actual connections if another means of transport is used.
If there are problems with sending outgoing network data, the IO layer just calls acknowledge like if the data was sent. The situation will be detected by other means by the Bridge layer. As an optimization, there could be a helper function in the bridge to check for which servers the IO layer needs to communicate with (e.g. which server is part of the cluster or is the leader).
The structure raft_tasks
is used to inform the IO layer about any IO tasks needing to be performed. The structure is passed as a pointer into raft_net_read
and other functions, and after each call, the IO layer must check if it was populated with any tasks. The structure has three arrays of types raft_net_task
, raft_file_task
and raft_fsm_task
. The IO layer iterates these lists and does any of the required IO or FSM operations described within.
Memory for the tasks is owned and handled by the Bridge layer and it may be re-used or arena-allocated. The IO layer calls raft_acknowledge
after all IO tasks has been performed, still passing the same unchanged raft_tasks
structure. IO must not call other functions before all tasks are complete. The Bridge layer will reset the task lists after acknowledgement, but can also choose to add more tasks to the structure, and if so, the IO layer must perform the tasks and call acknowledge again.
The Bridge layer provides the file names and data blobs to write to or read from in the raft_file_task
list. Files (or ENOENT situation) are delivered to Bridge using raft_file_read
.
If the Bridge wants to read metadata or other files from disk, it will make tasks for this describing the file names. The function raft_file_read
will be used to pass the file data.
If there are concerns about copying a lot of data in memory, data exchange could be implemented as fetch/put callbacks set in the tasks structure. Both methods could be used as appropriate. Data could also be composed of chunks and point to different locations in memory to avoid duplication.
I think this could be a very simple interface to use and, but of course it requires to move a lot of logic around (which would be required regardless of solution). When it comes to UV, this would be a built-in IO layer. It would then contain only low level stuff so that the existing raft logic can be re-used by other implementations.
@atlesn thanks for the detailed write-up. Right now I don't have enough time to really think deep about everything you wrote, I'll probably do it in the next few days. But it seems we're on the same page on several design ideas, and I believe what I'd like to do implement is in some way close to what you're suggesting.
I will look at this again as time permits.
To test my implementation of snapshoting, I added a function to suggest that the server performs a snapshot without needing to change the threshold to trigger snapshot to be taken.
I don't know if this is actually useful or not and I also don't know if we're supposed to add stuff to the legacy-interface. I haven't written any tests for this either.
Please comment on this whether or not this is something to continue working on.