Hoverbear / old-raft-rs

[Incomplete] A Raft implementation in Rust
https://hoverbear.github.io/raft-rs/raft/
MIT License
266 stars 41 forks source link

Create a log that persists to the filesystem. #129

Closed jcdyer closed 7 years ago

jcdyer commented 7 years ago

This adds a new FsLog struct that implements persistent_log::Log. The raft spec requires that certain information survive a server outage to maintain correct behavior. The memlog only maintains correctness if all servers remain alive. This should fix that.

I wrote some benchmarks for it as well. See results on https://github.com/jcdyer/raft-rs/pull/1.

@arthurprs provided helpful review suggestions.

Hoverbear commented 7 years ago

Cool! :)

Very happy to see a benchmark! Code quality seems quite good.

Is there a reason you didn't use Cap'n Proto's built in way to write instead of doing byteorder? I'd also be quite interested in moving the code to use Serde with Bincode or something.

arthurprs commented 7 years ago

I think it's a reasonable implementation considering the current state of the log trait (no gc, etc.). :+1:

jcdyer commented 7 years ago

@Hoverbear I haven't worked with Cap'n'Proto yet, so I just did what seemed natural (Which started with a manual function using bitshifting and masking, until @arthurprs suggested ByteOrder). The reason I didn't use a simple serde on the log struct as a whole is that I wanted to support partial rewrites, but it shouldn't be too hard to refactor it to serde encode the header and each log entry (as (LogIndex, Entry)) separately.

One caveat is that it would need to use a serialization format that produces a constant-sized header, so changing terms doesn't require rewriting the whole log. Does bincode do this? I know msgpack could produce variably sized integers, so that wouldn't fly.

arthurprs commented 7 years ago

I'm not a fan of capnproto when writing to disk. You eventually have to add your own header around it (length, checksum, etc..) and end up with byteorder again. It wouldn't carry it's weight in this simplified implementation anyway.

One caveat is that it would need to use a serialization format that produces a constant-sized header, so changing terms doesn't require rewriting the whole log. Does bincode do this? I know msgpack could produce variably sized integers, so that wouldn't fly.

What I'd recommend instead is to never to overwrite the log (or header), a new "header" is just a new entry. When a new log segment is created you always write the "header" as the first entry and the recovery time worst case is replaying the last segment until EOF or corrupt entry (checksum).

Rewriting headers is complicated, it usually costs more io and you have to worry about torn writes corrupting it (not the case here though).

Hoverbear commented 7 years ago

@jcdyer It's ok if you don't use capnp... I'd actually like to move to serde to be honest. :)

jcdyer commented 7 years ago

Cool. So before we merge this, what of the following would you like me to implement:

I think we could do some of it in later commits, though I also understand if you don't want to introduce a file format and then change it out from under users who might be depending on it.

Hoverbear commented 7 years ago

I think moving to Serde will be a later, separate, crate-wide thing, to be honest. The error handling as it stands seems okay.

Merging this seems reasonable. I do have one last question though. I notice there is a rewrite_entries function. Generally, at least to my understanding of the protocol, the Raft log is append-only. Is there something I'm missing about why this is needed?

(Forgive me if this is a stupid question, it's 7:00am)

jcdyer commented 7 years ago

@Hoverbear The log is not append only. The leader only ever appends entries, but the followers are required to rewrite entries when they receive an entry that has a conflicting term at the same index. This is detailed in Fig 2 / AppendEntries RPC / Receiver implementation / #3: "If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it."

Once an entry is committed it should be permanent in the log. Entries that have not been committed still need to be persisted

In the MemLog implemetation, the followers also rewrite their logs, but they do so unconditionally when they receive new entries (see append_entries, where there's a self.entries.truncate() call), which is contrary to the raft spec, but I think you've mitigated the problems that could arise from not waiting to find a conflicting entry by tracking message order, and discarding stale messages.

Hope that helps.

jcdyer commented 7 years ago

I would like to add a version number to the beginning of the log, to make it easier to upgrade if we decide to improve the format later. Just a simple 1u64 at the beginning, to be checked when initially opening the log file. So we don't try to read a later v2 log, and v2 implementations don't try to read a v1 log (or at least know that they're reading a v1 log.

Hoverbear commented 7 years ago

Thanks for that reasoning and you're entirely correct. :)

@arthurprs Any further requests or shall we merge?

arthurprs commented 7 years ago

Let's merge :)

Hoverbear commented 7 years ago

Thanks, @jcdyer =D