canonical / raft

Unmaintained C implementation of the Raft consensus protocol

Unable to restart raft server #180

Closed. ntsh-gpt closed this issue 3 years ago.

ntsh-gpt commented 3 years ago

When I manually stop the raft server with an interrupt signal and then restart it, I see the following error:

raft_start(): io: load closed segment 0000000000000002-0000000000000013: entries batch 2 starting at byte 197: entries count 19005183 in preamble is too high.

The base of the raft_buffer passed in the raft_apply() call is dynamically allocated memory of 157 bytes.

Any help or pointers are appreciated.

MathieuBordere commented 3 years ago

Thanks, we'll investigate.

freeekanayaka commented 3 years ago

This looks like a bug that indeed should be investigated. However, it should be greatly mitigated if you implement a graceful shutdown in your code, by catching whatever signal you are sending to the process, then calling raft_close and waiting for its callback to fire before exiting the process. That's exactly what is done in the example code.
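
For reference, a minimal sketch of that shutdown pattern, assuming a libuv-based setup similar to the repository's example code; the handler names, the globals, and the elided raft/io/fsm setup are illustrative, not the example verbatim:

    /* Hedged sketch: catch SIGINT, call raft_close(), and only stop the
     * event loop once its callback has fired. Initialization of the raft
     * instance, its I/O backend and the FSM is elided. */
    #include <raft.h>
    #include <uv.h>

    static uv_loop_t loop;
    static uv_signal_t sigint;
    static struct raft raft;

    static void closeCb(struct raft *r)
    {
        (void)r;
        /* raft has finished its in-flight I/O; now it is safe to exit. */
        uv_stop(&loop);
    }

    static void sigintCb(uv_signal_t *handle, int signum)
    {
        (void)signum;
        uv_signal_stop(handle);
        /* Do not exit here: wait for closeCb before tearing anything down. */
        raft_close(&raft, closeCb);
    }

The handler would be registered before running the loop, e.g. with uv_signal_init(&loop, &sigint) and uv_signal_start(&sigint, sigintCb, SIGINT).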

ntsh-gpt commented 3 years ago

@freeekanayaka I am following the same approach as in the example code. However, I would like to point out the exact case that triggers the issue:

Working:

    struct A { int64_t x; int64_t y; };
    struct A data;
    struct raft_buffer buf;
    buf.len = sizeof(struct A);
    buf.base = raft_malloc(buf.len);
    // assign data values to buf.base
    raft_apply(&raft, req, &buf, 1, serverApplyCb);

Not working:

    FILE *f = fopen("path", "rb");
    fseek(f, 0, SEEK_END);
    buf.len = ftell(f);
    fseek(f, 0, SEEK_SET);
    buf.base = raft_malloc(buf.len);
    fread(buf.base, 1, buf.len, f);
    fclose(f);
    raft_apply(&raft, req, &buf, 1, serverApplyCb);

Note: I tried memcpy'ing a char array into buf.base; the same issue occurs.

tathougies commented 3 years ago

I believe I'm also hitting this bug. The idea of calling raft_close is useful, thanks, but this is a bit worrisome, and calls into question whether I should rely on this implementation. Raft is supposed to help with durability of log entries. That means that if my server dies due to power failure, I should be able to gracefully restore my machine. What is raft_close doing that can't be done after a raft_apply to make sure that the open segment can be read again? I'm happy to contribute to make things work, but I want to make sure that such a use case (power failure) is supposed to be supported.

freeekanayaka commented 3 years ago

> I believe I'm also hitting this bug. The idea of calling raft_close is useful, thanks, but this is a bit worrisome, and calls into question whether I should rely on this implementation. Raft is supposed to help with durability of log entries. That means that if my server dies due to power failure, I should be able to gracefully restore my machine. What is raft_close doing that can't be done after a raft_apply to make sure that the open segment can be read again? I'm happy to contribute to make things work, but I want to make sure that such a use case (power failure) is supposed to be supported.

We do have Jepsen tests that simulate hard process crashes (SIGKILL) and recovery after that; we should probably run those tests more times and see if we can reproduce this particular bug (or just try what @ntsh-gpt has pointed out).

As I said, this really looks like a bug that should be fixed, so the suggestion of using raft_close is just a mitigation; I don't mean that it is supposed to be a required procedure for the implementation to work.

ntsh-gpt commented 3 years ago

@MathieuBordere Will try out with #184.

MathieuBordere commented 3 years ago

@ntsh-gpt I deleted that comment; it was premature, #184 won't solve it.

freeekanayaka commented 3 years ago

> @ntsh-gpt I deleted that comment; it was premature, #184 won't solve it.

Yeah, #184 should only be relevant for snapshots. Closed segments should already be atomically created (by renaming an open segment).
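
For context, a hedged sketch of that general pattern (illustrative only, not the library's actual code): the open segment is finalized with a single rename(), so after a crash a reader finds either the old open file or the complete closed segment, never a half-written one.

    /* Illustrative sketch of closing a segment atomically by renaming it;
     * the helper name and the directory fsync are assumptions, not the
     * library's actual implementation. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int closeSegment(const char *dir, const char *openName,
                            unsigned long long first, unsigned long long last)
    {
        char openPath[1024];
        char closedPath[1024];
        int fd;

        snprintf(openPath, sizeof openPath, "%s/%s", dir, openName);
        /* Closed segments are named after the entry range they hold, e.g.
         * 0000000000000002-0000000000000013 as in the error above. */
        snprintf(closedPath, sizeof closedPath, "%s/%016llu-%016llu",
                 dir, first, last);

        /* rename() is atomic on POSIX file systems: readers see either the
         * old open segment or the complete closed one. */
        if (rename(openPath, closedPath) != 0) {
            return -1;
        }

        /* fsync the directory so the rename itself survives a power failure. */
        fd = open(dir, O_RDONLY | O_DIRECTORY);
        if (fd < 0) {
            return -1;
        }
        if (fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }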

MathieuBordere commented 3 years ago

@ntsh-gpt Are you hitting the same issue as in https://github.com/canonical/raft/issues/189? If yes, can I close?

stgraber commented 3 years ago

Closing for inactivity, assuming that the fix above took care of it. If not, let us know and we'll re-open.