elixir-toniq / raft

An Elixir implementation of the raft consensus protocol

What to do with invalid messages? #9

Open keathley opened 6 years ago

keathley commented 6 years ago

Let's say we have a state machine like so:

defmodule Stack do
  # Only accepts binary (string) values; anything else falls through the guard.
  def handle_write({:put, str}, stack) when is_binary(str) do
    new_stack = [str | stack]
    # Reply with the new depth and return the updated stack as the new state.
    {Enum.count(new_stack), new_stack}
  end
end

We have a stack that expects string arguments (I have no idea why anyone would do this in real life but it should illustrate the point).

The problem is that we can send any message to the raft process like so: Raft.write(leader, {:put, 1}). That message will effectively kill the raft process, because the message is only applied to the user's state machine after it has been persisted to disk. The raft process will attempt to apply the message, crash (which also crashes the log), get restarted, attempt to apply the same log entry again, crash, and so on. Generally speaking, any incorrect message has the potential to corrupt the log.
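To make the failure concrete, the bad write and the resulting crash loop look roughly like this (the exact exception is just whatever the non-matching clause raises; the comments restate the sequence above):

# {:put, 1} fails the is_binary/1 guard, but the entry is persisted before it
# is applied, so it is already in the log when the apply step blows up.
Raft.write(leader, {:put, 1})

# Applying the committed entry then raises something like:
#   ** (FunctionClauseError) no function clause matching in Stack.handle_write/2
# The server restarts, replays the same committed entry, and crashes again.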

I think we should give users an "error handling" option. In some cases they may want to allow the exception to crash the raft process. In other cases they may want to log the error and simply "ignore" that message. My initial thought is to provide something like this:

defmodule Stack do
  # Option 1: log the error, keep the user state machine's current state, and move on.
  use Raft.StateMachine, on_error: :nothing

  # Option 2: raise and crash the raft process.
  # use Raft.StateMachine, on_error: :raise
end

What do y'all think?

bitwalker commented 6 years ago

Shouldn't they handle this with a default clause? It gives them the same choice and semantics, right? You can't discard messages from the log, so either you define that clause and discard the message (or whatever else you want to do with it), or you let things crash.
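For concreteness, the default clause being suggested could look something like this (the {reply, new_state} return shape is taken from @keathley's example above; the :invalid_command reply is just an illustration):

defmodule Stack do
  def handle_write({:put, str}, stack) when is_binary(str) do
    new_stack = [str | stack]
    {Enum.count(new_stack), new_stack}
  end

  # Catch-all clause: acknowledge and explicitly discard anything we don't
  # recognize, leaving the state untouched.
  def handle_write(_unknown, stack) do
    {{:error, :invalid_command}, stack}
  end
end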

keathley commented 6 years ago

In this scenario they can. If the error is inside the clause, though, a default clause won't help. Part of the problem with letting the process crash here is that the "bad" message has already been persisted to disk, so we can't let it crash and restart in a known good state. The underlying state would already be corrupted at that point.

bitwalker commented 6 years ago

Well, once the bad message is received it has effectively poisoned that node, right? Regardless of how the error is handled, it has already been persisted to the log, so basically the only way to proceed is to crash the node. If you ignore the error, you are making a judgement about the validity of that data. If you ignore it and it was actually a good message containing data you cared about, but just didn't recognize because of a mistake in the code, then further actions based on the absence of that data will truly corrupt the log.

It seems to me that in such cases, the only fix is to bring down the node until a fix is applied which makes an explicit decision about the message causing the crash. Once that decision is made, the cluster can continue. I think it's too dangerous to have a big "ignore all unrecognized messages" button; it is far too likely that a bug in the code will cause valid data to be dropped on the floor and result in hard-to-detect bugs, because nodes would be operating on effectively corrupted state. Better to have things all go to hell in a handbasket when a bug occurs than to hide it accidentally in an effort to remain available.

Let me know if I'm misunderstanding though, I may be thinking about this from the wrong angle.

NeilMenne commented 6 years ago

The problem here isn't that it has poisoned a single node; it's that it has sunk the whole raft cluster, since the command is only applied after it has been successfully replicated. There are at least two obvious potential failure modes for a bad command: overly defensive pattern matching (like @keathley's example) and implementation errors (i.e. an error in the actual application of the command to the state machine).

Since we're talking about both of these, one possibility is to default to @bitwalker's everything-must-die approach by re-raising here. A sufficiently documented, informative skip-over approach could then be added later as a configurable option.
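A minimal sketch of what that could look like inside the apply path, assuming the on_error option proposed above and a hypothetical apply_entry/4 helper (this is not the library's current API):

defmodule Raft.StateMachine.Apply do
  require Logger

  # Hypothetical helper: apply a committed entry to the user's state machine.
  # The default mirrors @bitwalker's everything-must-die approach; :nothing
  # logs the error and keeps the previous state.
  def apply_entry(module, command, state, on_error \\ :raise) do
    try do
      module.handle_write(command, state)
    rescue
      error ->
        case on_error do
          :raise ->
            reraise error, __STACKTRACE__

          :nothing ->
            Logger.error("failed to apply #{inspect(command)}: #{inspect(error)}")
            {{:error, :invalid_command}, state}
        end
    end
  end
end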

CrowdHailer commented 6 years ago

Is it possible to use the log replication in raft to store the changes to each state machine after the commands have been applied, rather than before? Then an invalid command would never get committed, because the commit step would only happen after the next state of the state machine was known.
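A rough, purely illustrative sketch of that ordering (every name here is invented for the sketch; it is not how this library works): the leader applies the command first and only hands a successfully computed next state to replication.

defmodule StateFirstSketch do
  # Hypothetical: apply the command before anything touches the log, so a
  # command that raises is rejected up front and never replicated.
  def write(module, command, current_state, replicate_fun) do
    try do
      {reply, next_state} = module.handle_write(command, current_state)
      # Replicate the resulting state rather than the command itself.
      :ok = replicate_fun.({:state, next_state})
      {:ok, reply, next_state}
    rescue
      error -> {:error, error}
    end
  end
end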

keathley commented 6 years ago

As we discussed in other forums, doing this would open up a ton of questions, to the point where you would need to create your own TLA+ proofs or something similar to show you still had the same consistency guarantees.