**Open** · mystenmark opened this issue 2 years ago
My understanding (let's please confirm) is that when the WAL is enabled (it is by default, and for us), all writes always go to the WAL, and the WAL is synced to avoid data loss. So I always interpreted `set_sync` as being about memtables rather than the WAL. Was I wrong?
I may be wrong - but my understanding is that the WAL is flushed but not synced with every write. This means that data is always in OS memory when the write returns, and hence safe against a crash of the validator process. But I do not believe that it is safe against a crash of the host OS or machine. I will investigate and confirm.
> I may be wrong - but my understanding is that the WAL is flushed but not synced with every write. This means that data is always in OS memory when the write returns, and hence safe against a crash of the validator process. But I do not believe that it is safe against a crash of the host OS or machine. I will investigate and confirm.
I think this is correct - the `set_sync` option is meant for syncing the WAL (not the memtable) after every put.
> For instance, we could return a SignedTransactionEffects in response to `handle_certificate`, crash, and lose the data that was committed. There would now exist a cryptographic promise that the validator holds data that it does not.
This is a problem only if it happens concurrently on f+1 validators, because validators should be able to catch up from other validators on startup. Losing that many validators simultaneously in a large network should be statistically very rare (as long as validators are spread across different failure domains, i.e. most of them are not running out of us-east-1a, for example). Background WAL sync is also something that most replicated distributed databases do for performance (no blocking fsync in the request path) and cost (lower disk IOPS). We could potentially have a configuration to decide which db(s) and messages require stronger durability, and turn WAL fsync on/off accordingly.
I think it is ok to say that f+1 validators don't exhibit byzantine behavior, but I would be careful with applying f+1 logic to validator crashes. More than f+1 validators may crash concurrently if, say, there is some bug in validator code that is triggered at the same time. (And a bug in validator code can also cause an OS crash, so we can't apply the reasoning that if the data is in kernel memory then we are ok.)
I think such correlated failures are rare in practice but not impossible. One thing we could do is design the system to be configurable across two dimensions and run some experiments:

1. Whether to wait for the write to be synced to disk or not (this controls safety and performance, i.e. request latency).
2. How frequently the WAL gets synced (this controls IOPS), manually invoking sync in the background.
I prefer to look at the question of byzantine validators in the following way:
This bug is about point 2, not point 1. If a validator (e.g.) signs a tx and loses the data, it is now byzantine (it will happily sign a conflicting tx).
I thought based on the description in this issue we were talking about the `handle_certificate` codepath, where the WAL is written to and synced. I don't think signing a transaction without persisting locally (through locks in our implementation) is being questioned here, and hence a validator becoming byzantine is not possible based on the example above.
@sadhansood I didn't see the WAL being synced before sending the response in `handle_certificate` when I looked at the narwhal code a few weeks ago. Could you point me to the correct piece of code if I am missing something?
For a real production deployment, many distributed system implementations do consider the all-nodes-crash case. For example, in the etcd codebase, all followers not only write the logs to the OS page cache but also call fsync to flush the logs to disk before sending a vote response.
Since etcd raft implements a pipeline optimization, it may flush multiple logs in one `Save` call.
```go
func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
	w.mu.Lock()
	defer w.mu.Unlock()

	// short cut, do not call sync
	if raft.IsEmptyHardState(st) && len(ents) == 0 {
		return nil
	}

	mustSync := raft.MustSync(st, w.state, len(ents))

	// TODO(xiangli): no more reference operator
	for i := range ents {
		if err := w.saveEntry(&ents[i]); err != nil {
			return err
		}
	}
	if err := w.saveState(&st); err != nil {
		return err
	}

	curOff, err := w.tail().Seek(0, io.SeekCurrent)
	if err != nil {
		return err
	}
	if curOff < SegmentSizeBytes {
		if mustSync {
			return w.sync()
		}
		return nil
	}

	return w.cut()
}
```
```go
// MustSync returns true if the hard state and count of Raft entries indicate
// that a synchronous write to persistent storage is required.
func MustSync(st, prevst pb.HardState, entsnum int) bool {
	// Persistent state on all servers:
	// (Updated on stable storage before responding to RPCs)
	// currentTerm
	// votedFor
	// log entries[]
	return entsnum != 0 || st.Vote != prevst.Vote || st.Term != prevst.Term
}
```
Interesting, thanks @GrapeBaBa - I'll take a look at the etcd codebase.
> I thought based on the description in this issue we were talking about the `handle_certificate` codepath, where the WAL is written to and synced. I don't think signing a transaction without persisting locally (through locks in our implementation) is being questioned here, and hence a validator becoming byzantine is not possible based on the example above.
I believe it is - if we return a signed transaction, and the host OS crashes before the data becomes durable, the validator is byzantine. This is also true in the case of returning signed effects after execution.
We currently do not use `WriteOptions::set_sync` (https://docs.rs/rocksdb/latest/rocksdb/struct.WriteOptions.html#method.set_sync), which means that if a validator host machine crashes, data could be lost. For instance, we could return a SignedTransactionEffects in response to `handle_certificate`, crash, and lose the data that was committed. There would now exist a cryptographic promise that the validator holds data that it does not.

Simply turning on `WriteOptions::set_sync` for every write may result in much lower performance.

An alternative would be to make db writes async, send them to another thread (as LockService currently does), issue the writes to the db in batches (not necessarily batches in the rocksdb sense), and call `DBWithThreadMode::flush_wal(true)` (https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.flush_wal) before reporting success to the calling thread. In this way, we can amortize the cost of the sync over many operations, while ensuring that we never respond to a request before all the data written in the course of handling it has been synced to disk.