MystenLabs / sui

Ensure committed data cannot be lost. #3210

Open mystenmark opened 2 years ago

mystenmark commented 2 years ago

We currently do not use WriteOptions::set_sync (https://docs.rs/rocksdb/latest/rocksdb/struct.WriteOptions.html#method.set_sync), which means that if a validator host machine crashes, data could be lost. For instance, we could return a SignedTransactionEffects in response to handle_certificate, crash, and lose the data that was committed. There would now exist a cryptographic promise that the validator holds data that it does not.

Simply turning on WriteOptions::set_sync for every write may result in much lower performance.
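
Roughly, the per-write approach would look like this with the rust rocksdb crate (just a sketch, not code from our repo); every put then pays a full WAL fsync before returning:

use rocksdb::{WriteOptions, DB};

fn put_durable(db: &DB, key: &[u8], value: &[u8]) -> Result<(), rocksdb::Error> {
    let mut opts = WriteOptions::default();
    // Sync the WAL to disk before the write returns: this survives an OS or
    // machine crash, but every single write now pays the cost of an fsync.
    opts.set_sync(true);
    db.put_opt(key, value, &opts)
}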

An alternative would be to make db writes async, send them to another thread (as LockService currently does), issue the writes to the db in batches (not necessarily batches in the rocksdb sense), and call DBWithThreadMode::flush_wal(true) (https://docs.rs/rocksdb/latest/rocksdb/struct.DBWithThreadMode.html#method.flush_wal) before reporting success to the calling thread.

In this way, we can amortize the cost of the sync over many operations, while ensuring that we never respond to a request before all the data written in the course of handling has been synced to disk.
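
Roughly (hypothetical names, not the actual LockService code), the amortized approach could look like the following: writes are queued to a dedicated thread, applied as a batch, and acknowledged only after a single flush_wal(true) covering the whole batch:

use rocksdb::{WriteBatch, DB};
use std::sync::{mpsc, Arc};

// One queued write plus a channel to report completion; the sender is only
// signalled after the WAL has been synced to disk.
struct WriteRequest {
    key: Vec<u8>,
    value: Vec<u8>,
    done: mpsc::Sender<Result<(), String>>,
}

fn spawn_sync_writer(db: Arc<DB>) -> mpsc::Sender<WriteRequest> {
    let (tx, rx) = mpsc::channel::<WriteRequest>();
    std::thread::spawn(move || {
        // Block for the first request, then drain whatever else has queued up,
        // so one fsync is amortized over many writes.
        while let Ok(first) = rx.recv() {
            let mut pending = vec![first];
            while let Ok(req) = rx.try_recv() {
                pending.push(req);
            }
            let mut batch = WriteBatch::default();
            for req in &pending {
                batch.put(&req.key, &req.value);
            }
            let result = db
                .write(batch)
                .and_then(|_| db.flush_wal(true)) // sync the WAL once for the whole batch
                .map_err(|e| e.to_string());
            for req in pending {
                let _ = req.done.send(result.clone());
            }
        }
    });
    tx
}

A caller would send a WriteRequest and block on the done receiver before replying to the client, so success is never reported before the WAL is on disk.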

gdanezis commented 2 years ago

My understanding (please let's confirm) is that when the WAL is enabled (it is by default, and for us), all writes always go to the WAL, and the WAL is synced to avoid data loss. So I always interpreted set_sync as being about memtables rather than the WAL? Was I wrong?

mystenmark commented 2 years ago

I may be wrong - but my understanding is that the WAL is flushed but not synced with every write. This means that data is always in OS memory when the write returns, and hence safe against a crash of the validator process. But I do not believe that it is safe against a crash of the host OS or machine. I will investigate and confirm.
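
To illustrate the flushed-vs-synced distinction in plain std::fs terms (just an analogy, not RocksDB internals):

use std::fs::OpenOptions;
use std::io::Write;

fn append_durably(path: &str, data: &[u8]) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    // write_all only hands the bytes to the OS page cache: safe if the
    // process crashes, but lost if the machine crashes before writeback.
    file.write_all(data)?;
    // sync_all (fsync) forces the data onto stable storage, which is what
    // survives an OS or machine crash.
    file.sync_all()
}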

sadhansood commented 2 years ago

> I may be wrong - but my understanding is that the WAL is flushed but not synced with every write. This means that data is always in OS memory when the write returns, and hence safe against a crash of the validator process. But I do not believe that it is safe against a crash of the host OS or machine. I will investigate and confirm.

I think this is correct - the option (set_sync) is meant for syncing the WAL (and not the memtable) after every put.

> For instance, we could return a SignedTransactionEffects in response to handle_certificate, crash, and lose the data that was committed. There would now exist a cryptographic promise that the validator holds data that it does not.

This is a problem only if it happens concurrently on f+1 validators, because validators should be able to catch up from other validators on startup. Losing that many validators simultaneously in a large network should be very rare statistically (as long as validators are spread across different failure domains, i.e. most of them are not running out of us-east-1a, for example). Background WAL sync is also something that most replicated distributed databases do for performance (no blocking fsync in the request path) and cost (lower disk IOPS). We could potentially have a configuration to decide which db(s) and messages require stronger durability and turn WAL fsync on/off accordingly.

andll commented 2 years ago

I think it is ok to say that f+1 validators don't have byzantine behavior, but I would be careful about applying f+1 logic to validator crashes. More than f+1 validators may crash concurrently if, say, there is some bug in the validator code that is triggered at the same time. (And a bug in validator code can also cause an OS crash, so we can't rely on the reasoning that the data is safe once it is in kernel memory.)

sadhansood commented 2 years ago

I think such correlated failures are rare in practice but not impossible. One thing we could do is design the system to be configurable across two dimensions and run some experiments: 1) whether to wait for the write to be synced to disk or not (this controls safety and performance, i.e. request latency), and 2) how frequently the WAL gets synced (this controls IOPS), with sync manually invoked in the background.
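
Purely as an illustration (hypothetical names, not an existing config), the two knobs could look like:

/// Hypothetical durability knobs for a given db / message type.
pub struct WalDurabilityConfig {
    /// Dimension 1: block each request until its WAL entry is fsynced
    /// (stronger safety, higher request latency).
    pub sync_on_write: bool,
    /// Dimension 2: when not syncing per write, how often a background task
    /// calls flush_wal(true) (controls disk IOPS).
    pub background_sync_interval_ms: u64,
}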

mystenmark commented 2 years ago

I prefer to look at the question of byzantine validators in the following way:

  1. the system must tolerate up to f byzantine validators.
  2. an honest validator must not become byzantine because of bugs in the code.

This bug is about point 2, not point 1. If a validator (e.g.) signs a tx and loses the data, it is now byzantine (it will happily sign a conflicting tx).

sadhansood commented 2 years ago

I thought, based on the description in this issue, that we were talking about the handle_certificate codepath, where the WAL is written and synced. I don't think signing a transaction without persisting it locally (through locks in our implementation) is being questioned here, and hence a validator becoming byzantine is not possible based on the example above.

GrapeBaBa commented 2 years ago

@sadhansood I didn't see the WAL being synced before sending the response in handle_certificate in Narwhal a few weeks ago. Could you point me to the relevant code if I am missing something?

For a real production deployment, many distributed system implementations consider the case where all nodes crash. For example, in the etcd codebase, all followers not only write the log entries to OS pages but also call fsync to flush them to disk before sending a vote response.

Since etcd's raft implements a pipelining optimization, it may flush multiple log entries in one Save call.

func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
    w.mu.Lock()
    defer w.mu.Unlock()

    // short cut, do not call sync
    if raft.IsEmptyHardState(st) && len(ents) == 0 {
        return nil
    }

    mustSync := raft.MustSync(st, w.state, len(ents))

    // TODO(xiangli): no more reference operator
    for i := range ents {
        if err := w.saveEntry(&ents[i]); err != nil {
            return err
        }
    }
    if err := w.saveState(&st); err != nil {
        return err
    }

    curOff, err := w.tail().Seek(0, io.SeekCurrent)
    if err != nil {
        return err
    }
    if curOff < SegmentSizeBytes {
        if mustSync {
            return w.sync()
        }
        return nil
    }

    return w.cut()
}

// MustSync returns true if the hard state and count of Raft entries indicate
// that a synchronous write to persistent storage is required.
func MustSync(st, prevst pb.HardState, entsnum int) bool {
    // Persistent state on all servers:
    // (Updated on stable storage before responding to RPCs)
    // currentTerm
    // votedFor
    // log entries[]
    return entsnum != 0 || st.Vote != prevst.Vote || st.Term != prevst.Term
}

sadhansood commented 2 years ago

Interesting, thanks @GrapeBaBa, I'll take a look at the etcd codebase.

mystenmark commented 1 year ago

> I thought, based on the description in this issue, that we were talking about the handle_certificate codepath, where the WAL is written and synced. I don't think signing a transaction without persisting it locally (through locks in our implementation) is being questioned here, and hence a validator becoming byzantine is not possible based on the example above.

I believe it is - if we return a signed transaction and the host OS crashes before the data becomes durable, the validator is byzantine. The same is true when returning signed effects after execution.