etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Proposals should include a merkle root #13839

Open lavalamp opened 2 years ago

lavalamp commented 2 years ago

In response to #13766 and past issues (e.g. #11613) that get into the same condition:

It is possible for etcd to get into a condition where the databases on different replicas have different contents, and then keep committing changes. There have been other routes to this in the past. I don't think it's possible to prevent all db corruption issues, but it is certainly possible to make etcd stop and recover when the DB doesn't match.

The reason etcd can proceed is because the protocol has replicas agree on the contents of changes but not on the state of the db after applying the change. The easiest fix for this is to adjust things to compute a hash covering the entire db. This can be done efficiently (log N hashing operations per db change) by using a merkle tree. If proposed changes also included a proposed new merkle root hash, a replica with a differing db would not be able to accept a change, and this condition would be caught the instant it happened.
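For concreteness, here is a minimal Go sketch (my own illustration, not etcd code) of such a root computation: hash each key || value pair as a leaf, order the leaves deterministically, and combine them pairwise up to a single root. If the tree is kept in memory, changing one key only rehashes its leaf-to-root path, which is the ~log N hashing operations per change mentioned above.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// leafHash covers one key/value pair.
func leafHash(key, value []byte) [32]byte {
	h := sha256.New()
	h.Write(key)
	h.Write(value)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// merkleRoot recomputes the root over the whole (in-memory) store. A real
// implementation would keep the tree around so that a single change only
// rehashes one leaf-to-root path instead of rebuilding everything.
func merkleRoot(kv map[string][]byte) [32]byte {
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic leaf order

	level := make([][32]byte, 0, len(keys))
	for _, k := range keys {
		level = append(level, leafHash([]byte(k), kv[k]))
	}
	if len(level) == 0 {
		return sha256.Sum256(nil)
	}
	// Combine pairwise until a single root remains.
	for len(level) > 1 {
		next := make([][32]byte, 0, (len(level)+1)/2)
		for i := 0; i < len(level); i += 2 {
			if i+1 == len(level) {
				next = append(next, level[i]) // odd node carried up unchanged
				continue
			}
			pair := append(append([]byte{}, level[i][:]...), level[i+1][:]...)
			next = append(next, sha256.Sum256(pair))
		}
		level = next
	}
	return level[0]
}

func main() {
	root := merkleRoot(map[string][]byte{
		"/registry/pods/a": []byte("v1"),
		"/registry/pods/b": []byte("v2"),
	})
	fmt.Printf("proposed root: %x\n", root)
}
```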

(And it's recoverable at that point by reading the correct state from the other replicas. Moreover, depending on how you implement it, the merkle tree could be examined a layer at a time to find what is corrupted instead of needing to copy the entire state, which could be large.)
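A toy illustration of that layer-at-a-time recovery (the node layout here is assumed purely for the example): walk the healthy and suspect trees in lockstep, skip any subtree whose hashes match, and descend only where they differ, so just the corrupted key ranges need to be re-fetched.

```go
package main

import "fmt"

// node is an assumed merkle-tree node layout, for illustration only.
type node struct {
	hash     string  // hash of this subtree (a string for brevity)
	keyRange string  // label for the key range the subtree covers
	children []*node // nil for a leaf
}

// corruptedRanges walks two same-shaped trees in lockstep and returns the
// key ranges whose hashes disagree; matching subtrees are skipped entirely.
func corruptedRanges(healthy, suspect *node) []string {
	if healthy.hash == suspect.hash {
		return nil // whole subtree matches, nothing to copy
	}
	if len(healthy.children) == 0 {
		return []string{healthy.keyRange} // mismatching leaf: re-fetch this range
	}
	var out []string
	for i := range healthy.children {
		out = append(out, corruptedRanges(healthy.children[i], suspect.children[i])...)
	}
	return out
}

func main() {
	healthy := &node{hash: "root-ok", children: []*node{
		{hash: "h1", keyRange: "[/a, /m)"},
		{hash: "h2", keyRange: "[/m, /z)"},
	}}
	suspect := &node{hash: "root-bad", children: []*node{
		{hash: "h1", keyRange: "[/a, /m)"},
		{hash: "h2-corrupt", keyRange: "[/m, /z)"},
	}}
	fmt.Println(corruptedRanges(healthy, suspect)) // [[/m, /z)]
}
```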

The merkle tree technique is very common in cryptocurrencies. etcd is basically a cryptocurrency with no token and a non-byzantine-resistant coordination mechanism. These are both correct tradeoffs given what etcd does, but continuing to accept changes to a corrupted state is extremely bad -- any cryptocurrency with a corresponding bug would totally go to zero.

lavalamp commented 2 years ago

cc @jpbetz, whom I've tried to sell on this in the past :)

lavalamp commented 2 years ago

Looks like Joe previously proposed this in #10893, but that went stale :(

serathius commented 2 years ago

Yep, I pointed out the problem of that issue going stale in https://github.com/etcd-io/etcd/issues/13775 and wanted to propose implementing https://github.com/etcd-io/etcd/issues/10893 as part of graduating the corruption check in https://github.com/etcd-io/etcd/issues/9190.

Let's avoid tracking the same issue in multiple places. @lavalamp, do you want to propose implementing merkle trees as a graduation criterion in https://github.com/etcd-io/etcd/issues/9190 and close this one?

lavalamp commented 2 years ago

That sounds great to me! How do I do that?

ptabor commented 2 years ago

+1.

I see 3 options:

a) The merkle tree is integrated into bbolt, i.e. we can reliably ask bbolt for its checksum at each transactional state. This would guarantee bbolt-level consistency and is probably the strongest physical-consistency option we can get... but it requires changing the data-storage format and might have the biggest performance impact.

b) We somehow arbitrarily partition the etcd key-space into a merkle tree. It's hard to represent the tree structure of keys, since etcd is agnostic to it. For debugging it would be nice to preserve key-space continuity (i.e. neighbouring sorted keys frequently share the same merkle-tree node), but that seems difficult. If we hash each key individually and build the merkle tree on the hashes of keys (e.g. bit groups define the node structure), this would work but would not help isolate the problem for debugging.

c) We don't need a merkle tree. We maintain a hash of the snapshot plus a chain of subsequent hashes over all the proposals that get applied at the MVCC layer. This would guarantee determinism of the actions performed on MVCC, but an inconsistency could still originate from inside the MVCC implementation.
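A rough Go illustration of option (c) (my own sketch, not etcd's implementation): seed a chain with a hash of the snapshot, then fold each applied proposal into it. Replicas that apply the same proposals in the same order must agree on the chain value, but corruption that happens inside the MVCC store after apply would go unnoticed.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type chainHash [32]byte

// next folds one applied proposal into the running chain hash.
func (c chainHash) next(proposal []byte) chainHash {
	h := sha256.New()
	h.Write(c[:])
	h.Write(proposal)
	var out chainHash
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	// Seed the chain with a hash of the snapshot contents (placeholder here).
	chain := chainHash(sha256.Sum256([]byte("snapshot-at-rev-100")))
	// Fold each proposal applied at the MVCC layer, in apply order.
	for _, p := range [][]byte{[]byte("put /a v1"), []byte("del /b")} {
		chain = chain.next(p)
	}
	fmt.Printf("chain hash: %x\n", chain)
}
```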

serathius commented 2 years ago

> That sounds great to me! How do I do that?

Just leave a comment on https://github.com/etcd-io/etcd/issues/9190

lavalamp commented 2 years ago

re: a): I do not think it is a good idea to do this in the storage layer, since it has a bunch of stuff that is irrelevant to the state of the db at a given revision (e.g., it depends on what has been compacted). We should not make the etcd replicas' hash computation depend on anything historical; it should be stateless, purely a function of the db state at a given revision. That also permits e.g. efficiently un-corrupting a replica: it arrives at the correct state, but by an unconventional path.

re: b): I expect the easiest thing to do is a two step process:

  1. compute a key || value hash for every key
  2. put those in a separate patricia-merkle tree (e.g. from memory, bucket by prefix, hash each bucket, make a merkle tree from the bucket hashes; this gives the right properties)

Given an existing such structure and a hypothetical operation, it's easyish to figure out an expected root hash.
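For illustration, a rough Go sketch of those two steps (my own, with an assumed 256-way bucketing by key-hash prefix rather than any real etcd layout): a single put or delete lands in exactly one bucket, so the expected root for a proposed operation can be recomputed by rehashing that bucket and the small structure above it.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// rootOverBuckets: step 1 hashes every key || value pair; step 2 groups the
// leaf hashes into 256 buckets by the first byte of the key's hash, hashes
// each bucket, and folds the bucket hashes into a single root (a stand-in
// here for a proper merkle tree over the bucket hashes).
func rootOverBuckets(kv map[string][]byte) [32]byte {
	buckets := make([][][32]byte, 256)
	for k, v := range kv {
		leaf := sha256.Sum256(append([]byte(k), v...)) // step 1: key || value hash
		idx := sha256.Sum256([]byte(k))[0]             // bucket chosen by key-hash prefix
		buckets[idx] = append(buckets[idx], leaf)
	}

	root := sha256.New()
	for _, b := range buckets {
		// Deterministic order inside the bucket regardless of map iteration order.
		sort.Slice(b, func(i, j int) bool { return string(b[i][:]) < string(b[j][:]) })
		bh := sha256.New()
		for _, leaf := range b {
			bh.Write(leaf[:])
		}
		root.Write(bh.Sum(nil)) // step 2: each bucket hash feeds the root
	}
	var out [32]byte
	copy(out[:], root.Sum(nil))
	return out
}

func main() {
	fmt.Printf("%x\n", rootOverBuckets(map[string][]byte{"/a": []byte("1"), "/b": []byte("2")}))
}
```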

re: c): AFAICT if you do it that way, there's no way to efficiently un-corrupt the database. Additionally, even if you have a correct snapshot and a correct replay log, you can have a bug where a transaction gets applied wrong, and that doesn't get detected until the next snapshot is checked. This also requires all etcd replicas to take snapshots at the exact same time; I think that shouldn't be part of the state for correctness checking.

lavalamp commented 2 years ago

OK I commented there, if that's sufficient we can close this.

serathius commented 2 years ago

Based on comments above, there are non trivial decisions to be made, let's keep the issue open and continue the discussion on the design.

xiang90 commented 2 years ago

@ptabor @serathius

Thought about this a long time ago. Never got time to implement though :P Hope it still helps.

Step 1

Step 2

Step 3

Some other follow-up steps are needed to enable the incremental hash in a safe way. Initially, it should be just a checking mechanism and should not stop the cluster from functioning. As we build more confidence in this hash checking, we can start to rely on it for hard braking (halting the cluster), etc.

xiang90 commented 2 years ago

A related topic - we should keep the practice of running failure injection tests before any minor releases. (maybe we still do this today?) We used to run at least 3 clusters for about 1 month. It almost always catches inconsistency bugs :P.

jpbetz commented 2 years ago

> A related topic - we should keep the practice of running failure injection tests before any minor releases. (maybe we still do this today?) We used to run at least 3 clusters for about 1 month. It almost always catches inconsistency bugs :P.

+1. I vaguely remember that after one of the inconsistency bugs we found a few years back, we improved the injection testing to check for failures around some of the restart functionality, as a way of trying to prevent future occurrences of that class of issue.

serathius commented 2 years ago

+1 for more failure injection testing. Unfortunately, depending on manual testing leads to inconsistency in release qualification; we need to invest more in automation. I'm planning to write a public postmortem that will go into the actions we should take to prevent such issues in the future.

shalinmangar commented 2 years ago

How about automated testing based on Jepsen? Is that something that we run today regularly or before a release? I have some experience with it so I can help set it up.

lavalamp commented 2 years ago

Automated testing is good, but the problem is not just etcd bugs outside of our hypothesis space (which testing is already iffy at finding) -- it's actually not even sufficient for the etcd code to be 100% correct. I'm thinking in particular of RAM, disk, or network-based corruption. The real task is to have the replicas arrive at the same state in the face of this kind of error.

MagicStarTrace commented 2 years ago

etcd: v3.5.2

I added it to the startup options; how can I verify whether it takes effect (experimental-initial-corrupt-check)?


Does that mean 3.5.3 does not need the parameter "--experimental-initial-corrupt-check"?

Thank You!

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

serathius commented 1 year ago

This is proposed as a postmortem action item and planned for v3.7.

lavalamp commented 1 year ago

I've thought some more about this and I think significant gains can be had just from storing a hash of each {key,value} pair, so that corruption of individual values can be detected before building on them. This should also be simpler than, and a prerequisite for, a global hash of some sort anyway.
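A minimal sketch of that idea, using a hypothetical in-memory store rather than etcd's actual storage format: each write records sha256(key || value) next to the value, and every read re-verifies it, so a single corrupted value is caught before anything is built on top of it.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

type checkedValue struct {
	Value []byte
	Sum   [32]byte // sha256(key || value), computed at write time
}

// put stores the value together with its checksum.
func put(store map[string]checkedValue, key string, value []byte) {
	store[key] = checkedValue{Value: value, Sum: sha256.Sum256(append([]byte(key), value...))}
}

// get re-verifies the checksum before returning the value.
func get(store map[string]checkedValue, key string) ([]byte, error) {
	cv, ok := store[key]
	if !ok {
		return nil, errors.New("not found")
	}
	if sha256.Sum256(append([]byte(key), cv.Value...)) != cv.Sum {
		return nil, fmt.Errorf("corruption detected for key %q", key)
	}
	return cv.Value, nil
}

func main() {
	store := map[string]checkedValue{}
	put(store, "/a", []byte("v1"))
	v, err := get(store, "/a")
	fmt.Println(string(v), err)
}
```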

serathius commented 1 year ago

Instead of checking single KV pairs, I would look into adding hashes to bbolt pages. Bbolt implements a B+ tree, which should be extendable to include hash values like a merkle tree, allowing the consistency of the whole database to be checked.

lavalamp commented 1 year ago

The benefit of hashing every KV pair:

Hashing bbolt pages is better than what we do now, but I don't think it's guaranteed that every replica ends up with the same bbolt layout? Also on a hash failure, recovery is very complicated, no?