etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.36k stars 9.72k forks source link

HashKV should compute the hash of all MVCC keys up to a given revision #18337

Open fuweid opened 1 month ago

fuweid commented 1 month ago

Bug report criteria

What happened?

Based on https://etcd.io/docs/v3.5/dev-guide/api_reference_v3/, HashKV api should compute the hash of all MVCC keys up to a given revision, even if the key/value is tombstone. However, HashKV could skip compacted revision.

https://github.com/etcd-io/etcd/blob/e7f572914d79f0705b3dc8ca28d9a14b0f854d49/server/storage/mvcc/hash.go#L60-L76

For the compacted revision, the lower.GreaterThan(kr) always return true. Assume that the compacted revision is for key A. If the key A has revisions higher than compacted revision, the HashKV won't compute the hash on compacted revision.

Case 1: ETCD contains more than two keys

For example, there are two keys in store.

// key: "foo"
// modified: 9
// generations:
//    {{9, 0}[1]}
//    {{5, 0}, {6, 0}, {7, 0}, {8, 0}(t)[4]}
//    {{2, 0}, {3, 0}, {4, 0}(t)[3]}

// key: "foo2"
// modified: 2
// generations:
//    {{2, 1}[1]}

After compaction on Revision{Main: 5}, the store will be like

// key: "foo"
// modified: 9
// generations:
//    {{9, 0}[1]}
//    {{5, 0}, {6, 0}, {7, 0}, {8, 0}(t)[4]}

// key: "foo2"
// modified: 2
// generations:
//    {{2, 1}[1]}

When we perform HashKV on Revision{Main: 7}, the hash should compute the following revisions:

However, HashKV skips Revision{Main: 5} because the compacted revision is {Main: 5} and kvindex.Keep returns two revisions {Main: 2, Sub: 1} and {Main: 7}, The Revision{Main: 5} isn't in result of kvindex.Keep.

Case 2: ETCD contains only one key

There is key foo.

// key: "foo"
// modified: 9
// generations:
//    {{9, 0}[1]}
//    {{5, 0}, {6, 0}, {7, 0}, {8, 0}(t)[4]}
//    {{2, 0}, {3, 0}, {4, 0}(t)[3]}

After compaction on Revision{Main: 5}, the store will be like

// key: "foo"
// modified: 9
// generations:
//    {{9, 0}[1]}
//    {{5, 0}, {6, 0}, {7, 0}, {8, 0}(t)[4]}

When we perform HashKV on Revision{Main: 7}, the hash should compute the following revisions:

HashKV skips Revision{Main: 5} because the compacted revision is {Main: 5} and kvindex.Keep returns one revision {Main: 7}.

However, when we perform HashKV on Revision{Main: 8}, the kvindex.Keep will return empty keep because the Revision{Main: 8} is tombstone. So, the hash result will compute the following revisions, which is expectation.

What did you expect to happen?

HashKV should compute the hash of all MVCC keys up to a given revision

How can we reproduce it (as minimally and precisely as possible)?

https://github.com/fuweid/etcd/commit/a6454f2fcc13aa116086b4df907541d3942b3b00

$ cd server/storage/mvcc
$ go test -v -run TestHashKVImpactedByKeep -count=1 ./

Anything else we need to know?

The HashKV was introduced by #8263. IMO, the keep, available revision map, was used to skip the revisions which were marked deleted in on-going compaction. If there is no compaction, the keep is always empty and then HashKV will compute all the MVCC keys up to the revision.

8263 has bug and then #8333 introduced keyIndex.Keep interface to fix bug.

However, after #8333, the non-empty keep map forces HashKV to skip compacted revision.

Etcd version (please run commands below)

Etcd configuration (command line flags or environment variables)

# paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

```console $ etcdctl member list -w table # paste output here $ etcdctl --endpoints= endpoint status -w table # paste output here ```

Relevant log output

No response

ishan16696 commented 1 month ago

Value return by HashKV api call till x revision should be equal to the Hash of snapshot taken upto x revision (removing the appended hash). Is this correct @fuweid ?

IMO, this is not correct as HashKV api call calculates the hash of all MVCC key-values, whereas snapshot is snapshot of etcd db which also contains cluster information, hence the hash will not be same.

fuweid commented 1 month ago

Hi @ishan16696

the Hash of snapshot taken upto x revision

I don't follow this. Do you mean Hash API? If so, Hash will compute all MVCC keys, confState and other buckets in bbolt.

https://github.com/etcd-io/etcd/blob/e7f572914d79f0705b3dc8ca28d9a14b0f854d49/api/etcdserverpb/rpc.proto#L226-L238

ishan16696 commented 1 month ago

Hi @fuweid , What I mean by snapshot upto x revision is:

Than to validate the integrity of this snapshot, If I calculate the hash this snapshot taken via snapshot api call upto x revision(removing appended hash) , will it be same as value return by Hash API call ?

ishan16696 commented 1 month ago

Anyway, my doubt is different from this issue, I have now open a separate issue now: https://github.com/etcd-io/etcd/issues/18340

fuweid commented 1 month ago

Than to validate the integrity of this snapshot, If I calculate the hash this snapshot taken via snapshot api call upto x revision(removing appended snapshot)

The Snapshot uses sha256sum to compute the whole database.

https://github.com/etcd-io/etcd/blob/e7f572914d79f0705b3dc8ca28d9a14b0f854d49/server/etcdserver/api/v3rpc/maintenance.go#L145

However, both Hash and HashKV are using crc32.New(crc32.MakeTable(crc32.Castagnoli)). They are different.

will it be same as value return by Hash API call ?

Unfortunately, it's not.

ahrtr commented 1 month ago

Compaction is tightly coupled with HashKV, we have to consider/discuss them together. We need to get consensus on https://github.com/etcd-io/etcd/pull/18274#issuecomment-2223188506 firstly.

As mentioned in that comment, compatibility (ensure new version and old version generate the same hash value) is also a big concern. We need to leverage the cluster-level feature to resolve it.

Probably it would be better to turn that into a KEP "Consistent Compaction and HashKV". cc @ivanvc @jmhbnz @serathius @siyuanfoundation