etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0
47.36k stars 9.72k forks source link

Add hashKV checking to robustness test #18386

Open serathius opened 1 month ago

serathius commented 1 month ago

What would you like to be added?

https://github.com/etcd-io/etcd/pull/18274 proposes to change very delicate part of compaction to fix a correctness bug. There is high concern in this change potentially changing result of hashKV which can lead to cluster incorrectly marking itself as corrupted. We need to build a very solid confidence in hashKV calculation which regards to traffic and compaction.

Robustness tests seem like a good candidate to build that confidence even further, the main challenge in adding tests like in https://github.com/etcd-io/etcd/pull/18369 is fact that we cannot easily cover all ways requests (PUT/DELETE/TXN/Compaction) can interleave. With Robustness generating random requests it seems that we just need to start checking hashKV.

Robustness tests could be periodically requesting hashKV from all members and validating them for equality.

Why is this needed?

Would like to generalize testing for https://github.com/etcd-io/etcd/pull/18369.

serathius commented 1 month ago

cc @ahrtr @fuweid @henrybear327 @jmhbnz @siyuanfoundation @ah8ad3

ahrtr commented 1 month ago

Three cases I can think of,

But we need to exclude false alarm. The compaction is executed in the applying workflow/path, but hashKV is executed in the API layer. The CompactRevision must be the same in the HashKVResponse of all members (the same for HashRevision), otherwise it makes no sense to compare the Hash.

https://github.com/etcd-io/etcd/blob/739a9b60b07eaa519428a6c1f364381e20141c52/api/etcdserverpb/rpc.pb.go#L1634-L1638

ahrtr commented 1 month ago

@ArkaSaha30

henrybear327 commented 1 month ago

/assign @ArkaSaha30 /assign @henrybear327

We would like to work on this together as this also seems to be a good material for the talk that we might be preparing for!

k8s-ci-robot commented 1 month ago

@henrybear327: GitHub didn't allow me to assign the following users: ArkaSaha30.

Note that only etcd-io members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/etcd-io/etcd/issues/18386#issuecomment-2260788229): >/assign @ArkaSaha30 >/assign @henrybear327 > >We would like to work on this together as this also seems to be a good material for the talk that we might be preparing for! Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
ArkaSaha30 commented 1 month ago

/assign

fuweid commented 1 month ago

Sound good to me! Especially, it can help us to do sanity and compatible check on mix-versions cluster.

ah8ad3 commented 1 month ago

I don't know if anyone needs an extra pair of hands i'm here just ping me @ArkaSaha30 @henrybear327