
robustness: reduce concurrency for HighTraffic scenario #18208

Open MadhavJivrajani opened 1 week ago

MadhavJivrajani commented 1 week ago

Which Github Action / Prow Jobs are flaking?

ci-etcd-robustness-main-amd64

Which tests are flaking?

TestRobustnessExploratoryKubernetesHighTrafficClusterOfSize1

Github Action / Prow Job link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-main-amd64/1803129487966081024

Reason for failure (if possible)

Linearization validation is timing out! We should decrease the number of concurrent clients; you can see a lot of concurrent LISTs being issued:

[Screenshot: a large number of concurrent LIST requests]

Anything else we need to know?

Let's try changing concurrency to 10 here: https://github.com/etcd-io/etcd/blob/00a609751aba3e3e4c3a59b5eaa58297ed2e0b87/tests/robustness/traffic/traffic.go#L48
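For reference, a minimal sketch of what the proposed change could look like, assuming the high-traffic profile is declared roughly as below; other than ClientCount (which the discussion confirms), the struct shape, field names, and QPS values are illustrative, not the exact code at the linked line:

```go
// Sketch only: the real definition lives in tests/robustness/traffic/traffic.go
// and may differ in fields and values.
package traffic

// Profile collects the traffic knobs discussed in this issue.
type Profile struct {
	MinimalQPS  float64 // illustrative
	MaximalQPS  float64 // illustrative
	ClientCount int     // number of concurrent traffic clients
}

// HighTrafficProfile is the knob the issue proposes to tune.
var HighTrafficProfile = Profile{
	MinimalQPS:  200,  // illustrative value
	MaximalQPS:  1000, // illustrative value
	ClientCount: 10,   // was 12; fewer concurrent clients keeps linearization within its timeout
}
```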

fykaa commented 1 week ago

I understand that reducing the number of concurrent clients might help with the timeouts. I'm curious, though: Is there a specific reason we initially set the ClientCount to 12? Could there be potential trade-offs or impacts on the test coverage and robustness if we reduce it to 10? I'm keen to understand the broader context before making changes.

MadhavJivrajani commented 1 week ago

GREAT question @fykaa 💯

Here's my understanding of it, but I'll let @serathius chime in to provide additional context that I may be missing.

I think 12 was arbitrary, I don't recall there being a specific reason for it.

I think wrt coverage we should be good in terms of the properties we want to test. It would've been a concern if we chose failpoints at random per client, but we choose them before we start simulating traffic, so we are good even in terms of the code we are hitting:

https://github.com/etcd-io/etcd/blob/00a609751aba3e3e4c3a59b5eaa58297ed2e0b87/tests/robustness/main_test.go#L77

Having higher concurrency is beneficial because it helps us test more realistic and complex histories from a linearization perspective, as well as how etcd handles the load. However, if we can make do with 10 clients, that validation is still good enough. Plus, the upside is that we can get a clearer signal out of the robustness tests if we reduce the flake rate caused by this specific timeout.
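For illustration of why the client count matters here: linearization validation is essentially a Porcupine check over the recorded client history, bounded by a deadline, and the amount of operation overlap drives how many interleavings the checker has to explore. A toy, self-contained sketch (a simple register model with hypothetical inputs, not etcd's actual model or history recording) could look like this:

```go
package main

import (
	"fmt"
	"time"

	"github.com/anishathalye/porcupine"
)

// registerInput is a toy operation description: a write of value, or a read.
type registerInput struct {
	write bool
	value int
}

func main() {
	// A trivial sequential model of a single register. The robustness suite
	// uses a much richer model covering etcd's KV, txn, lease and watch semantics.
	registerModel := porcupine.Model{
		Init: func() interface{} { return 0 },
		Step: func(state, input, output interface{}) (bool, interface{}) {
			st := state.(int)
			in := input.(registerInput)
			if in.write {
				return true, in.value // writes always succeed and set the state
			}
			return output.(int) == st, st // reads must observe the current state
		},
	}

	// Two concurrent clients; the Call/Return timestamps define the overlap
	// windows Porcupine has to reconcile. More concurrent clients means more
	// overlap, hence more interleavings to explore and a slower check.
	history := []porcupine.Operation{
		{ClientId: 0, Input: registerInput{write: true, value: 1}, Call: 0, Output: 0, Return: 10},
		{ClientId: 1, Input: registerInput{write: false}, Call: 5, Output: 1, Return: 15},
	}

	// Bounding the search with a timeout is what turns a slow check into the
	// flake discussed above: exceeding the deadline yields Unknown.
	result := porcupine.CheckOperationsTimeout(registerModel, history, 5*time.Minute)
	fmt.Println(result) // Ok, Illegal, or Unknown (timed out)
}
```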

fykaa commented 1 week ago

Thanks, @MadhavJivrajani! That makes a lot of sense. I appreciate the explanation about the trade-offs between higher concurrency and reducing flakiness.

Also, I am trying to understand: are there any other areas in the robustness tests where we might need to adjust configurations to balance test coverage and reliability?

MadhavJivrajani commented 5 days ago

@fykaa these are most of the knobs that exist: https://github.com/etcd-io/etcd/blob/9314ef760d889dcca8eed16d4abc107e5d84daca/tests/robustness/traffic/traffic.go#L33-L51

https://github.com/etcd-io/etcd/blob/9314ef760d889dcca8eed16d4abc107e5d84daca/tests/robustness/traffic/kubernetes.go#L35-L47

Feel free to assign the PR to me once you have it up.

serathius commented 5 days ago

Hey @MadhavJivrajani, I have been looking at the recent failures and found the reason for the linearization timeout. https://github.com/etcd-io/etcd/pull/18214 should already help a lot (it reduced linearization time from a timeout to 15s on my machine). I also have a draft that should help reduce it back to 6s. If everything works well we might not need to reduce the concurrency at all.

serathius commented 2 days ago

Or, based on today's robustness test results, I'm wrong. Sorry for the confusion, but results cannot be guaranteed until we make a change and wait a day for the CI to confirm it.

Linearization timed out in CI in 2 tests at 5 minutes: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-etcd-robustness-amd64/1805934408956383232

On my local machine it succeeded after 118s and 141s. Even with my latest ideas (https://github.com/etcd-io/etcd/compare/main...serathius:etcd:robustness-history-patching) I can only improve the time for one of the cases (141s -> 1s). The other case only drops from 118s to 116s, which is still expected to time out on CI :(

Based on the linearization results, I'm not surprised. Compaction caused watch events to not be delivered, so we don't know a lot of the revisions.

[Screenshot: linearization results]

serathius commented 2 days ago

I might try another approach: using the full etcd model to generate the revisions. I don't like the idea too much, since if the model has a bug this will make it much harder to debug.

MadhavJivrajani commented 1 day ago

@serathius

full etcd model to generate the revisions

Wouldn't this also increase the work porcupine has to do?

model has a bug this will make it much harder to debug.

+1

@fykaa do you want to send in a PR? Maybe we can let that bake in the CI for a couple of days to see how it works out, and then take a call on whether we should revert, keep it that way, or change something.

serathius commented 1 day ago

One request: before we make changes to the concurrency, I would like to remove all other flake sources. https://github.com/etcd-io/etcd/issues/18240

Then we can measure whether we are happy with the timeout rate and, if not, reduce the concurrency.

fykaa commented 27 minutes ago

Thanks for the detailed insights, everyone!

@MadhavJivrajani, I will go ahead and create a PR to reduce the client concurrency from 12 to 10 as discussed.

@serathius, I understand that other flake sources need to be addressed as well. I'll raise the PR now, and once you've resolved the other flakes, we can re-run the tests to get a clearer picture of the impact.

Thanks again for all the guidance!