serathius opened this issue 1 year ago
cc @tjungblu You were interested in etcd model performance.
I would like to help. @serathius would you mind assigning it to me? Thanks
Sure, thanks for offering. I did some simple investigation by collecting a pprof profile from the GitHub Action run in https://github.com/serathius/etcd/actions/runs/3901079392 (available in the run artifacts). Maybe it will be useful.
It showed that JSON serialization took ~27% of CPU time. However, one discrepancy is that it recorded only 40s from the 15m GitHub Action run. Either I didn't collect the profile correctly or there are other things that should be optimized.
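For reference, a minimal sketch of one way such a profile could be captured for the whole test run; the flag name, package name, and output path below are assumptions, not etcd's actual setup, and `go test`'s built-in `-cpuprofile` flag serves the same purpose for a single package. Note that a CPU profile only samples on-CPU time of the test process itself, which may explain seeing only ~40s of samples from a 15m wall-clock run that mostly waits on the etcd cluster.

```go
// Hypothetical sketch: wiring CPU profiling into a test binary via TestMain so
// the profile covers the entire run. Not etcd's actual code; the -cpu-profile
// flag and file handling here are illustrative only.
package robustness

import (
	"flag"
	"os"
	"runtime/pprof"
	"testing"
)

var cpuProfile = flag.String("cpu-profile", "", "write a CPU profile to this file")

func TestMain(m *testing.M) {
	flag.Parse()
	if *cpuProfile == "" {
		os.Exit(m.Run())
	}
	f, err := os.Create(*cpuProfile)
	if err != nil {
		panic(err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	code := m.Run() // run all tests while the profiler samples on-CPU time
	pprof.StopCPUProfile()
	f.Close()
	os.Exit(code)
}
```

The resulting file can then be uploaded as a CI artifact and inspected locally with `go tool pprof`.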
@fuweid feel free to pick this up. I spent some time last year moving porcupine to generics in: https://github.com/tjungblu/porcupine
That way you can skip the JSON serde, and without the casting it's a bit quicker in general. If the checking itself is taking longer due to an exponential growth of the search space, it's not going to help much however ;)
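As a rough illustration of the trade-off being described (not the actual etcd or porcupine code; the state type and field names are made up): cloning model state through a JSON round-trip versus copying a concrete struct value directly, which a generics-based checker makes possible.

```go
// Illustrative comparison only: JSON round-trip clone vs. plain value copy.
// The modelState type here is hypothetical.
package main

import (
	"encoding/json"
	"fmt"
)

type modelState struct {
	Key      string
	Value    string
	Revision int64
}

// jsonClone copies state the "simple but slow" way: serialize and deserialize.
func jsonClone(s modelState) modelState {
	data, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	var out modelState
	if err := json.Unmarshal(data, &out); err != nil {
		panic(err)
	}
	return out
}

// valueClone copies state as a plain value; with a generic checker the state
// can stay a concrete type, so no interface{} casting or serde is needed.
func valueClone[S any](s S) S {
	return s
}

func main() {
	s := modelState{Key: "k", Value: "v", Revision: 7}
	fmt.Println(jsonClone(s))
	fmt.Println(valueClone(s))
}
```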
> If the checking itself is taking longer due to an exponential growth of the search space, it's not going to help much however ;)
Overall you are right; however, with https://github.com/etcd-io/etcd/pull/15078 the chances that the search space grows exponentially were greatly reduced.
Interesting, I found one seemingly unrelated culprit: with the network proxy enabled, a 3-member cluster takes 8 seconds to shut down.
This should already help https://github.com/etcd-io/etcd/pull/15091
Still investigating the cause.
Also, the GitHub Actions page shows the total duration including pending time, which is not accurate. I wrote a tool to dump the job run durations and parsed the result here.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
@serathius Is the issue still valid after #15078 and #15242?
> @fuweid feel free to pick this up. I spent some time last year moving porcupine to generics in: https://github.com/tjungblu/porcupine
> That way you can skip the JSON serde, and without the casting it's a bit quicker in general.
Moving porcupine to generics seems like another area to reduce the runtime.
The issue is still valid, however I think we need to change the approach. The profiling I did looks incorrect. I don't think a rewrite to generics will help; we can make the code faster, but I don't think it will reduce the test runtime. I think we lack proper instrumentation of the tests to measure runtime and its contributing factors.
I'm thinking about collecting reports on how long each test took, similar to what we do for flakiness measurement. The tests should output a JUnit report that we can later aggregate over hundreds of runs and calculate some statistics from, instead of eyeballing the useless GitHub Actions page.
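A minimal sketch of the aggregation idea, assuming each CI run uploads one JUnit XML report (e.g. as produced by gotestsum) into a `reports/` directory; the directory layout and report shape are assumptions, not an existing setup.

```go
// Sketch: aggregate per-test durations across many JUnit XML reports and
// print the mean duration per test case. Assumes a
// <testsuites><testsuite><testcase .../></testsuite></testsuites> layout.
package main

import (
	"encoding/xml"
	"fmt"
	"os"
	"path/filepath"
)

type junitReport struct {
	TestCases []struct {
		Name string  `xml:"name,attr"`
		Time float64 `xml:"time,attr"`
	} `xml:"testsuite>testcase"`
}

func main() {
	totals := map[string]float64{}
	counts := map[string]int{}
	files, err := filepath.Glob("reports/*.xml") // one report per CI run (assumed layout)
	if err != nil {
		panic(err)
	}
	for _, path := range files {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		var report junitReport
		if err := xml.Unmarshal(data, &report); err != nil {
			continue
		}
		for _, tc := range report.TestCases {
			totals[tc.Name] += tc.Time
			counts[tc.Name]++
		}
	}
	for name, sum := range totals {
		fmt.Printf("%-60s mean %.1fs over %d runs\n", name, sum/float64(counts[name]), counts[name])
	}
}
```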
Thanks @serathius! I think it's a good idea to first narrow down whether a single test case takes the majority of the runtime or it's a universal pattern across the robustness tests. After that, we can use profiling to identify which code path is unexpectedly slow. Created an issue with background to track this ^
> I don't think a rewrite to generics will help. We can make the code faster, but I don't think it will reduce the test runtime.
It won't help that much; IIRC the difference was about 20% at best. There was a trade-off between copying the state at every step (JSON serde) and doing a plain memcopy of a struct (plus the casting overheads), especially if the culprit is slow GH Action CPUs.
+1 on the instrumentation though. I'm still a big fan of the pebble benchmark page: https://cockroachdb.github.io/pebble/?max=local
Maybe we can come up with something similar, obviously for test suite runtimes first, even though I expect significant noise/variance from the action runners.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
What would you like to be added?
There was a recent increase in the runtime of the linearizability tests.
Execution time almost doubled: for presubmit it went from ~18m to ~30m, and for nightly from ~2h to ~3h.
My main suspicion is the growing complexity of the etcd model. Based on my experience running the tests, recent improvements have reduced the variance of the run time, but they have increased the overall time.
So far I have completely ignored performance in favor of simplicity (for example, using JSON to serialize the model); however, we should start looking into the low-hanging fruit and fixing it.
My recommendation would be to start with profiling to identify places for improvement.
Why is this needed?
The increased run time has started causing failures in runs. 30 minutes for presubmit is also too long.