@MasonM This is on my TODO to test out and approve.
Not really related to this PR: it would be nice if the benchmark results were displayed on PRs as a comment so reviewers could more easily review the performance impact of a PR. I'm guessing it would take some effort to get this properly into CI, though.
I've mentioned having "performance regression tests" before (in https://github.com/argoproj/argo-workflows/pull/13566#issuecomment-2336734101 and elsewhere) that would effectively check for, say, a 10% or greater drop in perf (note that some variability should always be expected as well, as can happen even with the same code on the same machine). Automated comments get missed pretty easily (see the CodeCov comments that went unused in the past) and are non-blocking, so they seem insufficient on their own. An alternative would be something like #13661: run nightly and open an issue if there's a big enough drop.
@agilgur5 That's a good idea, but I don't think that's feasible using the GitHub-hosted action runners due to noisy neighbors. We frequently see spikes in I/O latency during jobs, which is why I increased the timeout in https://github.com/argoproj/argo-workflows/pull/13769. Those spikes are likely to affect benchmarks and cause them to falsely report large performance regressions.
There's not much we can do about noisy neighbors. If we had access to a self-hosted runner, then we'd have more options.
We do have that problem, but afaik most tests run within reasonable bounds, and we rarely increase the timeouts, usually only due to adding things to the tests or CI jobs themselves (like the new step to free up disk space that you mentioned in #13769).
So if the frequency is low, I think an occasional test flake that needs rerunning would be a worthwhile trade-off for preventing performance regressions.
Perhaps the worst case would be a test that only checks that the run time stays under 2x the average, which wouldn't catch smaller regressions but would still catch the worst offenders.
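For concreteness, here's a minimal Go sketch of that kind of coarse guard. Everything in it (the package name, the `listBaseline` value, and the `listArchivedWorkflows` helper) is a hypothetical placeholder rather than code from this repo; it only illustrates the "fail only on >2x the baseline" shape:

```go
package perfguard

import (
	"testing"
	"time"
)

// listBaseline is an assumed average run time from earlier runs; a real check
// would need this value measured and stored somewhere.
const listBaseline = 500 * time.Millisecond

// listArchivedWorkflows stands in for the real archive listing call so the
// sketch is self-contained.
func listArchivedWorkflows() error {
	time.Sleep(10 * time.Millisecond) // simulate work
	return nil
}

// TestListArchivedWorkflowsWithinBudget fails only on a gross (>2x) regression,
// leaving headroom for noisy-neighbor variance on shared runners.
func TestListArchivedWorkflowsWithinBudget(t *testing.T) {
	start := time.Now()
	if err := listArchivedWorkflows(); err != nil {
		t.Fatal(err)
	}
	if elapsed := time.Since(start); elapsed > 2*listBaseline {
		t.Errorf("took %v, which is more than 2x the %v baseline", elapsed, listBaseline)
	}
}
```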
A second problem I forgot to mention is ensuring the baseline results are generated by a runner with comparable performance characteristics to the runner used for the PR results. To be more concrete, imagine you implemented this by having a job run on commits to `main` that runs the benchmarks, then uploads the results as an artifact called something like `baseline_results.txt`. Then, a PR job will run the benchmarks, download `baseline_results.txt`, and compare the two with `benchstat`.
The problem is the runner used for `main` could have totally different CPU, I/O, and memory performance than the runner for the PR job. That means the results aren't comparable, and there's no good way of detecting when this happens, except by using self-hosted runners.
Here's an article I found with some numbers: https://aakinshin.net/posts/github-actions-perf-stability/
> Unfortunately, default GitHub Actions build agents do not provide a consistent execution environment from the performance point of view. Therefore, performance measurements from different builds can not be compared. This makes it almost impossible to set up reliable performance tests based on the default GitHub Actions build agent pool. ... CPU-bound benchmarks are much more stable than Memory/Disk-bound benchmarks, but the “average” performance levels still can be up to three times different across builds.
I found this: https://github.com/cncf/cluster
> The CNCF Community Infrastructure Lab (CIL) provides free access to state-of-the-art computing resources for open source developers working to advance cloud native computing. We currently offer access to both x86 and ARMv8 bare metal servers for software builds, continuous integration, scale testing, and demonstrations for the CNCF projects.
That would be perfect for this. @agilgur5 should I go ahead and submit a request?
Do you think the CNCF GitHub action runners would be consistent enough? https://contribute.cncf.io/resources/newsletters/2024-02-21-projectnewsletter/#github-action-runners
@Joibel They could be, but I don't know if https://github.com/argoproj/ is "part of the CNCF’s GitHub Enterprise account". Do you have access to https://github.com/argoproj/argo-workflows/settings/actions/runners? That should be the URL to manage self-hosted runners. It requires admin access to the repository.
> I don't know if https://github.com/argoproj/ is "part of the CNCF’s GitHub Enterprise account"
Not currently. Crenshaw (CD Lead) actually opened a Service Desk ticket to get access to them in late July: https://cncfservicedesk.atlassian.net/servicedesk/customer/portal/1/CNCFSD-2396 (the link is only accessible to maintainers). Summarizing: apparently CNCF hit some limits, which GH then increased; Argo was supposed to be added but still hasn't been, with Zach (Rollouts Lead) last asking for a follow-up in mid-Sept.
I've now followed up there as well with a gentle nag, partially so that I'll get a notification if the status changes.
Motivation
This adds basic benchmarks for listing and counting archived workflows so we can evaluate potential optimizations, e.g. https://github.com/argoproj/argo-workflows/issues/13601
Modifications
There weren't any other benchmarks in this repo that I could find, so these are the first. By default, `go test` doesn't run benchmarks unless you supply the `-bench` flag, so I added a `make Benchmark%` target for that.
I had to make some adjustments to `test/e2e/fixtures/persistence.go` so it exposed `WorkflowArchive` for these tests.
Due to https://github.com/stretchr/testify/issues/811, we can't directly use `testify` to run these, so I used a workaround.
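For readers unfamiliar with that testify limitation: the workaround amounts to a plain top-level `Benchmark*` function that drives `b.Run` sub-benchmarks itself instead of going through a testify suite. The sketch below only shows that shape; the interface, the fake archive, and the method signatures are simplified placeholders, not the actual code in this PR:

```go
package sketch

import "testing"

// workflowArchive is a trimmed-down stand-in for the WorkflowArchive exposed
// via test/e2e/fixtures/persistence.go; only the calls exercised below are included.
type workflowArchive interface {
	ListWorkflows(limit, offset int) error
	CountWorkflows() (int64, error)
}

// fakeArchive keeps the sketch runnable without a database; the real benchmark
// uses the persistence fixture instead.
type fakeArchive struct{ rows int64 }

func (f *fakeArchive) ListWorkflows(limit, offset int) error { return nil }
func (f *fakeArchive) CountWorkflows() (int64, error)        { return f.rows, nil }

// BenchmarkWorkflowArchive is a plain benchmark function that fans out into
// b.Run sub-benchmarks, since testify suites can't host benchmarks
// (stretchr/testify#811).
func BenchmarkWorkflowArchive(b *testing.B) {
	var archive workflowArchive = &fakeArchive{rows: 100000}

	b.Run("ListWorkflows", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if err := archive.ListWorkflows(100, 0); err != nil {
				b.Fatal(err)
			}
		}
	})

	b.Run("CountWorkflows", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if _, err := archive.CountWorkflows(); err != nil {
				b.Fatal(err)
			}
		}
	})
}
```

A `make Benchmark%` target for this presumably boils down to invoking `go test` with `-bench` set to the matching name (and a `-run` pattern that skips regular tests), though the exact flags in the Makefile may differ.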
Verification

Ran `make BenchmarkWorkflowArchive` locally after using https://github.com/argoproj/argo-workflows/pull/13715 to insert 100,000 rows into `argo_archived_workflows`: