
fix: further optimize archive workflow listing. Fixes #13601 #13819

Open · MasonM opened 3 weeks ago

MasonM commented 3 weeks ago

Fixes #13601

Motivation

Listing archived workflows can be slow if you have a very large number of workflows (100,000+) or the average workflow size is large (100KB), even after the optimizations from #13566 and #13779. This PR makes additional optimizations that speed up the listing queries by ~90% on MySQL and ~50% on PostgreSQL.

Modifications

The bottleneck for these queries depends on whether you use MySQL or PostgreSQL, and each required a different fix. For PostgreSQL, the bottleneck was detoasting overhead, as explained in https://github.com/argoproj/argo-workflows/issues/13601#issuecomment-2420499551. The fix was to use a common table expression (CTE) to reduce the number of times the workflow column needs to be detoasted, as suggested by @kodieg in https://github.com/argoproj/argo-workflows/issues/13601#issuecomment-2421794871. The new query looks like this:

WITH workflows AS (
  SELECT
    "name",
    "namespace",
    "uid",
    "phase",
    "startedat",
    "finishedat",
    coalesce(workflow->'metadata', '{}') as metadata,
    coalesce(workflow->'status', '{}') as status,
    workflow->'spec'->>'suspend' as suspend
  FROM "argo_archived_workflows"
  WHERE (("clustername" = $1 AND "namespace" = $2 AND "instanceid" = $3))
  ORDER BY "startedat" DESC
  LIMIT 100
) (
  SELECT
    "name",
    "namespace",
    "uid",
    "phase",
    "startedat",
    "finishedat",
    coalesce(metadata->>'labels', '{}') as labels,
    coalesce(metadata->>'annotations', '{}') as annotations,
    coalesce(status->>'progress', '') as progress,
    coalesce(metadata->>'creationTimestamp', '') as creationtimestamp,
    "suspend",
    coalesce(status->>'message', '') as message,
    coalesce(status->>'estimatedDuration', '0') as estimatedduration,
    coalesce(status->>'resourcesDuration', '{}') as resourcesduration
  FROM "workflows"
)
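To confirm the detoasting savings locally, one option is to compare EXPLAIN (ANALYZE, BUFFERS) output for the old and new query shapes. Below is a minimal sketch (not the exact query the server generates) of the pre-CTE shape, where every JSON expression dereferences, and therefore detoasts, the large workflow column separately; running it against a populated argo_archived_workflows table and comparing the timings and buffer counts with the CTE version above should show the difference:

-- Illustrative sketch of the pre-CTE query shape: each workflow->... expression
-- detoasts the large "workflow" JSONB value again, which the CTE above avoids.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  "name",
  "namespace",
  "uid",
  coalesce(workflow->'metadata'->>'labels', '{}') as labels,
  coalesce(workflow->'metadata'->>'annotations', '{}') as annotations,
  coalesce(workflow->'status'->>'progress', '') as progress,
  coalesce(workflow->'status'->>'message', '') as message
FROM "argo_archived_workflows"
ORDER BY "startedat" DESC
LIMIT 100;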

For MySQL, the bottleneck was the optimizer inexplicably refusing to use the argo_archived_workflows_i4 index and instead scanning the primary key, which is much more expensive. As explained by @Danny5487401 in https://github.com/argoproj/argo-workflows/issues/13563#issuecomment-2339660938, two ways of solving that are using FORCE INDEX or adding a composite index on (clustername, startedat). Using FORCE INDEX is slightly hacky, and adding a new index is needlessly wasteful when we already have argo_archived_workflows_i4, so I opted to modify that index to cover (clustername, startedat); a sketch of the index change follows the query below. The new query looks like this:

SELECT
  `name`,
  `namespace`,
  `uid`,
  `phase`,
  `startedat`,
  `finishedat`,
  coalesce(workflow->'$.metadata.labels', '{}') as labels,
  coalesce(workflow->'$.metadata.annotations', '{}') as annotations,
  coalesce(workflow->>'$.status.progress', '') as progress,
  coalesce(workflow->>'$.metadata.creationTimestamp', '') as creationtimestamp,
  workflow->>'$.spec.suspend',
  coalesce(workflow->>'$.status.message', '') as message,
  coalesce(workflow->>'$.status.estimatedDuration', '0') as estimatedduration,
  coalesce(workflow->'$.status.resourcesDuration', '{}') as resourcesduration
FROM `argo_archived_workflows`
WHERE ((`clustername` = ?  AND `namespace` = ? AND `instanceid` = ?))
ORDER BY `startedat` DESC
LIMIT 100
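The query itself is unchanged apart from benefiting from the index; the index change lives in a migration. As a rough sketch (the actual migration code in this PR may differ), modifying argo_archived_workflows_i4 to cover (clustername, startedat) amounts to:

-- Hedged sketch of the index change described above; the actual migration may differ.
-- Recreating argo_archived_workflows_i4 over (clustername, startedat) lets the optimizer
-- satisfy both the WHERE clause and the ORDER BY `startedat` DESC LIMIT 100 from one index,
-- instead of falling back to the primary key.
DROP INDEX `argo_archived_workflows_i4` ON `argo_archived_workflows`;
CREATE INDEX `argo_archived_workflows_i4` ON `argo_archived_workflows` (`clustername`, `startedat`);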

Verification

First, I used https://github.com/argoproj/argo-workflows/pull/13715 to generate 100,000 randomized workflows, with https://gist.github.com/MasonM/52932ff6644c3c0ccea9e847780bfd90 as a template.

Then, I ran make BenchmarkWorkflowArchive once on the main branch and once on this branch (with the migration applied), and used benchstat to compare the results.
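For reference, benchstat (from golang.org/x/perf) compares two sets of Go benchmark output, so the comparison looks roughly like this (file names are illustrative):

# Run on each branch; redirect the benchmark output so benchstat can diff it.
make BenchmarkWorkflowArchive > main.txt       # on the main branch
make BenchmarkWorkflowArchive > optimized.txt  # on this branch, with the migration applied
benchstat main.txt optimized.txt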