MasonM closed this 4 weeks ago
I've marked the title as breaking here (fix!) as it is a schema change and requires a long migration, as you mentioned.
Fixes #13601
I also don't think this fully fixes #13601: it does not affect MySQL; jsonb is significantly faster than json in Postgres but can still take multiple seconds; and it doesn't resolve the issue as described, but rather another related performance issue brought up in the thread.
@agilgur5 Okay, I removed references to https://github.com/argoproj/argo-workflows/issues/13601 from the title and description. The reason I referenced it is because nearly all the comments in that issue (and the user reports) are about general performance issues with PostgreSQL, which this PR addresses, but you're right it doesn't address MySQL or the index optimization mentioned (CREATE INDEX argo_archived_workflows_i5 ON argo_archived_workflows (clustername, startedat)).
I did try benchmarking the index optimization, but the performance improvement was very small, which is why I didn't include it here:
$ benchstat before.txt after.txt
goos: linux
goarch: amd64
pkg: github.com/argoproj/argo-workflows/v3/test/e2e
cpu: 12th Gen Intel(R) Core(TM) i5-12400
│ before.txt │ after.txt │
│ sec/op │ sec/op vs base │
WorkflowArchive/ListWorkflows-12 185.9m ± ∞ ¹ 188.9m ± ∞ ¹ ~ (p=1.000 n=1) ²
WorkflowArchive/ListWorkflows_with_label_selector-12 199.3m ± ∞ ¹ 185.0m ± ∞ ¹ ~ (p=1.000 n=1) ²
WorkflowArchive/CountWorkflows-12 13.67m ± ∞ ¹ 13.24m ± ∞ ¹ ~ (p=1.000 n=1) ²
geomean 79.71m 77.35m -2.96%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
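For reference, here's a rough way to check whether Postgres would even use that index for a typical listing query (a sketch only; the SELECT below is illustrative and not the exact SQL the server generates):
CREATE INDEX argo_archived_workflows_i5 ON argo_archived_workflows (clustername, startedat);
-- then see whether the planner picks it up for the common "list newest workflows" pattern
EXPLAIN ANALYZE
SELECT uid, name, namespace, phase, startedat, finishedat
FROM argo_archived_workflows
WHERE clustername = 'default'
ORDER BY startedat DESC
LIMIT 100;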
Also, I'm pretty sure that's solely due to the JSON cast issue fixed in https://github.com/argoproj/argo-workflows/pull/13777 (and this PR). I spent over a week trying to reproduce that and I can't. I think it's best to focus on the issues we can reproduce first.
I did try benchmarking the index optimization, but the performance improvement was very small, which is why I didn't include it here:
It would still be better and less complex than a whole subquery IMO
I'm pretty sure that's solely due to the JSON cast issue fixed in #13777
You believe the cast itself is adding multiple seconds to a query? Certainly possible, but that would be even more surprising than the index misses or existing JSON inefficiencies seen with this issue in the past 😅
I think it's best to focus on the issues we can reproduce first.
Sure, I just don't think we should necessarily close out the issue until we've ironed out all the remaining loose ends, especially as I intentionally created the issue to cover everything left over after the previous ones.
You believe the cast itself is adding multiple seconds to a query? Certainly possible, but that would be even more surprising than the index misses or existing JSON inefficiencies seen with this issue in the past 😅
It slows it down by over 2x on my machine. But you don't have to take my word for it, because you can reproduce this yourself by following the instructions in the PR. Just revert the changes made to persist/sqldb/workflow_archive.go.
Here's the benchstat output I get:
$ benchstat before.txt after.txt
goos: linux
goarch: amd64
pkg: github.com/argoproj/argo-workflows/v3/test/e2e
cpu: 12th Gen Intel(R) Core(TM) i5-12400
│ before.txt │ after.txt │
│ sec/op │ sec/op vs base │
WorkflowArchive/ListWorkflows-12 181.4m ± ∞ ¹ 417.0m ± ∞ ¹ ~ (p=1.000 n=1) ²
WorkflowArchive/ListWorkflows_with_label_selector-12 181.6m ± ∞ ¹ 417.0m ± ∞ ¹ ~ (p=1.000 n=1) ²
WorkflowArchive/CountWorkflows-12 44.88m ± ∞ ¹ 11.59m ± ∞ ¹ ~ (p=1.000 n=1) ²
geomean 113.9m 126.3m +10.90%
¹ need >= 6 samples for confidence interval at level 0.95
² need >= 4 samples to detect a difference at alpha level 0.05
Sure, I just don't think we should necessarily close out the issue until we've ironed out all the remaining loose ends, especially as I intentionally created the issue to cover everything left over after the previous ones.
Okay, that's reasonable.
It slows it down by over 2x on my machine.
The cast alone? 👀 👀 As far as I could tell, here and in https://github.com/argoproj/argo-workflows/pull/13777 you mentioned both removing the cast and moving to JSONB
But you don't have to take my word for it, because you can reproduce this yourself by following the instructions in the PR
I'm pretty sure
You had also said "pretty sure" before, so I thought you hadn't narrowed that down yet
I am not sure how I feel about this change. Whilst I completely agree with the goal of speed, it changes the behaviour of archiving in subtle but surprising ways.
We would end up with two different behaviours depending upon whether you're using mysql or postgres. If you want a 'perfect archive' (which for some users may be something they care about, perhaps for compliance reasons), you must use mysql.
I can't see a mechanism whereby we would end up with bugs involving duplicate keys, but if we did, archived workflows in postgres would hide them after this change.
We should document this change because it's reasonable to expect the contents of an archived workflow to exactly match those of the workflow before archiving, and that won't be the case. Have you considered an option (which most people wouldn't use) to store and present the textual version, but query via jsonb?
According to this documentation, MySQL performs some kind of JSON normalization during insertion: https://dev.mysql.com/doc/refman/8.4/en/json.html
I also checked this on https://www.db-fiddle.com/. Maybe someone with MySQL access could verify it too. However, it seems that MySQL currently does not perform a perfect archive, and this differs from the behaviour of the current implementation in postgres.
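For example, something like this shows the normalization directly without inserting into any table (a sketch; exact output may vary by MySQL version):
SELECT CAST('{ "key2": 1, "key2": 2, "key1": 1}' AS JSON);
-- MySQL 8.x normalizes on the way in: whitespace is dropped, duplicate keys are
-- collapsed (last value wins), and keys are sorted, so this returns {"key1": 1, "key2": 2}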
The cast alone? 👀 👀 As far as I could tell, here and in #13777 you mentioned both removing the cast and moving to JSONB
The comment you linked (https://github.com/argoproj/argo-workflows/issues/13601#issuecomment-2376579589) said "I tested converting json to jsonb and observed times approximately 9s." I'm assuming this was measured with the cast present (i.e. the current behavior), since that's the only scenario where I could reproduce such a slowdown. There may be other scenarios, but there are too many variables involved for me to narrow it down further.
I am not sure how I feel about this change. Whilst I completely agree with the goal of speed, it changes the behaviour of archiving in subtle but surprising ways.
We would end up have two different behaviours depending upon whether you're using mysql or postgres. If you want a 'perfect archive' (which for some users may be something they care about, perhaps compliance reasons), you must use mysql.
As @kodieg correctly pointed out, MySQL is already stripping whitespace and eliminating duplicate keys, so this PR actually makes the behavior more consistent, not less. You can verify this yourself by following these steps:
Create a file test.sql with the following contents:
insert into argo_archived_workflows (uid, instanceid, name, namespace, phase, clustername, workflow) values ('0b3ec561-bc78-400c-9772-ce11c69b3d65', '', 'dg92nw8qr4k', 'argo', 'Succeeded', 'default', '{ "key2": 1, "key2": 2, "key1": 1}');
select workflow from argo_archived_workflows where uid='0b3ec561-bc78-400c-9772-ce11c69b3d65';
Run make start PROFILE=mysql, then make mysql-cli < test.sql. Output:
$ make mysql-cli < test.sql
GIT_COMMIT=c9f0376594aed5ae0ebc53d3bf2970300406860f GIT_BRANCH=feat-cli-db-generator2 GIT_TAG=untagged GIT_TREE_STATE=dirty RELEASE_TAG=false DEV_BRANCH=true VERSION=latest
KUBECTX=k3d-k3s-default DOCKER_DESKTOP=false K3D=true DOCKER_PUSH=false TARGET_PLATFORM=linux/arm64
RUN_MODE=local PROFILE=minimal AUTH_MODE=hybrid SECURE=false STATIC_FILES=false ALWAYS_OFFLOAD_NODE_STATUS=false UPPERIO_DB_DEBUG=0 LOG_LEVEL=debug NAMESPACED=true
kubectl exec -ti `kubectl get pod -l app=mysql -o name|cut -c 5-` -- mysql -u mysql -ppassword argo
Unable to use a TTY - input is not a terminal or the right kind of file
mysql: [Warning] Using a password on the command line interface can be insecure.
workflow
{"key1": 1, "key2": 2}
Run make start PROFILE=postgres, then make postgres-cli < test.sql once on the main branch and once with this migration applied. Output on main:
$ make postgres-cli < test.sql
GIT_COMMIT=c9f0376594aed5ae0ebc53d3bf2970300406860f GIT_BRANCH=feat-cli-db-generator2 GIT_TAG=untagged GIT_TREE_STATE=dirty RELEASE_TAG=false DEV_BRANCH=true VERSION=latest
KUBECTX=k3d-k3s-default DOCKER_DESKTOP=false K3D=true DOCKER_PUSH=false TARGET_PLATFORM=linux/arm64
RUN_MODE=local PROFILE=minimal AUTH_MODE=hybrid SECURE=false STATIC_FILES=false ALWAYS_OFFLOAD_NODE_STATUS=false UPPERIO_DB_DEBUG=0 LOG_LEVEL=debug NAMESPACED=true
kubectl exec -ti `kubectl get pod -l app=postgres -o name|cut -c 5-` -- psql -U postgres
Unable to use a TTY - input is not a terminal or the right kind of file
INSERT 0 1
workflow
-------------------------------------
{ "key2": 1, "key2": 2, "key1": 1}
(1 row)
Output with migration applied:
$ make postgres-cli < test.sql
GIT_COMMIT=c9f0376594aed5ae0ebc53d3bf2970300406860f GIT_BRANCH=feat-cli-db-generator2 GIT_TAG=untagged GIT_TREE_STATE=dirty RELEASE_TAG=false DEV_BRANCH=true VERSION=latest
KUBECTX=k3d-k3s-default DOCKER_DESKTOP=false K3D=true DOCKER_PUSH=false TARGET_PLATFORM=linux/arm64
RUN_MODE=local PROFILE=minimal AUTH_MODE=hybrid SECURE=false STATIC_FILES=false ALWAYS_OFFLOAD_NODE_STATUS=false UPPERIO_DB_DEBUG=0 LOG_LEVEL=debug NAMESPACED=true
kubectl exec -ti `kubectl get pod -l app=postgres -o name|cut -c 5-` -- psql -U postgres
Unable to use a TTY - input is not a terminal or the right kind of file
INSERT 0 1
workflow
------------------------
{"key1": 1, "key2": 2}
(1 row)
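For a quicker check of the same thing, the casts alone show the difference (a minimal sketch runnable in any psql session, using the same payload as test.sql):
SELECT '{ "key2": 1, "key2": 2, "key1": 1}'::json;
-- json stores the input text verbatim: { "key2": 1, "key2": 2, "key1": 1}
SELECT '{ "key2": 1, "key2": 2, "key1": 1}'::jsonb;
-- jsonb strips whitespace, keeps only the last value for a duplicate key, and reorders keys: {"key1": 1, "key2": 2}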
Ok, so if we're going to get this in for 3.6 it should have some notes as to its effect in docs/upgrading.md, and how to do it prior to upgrade.
and how to do it prior to upgrade.
Migrations happen automatically during upgrades, so it only needs a note that it will happen, not how to do it.
and how to do it prior to upgrade.
Migrations happen automatically during upgrades, so it only needs a note that it will happen, not how to do it.
Ah, sorry, that was unclear. I was referring to this in the original issue description:
The good news is this is fully backwards-compatible, and users can opt to manually run
alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb
in their DB before upgrading. We'll probably want to add a prominent warning about this.
@Joibel Sorry for the delay. I was researching ways that users with a very large number of workflows could do this migration with minimal downtime. I opened https://github.com/argoproj/argo-workflows/pull/13816 with my findings.
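For anyone planning to run the manual migration ahead of the upgrade, it boils down to something like this (a sketch assuming direct psql access to the archive database; \timing is just there to report the duration):
\timing on
ALTER TABLE argo_archived_workflows
  ALTER COLUMN workflow SET DATA TYPE jsonb USING workflow::jsonb;
-- note: this rewrites the whole table and holds an ACCESS EXCLUSIVE lock while it runs,
-- which is why very large archives may want the approach described in #13816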
Motivation
As explained in https://github.com/argoproj/argo-workflows/issues/13601#issuecomment-2420499551, I believe https://github.com/argoproj/argo-workflows/pull/12912 introduced a performance regression when listing workflows for PostgreSQL users due to TOAST issues. Reverting that PR would solve that particular issue, but it could re-introduce the memory issues mentioned in the PR description. Instead, this mitigates the impact by converting the workflow column to be of type jsonb.
Modifications
Initially workflow was of type text, and was migrated to json in https://github.com/argoproj/argo-workflows/pull/2152. I'm not sure why jsonb wasn't chosen, but based on this comment in the linked issue, I think it was simply an oversight. It's possible compatibility with older PostgreSQL versions was a concern. Support for jsonb was introduced in PostgreSQL 9.4, and PostgreSQL 9.3 became end-of-life on November 8, 2018. I don't think we should be concerned about breaking compatibility with EOL PostgreSQL versions, especially versions that have been EOL for nearly 6 years.
Here's what the relevant docs say about choosing between json and jsonb:
I'm pretty sure we don't care about key order or whitespace. We do care somewhat about insertion speed, but archived workflows are read much more frequently than written, so a slight reduction in write speed that gives a large improvement in read speed is a good tradeoff. As the benchmarks below show, this gives a ~75% performance improvement for calls to ListWorkflows().
The biggest problem with this change is the time needed for the migration. With my local DB populated with 100,000 records, it takes ~70s (see below). The good news is this is fully backwards-compatible, and users can opt to manually run alter table argo_archived_workflows alter column workflow set data type jsonb using workflow::jsonb in their DB before upgrading. We'll probably want to add a prominent warning about this.
Verification
Here are the steps to test this:
Use https://github.com/argoproj/argo-workflows/pull/13715 to generate 100,000 randomized workflows, with https://gist.github.com/MasonM/52932ff6644c3c0ccea9e847780bfd90 as a template:
Run the migration using the DB CLI:
Note: the upper: slow query error is harmless and doesn't impact the migration.
Compare results using benchstat:
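Separately, a quick sanity check that the migration actually changed the column type (not one of the original steps, just a suggested verification):
SELECT data_type FROM information_schema.columns
WHERE table_name = 'argo_archived_workflows' AND column_name = 'workflow';
-- expected after the migration: jsonb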