argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Improving Argo API/DB performance #13295

Open gyanprakash48 opened 4 days ago

gyanprakash48 commented 4 days ago

Pre-requisites

What happened/what did you expect to happen?

Currently, when a large number of workflows are running, the Workflows view of the Argo Workflows UI takes up to 12 seconds to load, and I can't find any way to optimise it further. CPU and memory usage of the server pod is normal. I am using Postgres to archive the workflows, and CPU and memory usage of the database is normal as well; I don't see spikes anywhere. So it looks like it's not resource contention but something else. What can be done to improve the load time?

Version

3.5.0, 3.5.6, 3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

It's UI load time (workflow view)

Logs from the workflow controller

It's UI load time (workflow view)

Logs from in your workflow's wait container

It's UI load time (workflow view)
gyanprakash48 commented 2 days ago

I tried to trace it, and it seems like the query below is the culprit. According to Query Performance Insight (Azure Postgres Flexible Server), this query's mean execution time is 9.304s, which explains the overall API time of 12-13 seconds:

select
    name,
    namespace,
    uid,
    phase,
    startedat,
    finishedat,
    coalesce((workflow::json)->'metadata'->>'labels', '{}') as labels,
    coalesce((workflow::json)->'metadata'->>'annotations', '{}') as annotations,
    coalesce((workflow::json)->'status'->>'progress', '') as progress,
    coalesce((workflow::json)->'metadata'->>'creationTimestamp', '') as creationtimestamp,
    (workflow::json)->'spec'->>'suspend' as suspend,
    coalesce((workflow::json)->'status'->>'message', '') as message,
    coalesce((workflow::json)->'status'->>'estimatedDuration', '0') as estimatedduration,
    coalesce((workflow::json)->'status'->>'resourcesDuration', '{}') as resourcesduration
from
    "argo_archived_workflows"
where
    (("clustername" = $1
        and "namespace" = $2
        and "instanceid" = $3)
    and "namespace" = $4
    and not exists (
    select
        1
    from
        argo_archived_workflows_labels
    where
        clustername = argo_archived_workflows.clustername
        and uid = argo_archived_workflows.uid
        and name = 'workflows.argoproj.io/controller-instanceid'))
order by
    "startedat" desc
limit 50
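
For reference, the per-node cost can be captured in Postgres by wrapping the query in EXPLAIN with the bind parameters substituted by hand. A minimal sketch (the literal values here are placeholders, not from this thread):

-- Minimal sketch; substitute real values for the $1..$4 bind parameters.
-- BUFFERS shows how many pages of the wide "workflow" column are read.
explain (analyze, buffers)
select
    name,
    startedat,
    coalesce((workflow::json)->'metadata'->>'labels', '{}') as labels
from
    argo_archived_workflows
where
    clustername = 'default'
    and namespace = 'argo'
order by
    startedat desc
limit 50;
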
gyanprakash48 commented 2 days ago

These JSON functions are the culprit:

coalesce((workflow::json)->'metadata'->>'labels', '{}') as labels,
coalesce((workflow::json)->'metadata'->>'annotations', '{}') as annotations,
coalesce((workflow::json)->'status'->>'progress', '') as progress,
coalesce((workflow::json)->'metadata'->>'creationTimestamp', '') as creationtimestamp,
(workflow::json)->'spec'->>'suspend' as suspend,
coalesce((workflow::json)->'status'->>'message', '') as message,
coalesce((workflow::json)->'status'->>'estimatedDuration', '0') as estimatedduration,
coalesce((workflow::json)->'status'->>'resourcesDuration', '{}') as resourcesduration
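
One plausible factor (assuming the column type is json, which stores raw text) is that every (workflow::json)->... expression re-parses the whole document, so eight extracted fields mean eight parses per row. A hypothetical single-parse variant, casting to jsonb once in a materialized CTE (Postgres 12+; filters omitted for brevity, and this is not the query Argo generates):

-- Hypothetical sketch: parse each document once, then extract the fields
-- from the already-decomposed jsonb value.
with parsed as materialized (
    select
        name, namespace, uid, phase, startedat, finishedat,
        workflow::jsonb as w
    from argo_archived_workflows
    order by startedat desc
    limit 50
)
select
    name, namespace, uid, phase, startedat, finishedat,
    coalesce(w->'metadata'->>'labels', '{}') as labels,
    coalesce(w->'metadata'->>'annotations', '{}') as annotations,
    coalesce(w->'status'->>'progress', '') as progress
from parsed;
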
driv commented 2 days ago

I've been testing the queries from the logs of 3.5.8, and I can see a huge difference between MySQL 5.7 and 8.

The queries' execution plans are different: on MySQL 8 they execute more than two orders of magnitude faster.

I'm attaching the queries' EXPLAIN output.

explain.57.md explain.8.md

MySQL 8 is able to use a range scan.
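
For anyone comparing the two versions, EXPLAIN FORMAT=JSON is available on both MySQL 5.7 and 8.0 and shows which access type the optimizer picks. A sketch with placeholder values:

-- Hypothetical sketch; the literal values are placeholders.
-- Look for "access_type": "range" in the output on MySQL 8.
explain format=json
select name, uid
from argo_archived_workflows
where clustername = 'default'
  and instanceid = ''
  and namespace = 'argo'
order by startedat desc
limit 50;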

gyanprakash48 commented 2 days ago

Hmm, I am using Postgres Flexible Server, so I'm not sure if that's comparable. What about storing these fields as generated columns in the table and using those columns in the API query, something like below? I tested it against the same data where the current query takes 9s; the revised query completes in milliseconds. Does it have any downside? The Get Workflows route is the most frequently used in the UI, and its slowness takes away all the charm of Argo Workflows (users always end up frustrated).

table DDL

CREATE TABLE argo_archived_workflows (
    uid varchar(128) NOT NULL,
    "name" varchar(256) NOT NULL,
    phase varchar(25) NOT NULL,
    "namespace" varchar(256) NOT NULL,
    workflow json NOT NULL,
    startedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
    finishedat timestamp DEFAULT CURRENT_TIMESTAMP NOT NULL,
    clustername varchar(64) NOT NULL,
    instanceid varchar(64) NOT NULL,
    labels text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') STORED,
    annotations text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'annotations') STORED,
    progress text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') STORED,
    creationtimestamp text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'creationTimestamp') STORED,
    suspend text GENERATED ALWAYS AS ((workflow::json)->'spec'->>'suspend') STORED,
    message text GENERATED ALWAYS AS ((workflow::json)->'status'->>'message') STORED,
    estimatedduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'estimatedDuration') STORED,
    resourcesduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'resourcesDuration') STORED,
    CONSTRAINT argo_archived_workflows_pkey PRIMARY KEY (clustername, uid)
);
CREATE INDEX argo_archived_workflows_i1 ON argo_archived_workflows USING btree (clustername, instanceid, namespace);
CREATE INDEX argo_archived_workflows_i2 ON argo_archived_workflows USING btree (clustername, instanceid, finishedat);
CREATE INDEX argo_archived_workflows_i3 ON argo_archived_workflows USING btree (clustername, instanceid, name);
CREATE INDEX argo_archived_workflows_i4 ON argo_archived_workflows USING btree (startedat);
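
For an existing installation, the same columns could presumably be added in place instead of recreating the table (stored generated columns require Postgres 12+). A sketch for one field:

-- Hypothetical in-place migration; repeat per field. Note that generated
-- columns are computed on write, so the JSON parse cost moves from reads
-- to inserts/updates, and the stored copies add disk space.
alter table argo_archived_workflows
    add column labels text
    generated always as ((workflow::json)->'metadata'->>'labels') stored;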

Revised Query

select
    name,
    namespace,
    uid,
    phase,
    startedat,
    finishedat,
    coalesce(labels,'{}') as labels,
    coalesce(annotations,'{}') as annotations,
    coalesce(progress,'') as progress,
    coalesce(creationtimestamp,'') as creationtimestamp,
    suspend,
    coalesce(message,'') as message,
    coalesce(estimatedduration,'0') as estimatedduration,
    coalesce(resourcesduration,'{}') as resourcesduration
from
    argo_archived_workflows
where
    (("clustername" = $1
        and "namespace" = $2
        and "instanceid" = $3)
    and "namespace" = $4
    and not exists (
    select
        1
    from
        argo_archived_workflows_labels
    where
        clustername = 'default'
        and uid = argo_archived_workflows.uid
        and name = 'workflows.argoproj.io/controller-instanceid'))
order by startedat desc
limit 45;
driv commented 2 days ago

Given that the query has a LIMIT of 50 and those fields are not used in filtering or sorting, I doubt they are responsible for the long query duration.

Can you share an EXPLAIN?

gyanprakash48 commented 2 days ago

In the beginning I thought that too, but trying different strategies to switch to a different explain plan wasn't helping at all. Then I commented out those JSON function columns in the select, and suddenly the query was super fast. Even though they are not part of sorting or filtering, these JSON documents are very big (if the workflow is large; at least that's the case in my situation).

gyanprakash48 commented 2 days ago

log

Maybe this recording will help further. I am sure this is only one of the scenarios (there must be many others).

I ran both variants twice (just to rule out concerns such as caching).

agilgur5 commented 11 hours ago

Then I commented out those JSON function columns in the select, and suddenly the query was super fast. Even though they are not part of sorting or filtering, these JSON documents are very big (if the workflow is large; at least that's the case in my situation).

Yeah, they can be very large, though I'm surprised it takes that long, since they're not part of the filter. I'm wondering if the engine is extracting the JSON multiple times, once for each field?
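
One way to test that hypothesis would be a micro-benchmark like the following (hypothetical, not from this thread): if runtime grows roughly linearly with the number of extracted fields, each expression is parsing the document independently.

-- One field extracted per row:
explain analyze
select (workflow::json)->'status'->>'progress'
from argo_archived_workflows;

-- Four fields extracted per row; compare the execution time to the above:
explain analyze
select
    (workflow::json)->'metadata'->>'labels',
    (workflow::json)->'metadata'->>'annotations',
    (workflow::json)->'status'->>'progress',
    (workflow::json)->'status'->>'message'
from argo_archived_workflows;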

I tested it for same data where currently it is taking 9 sec, revised query completing in milliseconds. does it have any downside ?

  labels text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'labels') stored,
  annotations text GENERATED ALWAYS as ( (workflow::json)->'metadata'->>'annotations') stored,
  progress text GENERATED ALWAYS AS ((workflow::json)->'status'->>'progress') stored,
  creationtimestamp text GENERATED ALWAYS AS ((workflow::json)->'metadata'->>'creationTimestamp') stored,
  suspend text GENERATED ALWAYS AS ((workflow::json)->'spec'->>'suspend') stored,
  message text GENERATED ALWAYS AS ((workflow::json)->'status'->>'message') stored,
  estimatedduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'estimatedDuration') stored,
  resourcesduration text GENERATED ALWAYS AS ((workflow::json)->'status'->>'resourcesDuration') stored,

We don't currently filter on those, so it's a little unnecessary as individual columns, but otherwise that makes sense to me if it's compatible with MySQL as well. I'm wondering if we can just use a materialized view instead for similar effect.
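
A minimal sketch of the materialized-view idea (hypothetical; the view name is illustrative). The main trade-off is that a materialized view is a snapshot, so newly archived workflows wouldn't appear until the next refresh:

-- Hypothetical materialized view over the archive table (Postgres syntax).
create materialized view argo_archived_workflows_summary as
select
    clustername, instanceid, uid, name, namespace, phase, startedat, finishedat,
    coalesce((workflow::json)->'metadata'->>'labels', '{}') as labels,
    coalesce((workflow::json)->'metadata'->>'annotations', '{}') as annotations,
    coalesce((workflow::json)->'status'->>'progress', '') as progress
from argo_archived_workflows;

-- A unique index is required for REFRESH ... CONCURRENTLY, which keeps the
-- view readable while it is being rebuilt.
create unique index argo_archived_workflows_summary_pk
    on argo_archived_workflows_summary (clustername, uid);

refresh materialized view concurrently argo_archived_workflows_summary;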

cc @jiachengxu @terrytangyuan @jessesuen