imliuda opened this issue 2 months ago
> there are many slow MySQL queries when running thousands of workflows in parallel, most of which are inserts into the argo_workflows table
> - add appropriate indexes to the archive_workflows and argo_workflows tables
If it's slow on inserts, I don't think an index would improve that performance. Otherwise it would be good to know what slow queries you see on what version of Argo
> - compress node statuses or the workflow before saving it to MySQL
We might be able to do this for `status.nodes` specifically, but other parts of the `status` for archived workflows are used during queries (see also #13203).
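For illustration, here is a minimal sketch (not the actual implementation in the PR mentioned later in this thread) of what gzip-compressing the serialized `status.nodes` before offloading could look like; the function names are hypothetical:

```go
package offload

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"io"
)

// compressNodes marshals an arbitrary nodes structure to JSON and gzips it
// before it is written to the offload/archive table.
func compressNodes(nodes interface{}) ([]byte, error) {
	raw, err := json.Marshal(nodes)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressNodes reverses compressNodes into the provided destination.
func decompressNodes(blob []byte, dest interface{}) error {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return err
	}
	defer zr.Close()
	raw, err := io.ReadAll(zr)
	if err != nil {
		return err
	}
	return json.Unmarshal(raw, dest)
}
```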
> - consider using redis or s3 to save node status or archived workflows
There's an existing feature request for blob storage: #4162. It would be insufficient for archived workflows though, given that they won't be queryable as a blob, as one of the comments lays out (as well as some workarounds). Similarly, I'm not sure Redis/Valkey or similar would suffice for archived workflows queries (it potentially could, but may require some transactions, denormalization, and/or manual indexes).
But both might suffice for status offloading 🤔
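For status offloading specifically, a rough sketch of what a Redis-backed store could look like, assuming go-redis v9 and a hypothetical `offload:<uid>:<version>` key scheme (this is not an existing Argo API):

```go
package offload

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// saveOffloadedNodes stores the (optionally compressed) nodes blob keyed by
// workflow UID and version, with a TTL so stale entries eventually expire.
func saveOffloadedNodes(ctx context.Context, rdb *redis.Client, uid, version string, blob []byte, ttl time.Duration) error {
	key := fmt.Sprintf("offload:%s:%s", uid, version) // hypothetical key layout
	return rdb.Set(ctx, key, blob, ttl).Err()
}

// loadOffloadedNodes fetches the blob back for a specific workflow version.
func loadOffloadedNodes(ctx context.Context, rdb *redis.Client, uid, version string) ([]byte, error) {
	return rdb.Get(ctx, fmt.Sprintf("offload:%s:%s", uid, version)).Bytes()
}
```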
> and daily archived workflows clean-up may take a very long time
We might want to add configurable workers for this, i.e. `--archive-ttl-workers`.
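A rough sketch of what such a worker pool could look like (the flag and the `deleteArchived` callback are hypothetical, not existing Argo code):

```go
package gc

import "sync"

// deleteExpiredArchivedWorkflows fans the expired-workflow UIDs out to a
// configurable number of workers instead of deleting them serially.
func deleteExpiredArchivedWorkflows(uids []string, workers int, deleteArchived func(uid string) error) {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for uid := range jobs {
				_ = deleteArchived(uid) // errors would be logged in a real implementation
			}
		}()
	}
	for _, uid := range uids {
		jobs <- uid
	}
	close(jobs)
	wg.Wait()
}
```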
> If it's slow on inserts, I don't think an index would improve that performance. Otherwise it would be good to know what slow queries you see on what version of Argo
I added the following index, and the workflows that errored with `Deadlock found when trying to get lock; try restarting transaction` disappeared. I guess the error is caused by the delete query right next to the successful insert SQL in the `OffloadNodeStatusRepo.Save()` method.
```sql
alter table argo_workflows add index `argo_workflows_i2` (`clustername`, `uid`, `version`, `updatedat`);
```
Hmm, looks like the `argo_workflows` table may not have indexes? #8860 added them for archived workflows (which do have different query patterns). Would you like to submit a PR to add them for offloaded Workflows?
> I guess the error is caused by the delete query right next to the successful insert SQL in the `OffloadNodeStatusRepo.Save()` method.
Yea that might make more sense since a deletion has to do a lookup. That specific deletion may be removed in #13286
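To make the discussion concrete, here is a rough schematic of the insert-then-delete pattern being described (not the actual `OffloadNodeStatusRepo.Save()` code; column names are inferred from the index above). Without an index covering `clustername`/`uid`/`version`, the trailing `DELETE` has to scan for the rows to remove, holding locks long enough to deadlock with concurrent inserts:

```go
package offload

import "database/sql"

// save is a schematic of the pattern: insert the newest version of the
// offloaded nodes, then delete superseded versions of the same workflow.
func save(db *sql.DB, clusterName, uid, version string, nodes []byte) error {
	if _, err := db.Exec(
		`INSERT INTO argo_workflows (clustername, uid, version, nodes, updatedat)
		 VALUES (?, ?, ?, ?, NOW())`,
		clusterName, uid, version, nodes,
	); err != nil {
		return err
	}
	// This DELETE is the lookup in question: without a covering index it scans,
	// and it is the deletion that may be removed in #13286.
	_, err := db.Exec(
		`DELETE FROM argo_workflows WHERE clustername = ? AND uid = ? AND version != ?`,
		clusterName, uid, version,
	)
	return err
}
```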
> Otherwise it would be good to know what slow queries you see on what version of Argo
This would still be useful -- your Server and Controller logs should show any slow queries
Argo version is v3.4.8; most of the slow queries are inserts.
When there are lots of slow queries, that delete may cause deadlocks or lock wait timeouts. I think https://github.com/argoproj/argo-workflows/pull/13286 will help; if that PR is merged, I think the secondary index above is not necessary. But if we remove that delete code, could table records grow continuously because the periodic GC is slower than the inserts?
@agilgur5 Anton, I opened a pull request that adds some code to compress nodes for both offloading and archiving. Can you help review it when you have time?
> Argo version is v3.4.8; most of the slow queries are inserts.
@imliuda when providing logs, please use text instead of images, as text is much more accessible.
Otherwise that's a great data point; I'm surprised `INSERT` is so significantly slower on so many queries. `DELETE` looks like it could be worthwhile to add an index to as well.
> I think https://github.com/argoproj/argo-workflows/pull/13286 will help; if that PR is merged, I think the secondary index above is not necessary
Mmm the index would still make the periodic GC deletes faster
> But if we remove that delete code, could table records grow continuously because the periodic GC is slower than the inserts?
Technically, I think that's always been possible.
I mentioned above that we could add `--archive-ttl-workers` to configure this. Maybe `--archive-ttl-frequency` too.
Oh, this is for offloading though. We might be able to add something similar, but I'm actually not familiar with the code for offloading's periodic GC or whether it works the same way as all other Controller GC (which generally has a configurable number of workers).
For this proposal, I see it was mentioned "consider using redis or s3 to save node status or archived workflows". Have we considered alternatives to SQL for offloading?
My company is facing similar issues as mentioned here, where we run super large-scale batch workflows which consume more than 50% of etcd under load. Having a performant alternative to etcd as Argo's stateful backing store would greatly benefit us. We also have an organizational aversion to relational databases, so I'm wondering if we've considered things like S3 or DynamoDB for offloading? These are both extremely performant over a network and might be easier to manage than a relational DB?
Summary
We enabled node status offloading and workflow archiving, and we have observed some performance and stability issues.
Use Cases
When running a large cluster, kube-apiserver may be the performance bottleneck, so we need offloading and archiving to reduce the load on kube-apiserver. Currently, they need some optimisation and improvement to be stable in a production environment.
Some thoughts