armadaproject / armada

A multi-cluster batch queuing system for high-throughput workloads on Kubernetes.
https://armadaproject.io
Apache License 2.0
480 stars 134 forks

Lookout DB Pruning Is Slow #4027

Open Sovietaced opened 5 hours ago

Sovietaced commented 5 hours ago

Describe the bug

We have enabled the Lookout DB pruner, and it takes a long time to delete rows: each batch of 100 jobs takes roughly 45 to 51 seconds (see the log below). We're currently working through a backlog of old jobs in production, and it looks like it will take ~4 days of DB pruning time to get through it.

INFO[2024-10-27T18:49:01.689Z]main.go:152 Pruning database                             
INFO[2024-10-27T18:49:01.719Z]main.go:104 expireAfter: 2160h0m0s, batchSize: 100, timeout: 4h0m0s 
INFO[2024-10-27T18:49:49.283Z]pruner.go:89 Deleted 100 jobs in 46.307731231s. Deleted 100 jobs out of 148715 
INFO[2024-10-27T18:50:39.451Z]pruner.go:89 Deleted 100 jobs in 50.167559246s. Deleted 200 jobs out of 148715 
INFO[2024-10-27T18:51:26.609Z]pruner.go:89 Deleted 100 jobs in 47.158301683s. Deleted 300 jobs out of 148715 
INFO[2024-10-27T18:52:14.788Z]pruner.go:89 Deleted 100 jobs in 48.178640246s. Deleted 400 jobs out of 148715 
INFO[2024-10-27T18:53:02.85Z]pruner.go:89 Deleted 100 jobs in 48.061498405s. Deleted 500 jobs out of 148715 
INFO[2024-10-27T18:53:51.521Z]pruner.go:89 Deleted 100 jobs in 48.671611175s. Deleted 600 jobs out of 148715 
INFO[2024-10-27T18:54:37.485Z]pruner.go:89 Deleted 100 jobs in 45.963408105s. Deleted 700 jobs out of 148715 
INFO[2024-10-27T18:55:28.562Z]pruner.go:89 Deleted 100 jobs in 51.076961904s. Deleted 800 jobs out of 148715 
INFO[2024-10-27T18:56:16.524Z]pruner.go:89 Deleted 100 jobs in 47.961472362s. Deleted 900 jobs out of 148715 
INFO[2024-10-27T18:57:04.41Z]pruner.go:89 Deleted 100 jobs in 47.886411577s. Deleted 1000 jobs out of 148715 
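As a rough cross-check of the timings in the log above, here is a minimal Python sketch that parses the "Deleted ... jobs" lines and projects the total pruning time for the backlog. The helper names (`parse_batches`, `project`) are hypothetical and not part of Armada. The projection assumes the per-batch rate stays constant, which may not hold in practice, and it treats the logged `timeout: 4h0m0s` as the maximum length of a single pruner run.

```python
import math
import re

# Message portion of the pruner log lines above (timestamps trimmed).
LOG = """\
Deleted 100 jobs in 46.307731231s. Deleted 100 jobs out of 148715
Deleted 100 jobs in 50.167559246s. Deleted 200 jobs out of 148715
Deleted 100 jobs in 47.158301683s. Deleted 300 jobs out of 148715
Deleted 100 jobs in 48.178640246s. Deleted 400 jobs out of 148715
Deleted 100 jobs in 48.061498405s. Deleted 500 jobs out of 148715
Deleted 100 jobs in 48.671611175s. Deleted 600 jobs out of 148715
Deleted 100 jobs in 45.963408105s. Deleted 700 jobs out of 148715
Deleted 100 jobs in 51.076961904s. Deleted 800 jobs out of 148715
Deleted 100 jobs in 47.961472362s. Deleted 900 jobs out of 148715
Deleted 100 jobs in 47.886411577s. Deleted 1000 jobs out of 148715
"""

PATTERN = re.compile(
    r"Deleted (\d+) jobs in ([\d.]+)s\. Deleted (\d+) jobs out of (\d+)"
)


def parse_batches(log):
    """Return a list of (jobs_deleted, seconds) per batch and the total job count."""
    batches = []
    total_jobs = 0
    for match in PATTERN.finditer(log):
        batches.append((int(match.group(1)), float(match.group(2))))
        total_jobs = int(match.group(4))
    return batches, total_jobs


def project(log, timeout_hours=4.0):
    """Project total pruning hours and the number of timeout-bounded runs.

    Assumes a constant deletion rate and that each run stops at the timeout.
    """
    batches, total_jobs = parse_batches(log)
    jobs_done = sum(jobs for jobs, _ in batches)
    seconds_spent = sum(seconds for _, seconds in batches)
    rate = jobs_done / seconds_spent           # jobs deleted per second
    total_hours = total_jobs / rate / 3600     # hours for the whole backlog
    runs = math.ceil(total_hours / timeout_hours)
    return rate, total_hours, runs


if __name__ == "__main__":
    rate, total_hours, runs = project(LOG)
    print(f"rate: {rate:.2f} jobs/s")
    print(f"projected pruning time: {total_hours:.1f} h")
    print(f"runs needed at a 4h timeout: {runs}")
```

Under these assumptions the observed rate is about 2 jobs per second, which works out to roughly 20 hours of cumulative pruning time, or about 5 runs if each run is cut off at the 4-hour timeout. How that maps to wall-clock days depends on how often the pruner is scheduled, which the issue does not state.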

I haven't looked at the query plans yet, but I'm assuming some of these queries lack indexes and a linear scan is being performed over several million rows. We're running on a db.r7g.large AWS RDS instance.

Sovietaced commented 5 hours ago

Hmm. Looking at the code now, it looks like the job_run table has an index on job id. The job table's primary key is job_id, so that should be indexed automatically.

I do notice that when the DB pruner runs, it pushes the CPU utilization of the DB instance past its 2 CPU max, so perhaps the pruning is causing CPU contention on our AWS RDS instance and slowing down all queries.

dejanzele commented 4 hours ago

@d80tb7 who is best to assist with Lookout?