armadaproject / armada

A multi-cluster batch queuing system for high-throughput workloads on Kubernetes.
https://armadaproject.io
Apache License 2.0

Armada Scheduler Initialization Behavior Can Lead to OOMs #4018

Open Sovietaced opened 1 day ago

Sovietaced commented 1 day ago

During initialization, the Armada Scheduler pulls the entire lifetime of jobs and job runs from the database in order to build its in-memory state. As a result, memory consumption during startup grows in direct proportion to the number of jobs and job runs in the database. This is a particularly sneaky issue when doing a rolling deployment of Armada after a period of stable operation, because the Armada Scheduler ends up crash looping if its memory requests/limits are not high enough.

I understand there is pruning functionality to garbage collect jobs and job runs from the database, but the initialization behavior of the Armada Scheduler can be made much more efficient, since during initialization it only needs to materialize state for non-terminal jobs. Afterwards, the normal sync logic that checks for diffs based on the serial number is adequate. Ideally, with improvements to the initialization sync behavior, memory consumption should be much more predictable in most use cases.
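To make the idea concrete, here is a minimal sketch in Go of "load only non-terminal jobs at startup, then sync by serial". It is not Armada's actual code or the internal fix mentioned in this issue: `JobRow`, `memRepo`, `State`, `Initialize`, and `Sync` are hypothetical names, the in-memory slice stands in for the database, and job runs are left out for brevity. In a real database the non-terminal filter would presumably be pushed into the query rather than applied in Go.

```go
package main

import "fmt"

// JobRow is a hypothetical, simplified row; Armada's real schema differs.
type JobRow struct {
	ID       string
	Serial   int64 // change counter, increases on every write
	Terminal bool  // true once the job has finished for good
}

// memRepo is an in-memory stand-in for the jobs table (an assumption for this sketch).
type memRepo struct{ rows []JobRow }

// FetchNonTerminal returns only the rows still in flight, plus the highest
// serial in the whole table so that later syncs know where to resume.
func (r *memRepo) FetchNonTerminal() ([]JobRow, int64) {
	var live []JobRow
	var maxSerial int64
	for _, row := range r.rows {
		if row.Serial > maxSerial {
			maxSerial = row.Serial
		}
		if !row.Terminal {
			live = append(live, row)
		}
	}
	return live, maxSerial
}

// FetchSince returns every row written after the given serial.
func (r *memRepo) FetchSince(serial int64) []JobRow {
	var changed []JobRow
	for _, row := range r.rows {
		if row.Serial > serial {
			changed = append(changed, row)
		}
	}
	return changed
}

// Upsert simulates a database write: replace the row with the same ID, or append.
func (r *memRepo) Upsert(row JobRow) {
	for i := range r.rows {
		if r.rows[i].ID == row.ID {
			r.rows[i] = row
			return
		}
	}
	r.rows = append(r.rows, row)
}

// State is the scheduler's in-memory view: live jobs and the last serial seen.
type State struct {
	Jobs       map[string]JobRow
	LastSerial int64
}

// Initialize materializes state for non-terminal jobs only, so startup memory
// tracks the number of live jobs rather than the full job history.
func Initialize(repo *memRepo) *State {
	live, maxSerial := repo.FetchNonTerminal()
	s := &State{Jobs: make(map[string]JobRow, len(live)), LastSerial: maxSerial}
	for _, row := range live {
		s.Jobs[row.ID] = row
	}
	return s
}

// Sync applies the rows changed since LastSerial; jobs that have become
// terminal are dropped from memory.
func (s *State) Sync(repo *memRepo) {
	for _, row := range repo.FetchSince(s.LastSerial) {
		if row.Terminal {
			delete(s.Jobs, row.ID)
		} else {
			s.Jobs[row.ID] = row
		}
		if row.Serial > s.LastSerial {
			s.LastSerial = row.Serial
		}
	}
}

func main() {
	repo := &memRepo{rows: []JobRow{
		{ID: "a", Serial: 1, Terminal: true},
		{ID: "b", Serial: 2, Terminal: true},
		{ID: "c", Serial: 3, Terminal: false},
	}}
	s := Initialize(repo)
	fmt.Println(len(s.Jobs), s.LastSerial) // prints "1 3"

	repo.Upsert(JobRow{ID: "c", Serial: 4, Terminal: true})
	repo.Upsert(JobRow{ID: "d", Serial: 5})
	s.Sync(repo)
	fmt.Println(len(s.Jobs), s.LastSerial) // prints "1 5"
}
```

In this sketch the two finished jobs never enter memory at startup, and the resume point is taken from the whole table so the first incremental sync does not re-read history.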

Below is an example of repeated OOM kills until the scheduler's resources were increased. After that point you can see the memory hump during initialization.

[Image: Screenshot 2024-10-21 at 2 25 09 PM]

We have made a fix for this internally that I'm happy to upstream.

d80tb7 commented 1 day ago

Yes, that would be really useful. It's an issue we've been meaning to fix for a while now but haven't gotten round to. I'd be happy to look at a PR for this.

dejanzele commented 20 hours ago

Hey @Sovietaced,

Thanks for raising this issue!

We are looking forward to your PR.

Feel free to reach out if you need any support.

Sovietaced commented 11 hours ago

I'll have something up in a few days, gotta polish it up :)