armadaproject / armada-operator

Apache License 2.0

Prevent Armada Operator and Armada Server from going into a preempted/Pending state #312

Closed PhilC1234 closed 4 months ago

PhilC1234 commented 4 months ago

Sometimes, when I submit multiple jobs to a queue, I see Armada components go into the Pending state. The Lookout UI and the scheduler stay frozen until some of the jobs complete.

Screenshot 2024-07-15 ArmadaPending

wilson-duan commented 4 months ago

I have the same issue: when I submit a lot of jobs, some of the Armada dependencies, such as Postgres and Pulsar, go into a Pending state, which can cause the Armada components to crash. My suspicion is that some of the dependencies or Armada components are being preempted. I hope this can be fixed soon, as it is a breaking issue.

dejanzele commented 4 months ago

Hi all,

Thanks for raising this issue.

In a production setup, you'd always run the Armada Control Plane in its own cluster, which would run only the Control Plane and no Executor.

The Executor should run in separate clusters, and those clusters should be primarily focused on batch workloads.

An even better setup would be to taint the system nodes, run the executor and other system components on the tainted nodes, and use affinity settings for batch workloads so they run on nodes which are reserved for batch workloads.
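
For anyone wondering what the taint plus affinity setup could look like, here is a rough sketch in Python that builds the relevant pod spec fields as plain dicts. The field names (`tolerations`, `nodeSelector`, `affinity.nodeAffinity`) are standard Kubernetes pod spec fields, but the taint key/value (`dedicated=armada-system`) and the node label (`workload-type=batch`) are made-up names for illustration, not something Armada or the operator defines. It also assumes the system nodes carry a label matching the taint. The `tolerates` helper is a simplified stand-in for the real scheduler check.

```python
import json

# Hypothetical taint and label names (assumptions, not from the Armada docs).
SYSTEM_TAINT = {"key": "dedicated", "value": "armada-system", "effect": "NoSchedule"}
BATCH_LABEL = {"key": "workload-type", "value": "batch"}


def system_pod_scheduling():
    """Pod spec fields for the executor and other system components."""
    return {
        # Tolerate the taint so these pods are allowed on the tainted system nodes.
        "tolerations": [{
            "key": SYSTEM_TAINT["key"],
            "operator": "Equal",
            "value": SYSTEM_TAINT["value"],
            "effect": SYSTEM_TAINT["effect"],
        }],
        # Pin them to the system nodes (assumes those nodes also carry this label).
        "nodeSelector": {SYSTEM_TAINT["key"]: SYSTEM_TAINT["value"]},
    }


def batch_pod_scheduling():
    """Pod spec fields for batch jobs: require nodes labelled for batch work."""
    return {
        "affinity": {"nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": [{
                    "key": BATCH_LABEL["key"],
                    "operator": "In",
                    "values": [BATCH_LABEL["value"]],
                }]}]
            }
        }}
    }


def tolerates(pod_fields, taint):
    """Simplified check: does any 'Equal' toleration match the taint exactly?"""
    return any(
        t.get("operator") == "Equal"
        and t.get("key") == taint["key"]
        and t.get("value") == taint["value"]
        and t.get("effect") == taint["effect"]
        for t in pod_fields.get("tolerations", [])
    )


if __name__ == "__main__":
    print(json.dumps(system_pod_scheduling(), indent=2))
    print(json.dumps(batch_pod_scheduling(), indent=2))
    # Batch pods carry no toleration, so the taint keeps them off system nodes.
    print(tolerates(batch_pod_scheduling(), SYSTEM_TAINT))
```

The idea is that the taint keeps batch pods off the system nodes, while the node affinity steers batch pods onto the nodes reserved for them, so a burst of jobs cannot crowd out the executor and other system components.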

dejanzele commented 4 months ago

I am closing this issue as resolved.