Closed: PhilC1234 closed this issue 4 months ago
I have the same issue: when I submit a lot of jobs, some of the Armada dependencies, such as Postgres and Pulsar, go into a Pending state, which can cause the Armada components to crash. My suspicion is that some of the dependencies or Armada components are being preempted. I hope this can be fixed soon, as it is a seriously breaking issue.
Hi all,
Thanks for raising this issue.
In a production setup, you'd always run the Armada Control Plane in its own cluster, which would run only the Control Plane and no Executor.
Executors should be run in separate clusters, and those clusters should be focused primarily on batch workloads.
An even better setup would be to taint the system nodes and run the executor and other system components on those tainted nodes, and to use affinity settings for batch workloads so that they run only on nodes reserved for batch workloads.
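For anyone trying the taint and affinity approach, here is a rough sketch of what the two kinds of pod specs could look like, written as a small Python script that builds plain Kubernetes Pod manifests. The taint key `example.com/system-only`, the node label `example.com/node-pool`, and the pod and image names are all made up for illustration; they are not Armada defaults, and the way you attach these fields to Armada components or jobs depends on your own deployment.

```python
import json

# Hypothetical names, chosen only for this sketch (not Armada defaults).
SYSTEM_TAINT_KEY = "example.com/system-only"  # taint placed on system nodes
POOL_LABEL_KEY = "example.com/node-pool"      # label placed on every node
SYSTEM_POOL = "system"
BATCH_POOL = "batch"

# Assumed node setup, done once by a cluster admin, for example:
#   kubectl taint nodes <node-name> example.com/system-only=true:NoSchedule
#   kubectl label nodes <node-name> example.com/node-pool=system


def system_scheduling():
    """Scheduling fields for system pods (executor, Postgres, Pulsar, ...).

    The toleration lets the pod onto tainted system nodes; the nodeSelector
    pins it there so it never competes with batch jobs for capacity.
    """
    return {
        "tolerations": [
            {"key": SYSTEM_TAINT_KEY, "operator": "Exists", "effect": "NoSchedule"}
        ],
        "nodeSelector": {POOL_LABEL_KEY: SYSTEM_POOL},
    }


def batch_scheduling():
    """Scheduling fields for batch pods: required affinity to batch nodes.

    Batch pods carry no toleration, so the taint also keeps them off the
    system nodes.
    """
    return {
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": POOL_LABEL_KEY,
                                    "operator": "In",
                                    "values": [BATCH_POOL],
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }


def pod_manifest(name, image, system):
    """Build a minimal Pod manifest with the matching scheduling fields."""
    spec = {"containers": [{"name": name, "image": image}]}
    spec.update(system_scheduling() if system else batch_scheduling())
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(pod_manifest("batch-job-example", "busybox", system=False), indent=2))
```

The taint keeps ordinary batch pods off the system nodes, while the required node affinity keeps batch pods on the batch pool, so a burst of submitted jobs cannot take capacity away from the system components.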
I am closing this issue as resolved.
Sometimes, when I submitted multiple jobs to a queue, I saw Armada components go into a Pending state. The Lookout UI and the scheduler were frozen until some of the jobs completed.