dangoncalves opened this issue 1 year ago
You can set max concurrent jobs on a particular instance group; see the max concurrent jobs field.
Pending jobs don't really take up resources and probably don't need limits. If you are concerned about API requests in general, you would need to add some client-side throttling or scale up the web containers. Job creation isn't that expensive to begin with. We've also optimized the task manager so it does not re-process pending jobs each time it runs.
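For illustration, setting that field through the REST API looks roughly like this. This is only a sketch: the URL, token, and instance group ID 2 are placeholders, not values from this thread.

```python
import requests

# Placeholder values for illustration only.
AWX_URL = "https://awx.example.com"
HEADERS = {"Authorization": "Bearer REPLACE_WITH_ADMIN_TOKEN"}

# Cap how many jobs this one instance group will run at once.
resp = requests.patch(
    f"{AWX_URL}/api/v2/instance_groups/2/",
    headers=HEADERS,
    json={"max_concurrent_jobs": 200},
)
resp.raise_for_status()
print(resp.json()["max_concurrent_jobs"])
```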
Hello @fosterseth
Thanks for your attention. Unfortunately, this feature does not consider the global number of running jobs when you have several container groups.
For example, take a cluster with 600GB of memory and 2 container groups: the first handles "normal" jobs (with a memory limit of 500MB) and the second handles "big memory" jobs (with a memory limit of 3GB). Currently, if we want to cap the platform at 200 jobs, we have to set arbitrary limits on the two container groups so that their sum equals 200. But we may have 200 "big memory" jobs and 0 "normal" jobs, or 0 "big memory" jobs and 200 "normal" jobs, or anything between these extremes.
Considering this example, we need both a global limit for the platform and a limit per container group.
@kdelee @rebeccahhh any thoughts on a global max concurrent jobs? e.g. if there are 10 different container groups, currently we can only set max concurrent jobs on each container group individually, not across all of them
@dangoncalves I am understanding that you have 2 container groups because you want 2 different pod specs for the jobs (see related issue https://github.com/ansible/awx/issues/12019 ). Is that true?
Are both container groups pointed at the same namespace in the kubernetes cluster?
If so, I think you could implement what you want with a Resource Quota on the Kubernetes side. If the resource quota says max memory usage is X or max number of running pods is Y, and a new job would exceed it, the kube API will reject the request to start the job and the controller puts it back in pending.
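For illustration, a quota like that could be created with the kubernetes Python client. This is only a sketch: the namespace name awx-jobs and the 200-pod / 600Gi numbers are assumptions drawn from the example above, not from a real deployment.

```python
from kubernetes import client, config

# Assumption: both container groups point at the same "awx-jobs" namespace.
config.load_kube_config()
core_v1 = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="awx-job-limit"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "pods": "200",             # at most 200 non-terminal pods at once
            "limits.memory": "600Gi",  # total memory limit across job pods
        }
    ),
)
core_v1.create_namespaced_resource_quota(namespace="awx-jobs", body=quota)
```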
I know @AlanCoding has criticized this in the past as it does lead to some thrashing in the task manager.
If using the Resource Quota does not work, and if the AWX devs agree that a global maximum of running jobs is desirable, I think it would be relatively straightforward to implement.
You could add a method to https://github.com/ansible/awx/blob/2529fdcfd7d5fff5ef46328246d37ce869468ac1/awx/main/scheduler/task_manager_models.py#L259 that returns the total number of jobs running across all instance groups
Then you could add a setting GLOBAL_MAX_RUNNING_JOBS
(see example https://github.com/ansible/awx/blob/2529fdcfd7d5fff5ef46328246d37ce869468ac1/awx/main/conf.py#L450-L463 you have to add it in a few places)
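For illustration only, here is a rough sketch of what those two pieces might look like. The property name total_running_jobs, the attribute names it sums over, and the exact registration arguments are assumptions that follow the general patterns in those files, not code that exists in AWX today.

```python
# Hypothetical addition to awx/main/scheduler/task_manager_models.py.
# Assumes the task-manager model object keeps per-instance-group state in
# a dict self.instance_groups whose values expose a jobs_running count;
# those names are illustrative only.
@property
def total_running_jobs(self):
    """Total number of jobs currently running across every instance group."""
    return sum(ig.jobs_running for ig in self.instance_groups.values())


# Hypothetical registration in awx/main/conf.py, roughly following the
# existing register() pattern used for other job settings.
# (register, fields, and _ are already imported at the top of conf.py.)
register(
    'GLOBAL_MAX_RUNNING_JOBS',
    field_class=fields.IntegerField,
    default=0,  # 0 could mean "no global limit"; the dispatch check would need to honor that
    label=_('Global maximum number of running jobs'),
    help_text=_('Maximum number of jobs allowed to run at once across all instance groups.'),
    category=_('Jobs'),
    category_slug='jobs',
)
```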
Finally, in the process_pending_tasks function in https://github.com/ansible/awx/blob/2529fdcfd7d5fff5ef46328246d37ce869468ac1/awx/main/scheduler/task_manager.py#L542-L544 you could check something like this (pseudocode):
```python
for task in pending_tasks:
    if self.start_task_limit <= 0:
        break
    if self.timed_out():
        logger.warning("Task manager has reached time out while processing pending jobs, exiting loop early")
        break
    # proposed addition: stop starting work once the global cap is reached
    if self.tm_models.total_running_jobs >= settings.GLOBAL_MAX_RUNNING_JOBS:
        break
```
Reading https://github.com/ansible/awx/issues/14615#issuecomment-1790756133, I suppose that makes sense. For many users I would expect limiting the "Default" instance group to be sufficient. But using multiple instance groups is an intended use case.
I had trouble understanding this request because the case of VMs has an obvious workaround. If you have instances 1, 2, and 3, imagine that the "Default" instance group lists 1 and 2, and another instance group is just 3. You could still create an "all" instance group, add all 3 instances to it, and use it to limit the maximum number of jobs. I'm still not totally clear whether the task manager validates capacity limits correctly to handle this case, but if it doesn't, it should.
So the need for this setting is probably specific to container groups, although it could be used for any install.
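For illustration, that "all" instance group workaround could be scripted against the REST API. This is a hedged sketch: the base URL, token, group name, and instance IDs 1-3 are placeholders, and the association call follows AWX's usual related-list pattern.

```python
import requests

# Placeholder values for illustration only.
AWX_URL = "https://awx.example.com"
HEADERS = {"Authorization": "Bearer REPLACE_WITH_ADMIN_TOKEN"}

# Create an "all" instance group that caps total concurrent jobs.
group = requests.post(
    f"{AWX_URL}/api/v2/instance_groups/",
    headers=HEADERS,
    json={"name": "all", "max_concurrent_jobs": 200},
).json()

# Associate every instance (IDs 1-3 in this example) with the group.
for instance_id in (1, 2, 3):
    requests.post(
        f"{AWX_URL}/api/v2/instance_groups/{group['id']}/instances/",
        headers=HEADERS,
        json={"id": instance_id},
    )
```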
@kdelee: Yes, I have different container groups with different pod specs. Unfortunately, your k8s quota solution only works with a single AWX instance. If you have 2 or more AWX instances, you cannot use Kubernetes to limit the number of pods per instance, because a ResourceQuota cannot select resources by label.
@AlanCoding: Yes, this setting is mainly relevant for container groups, but it could also be useful for VMs in order to limit costs.
Feature type
New Feature
Feature Summary
Currently, there are ways to limit the forks used by a single job, but there is no way to limit the overall number of jobs (i.e. across the entire platform).
A high number of running or pending jobs may lead to an unstable platform and high CPU consumption in the web containers.
It would be useful to have a mechanism that limits this, in order to keep the platform stable.
These could be values set in /settings/jobs. When creating a new job, if job_running_limit is reached then the job goes to pending, and if job_pending_limit is reached then the job creation is rejected.
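For illustration, a rough sketch of the proposed behavior follows. job_running_limit and job_pending_limit are the setting names proposed above, not existing AWX settings, and the helper function is hypothetical.

```python
# Hypothetical admission check at job creation time (proposal sketch only).
def admit_new_job(running_count, pending_count, settings):
    """Decide what happens to a newly submitted job under the proposed limits."""
    if pending_count >= settings.job_pending_limit:
        return "reject"   # too many queued jobs: refuse the creation request
    if running_count >= settings.job_running_limit:
        return "pending"  # running cap reached: queue the job instead of starting it
    return "start"


# Example: running jobs already at the cap, so a new job waits in pending.
class ProposedLimits:
    job_running_limit = 200
    job_pending_limit = 1000

print(admit_new_job(running_count=200, pending_count=10, settings=ProposedLimits))  # -> "pending"
```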
Steps to reproduce
Launch a large number of jobs at once
Current results
High CPU consumption for web containers
Suggested feature result
When creating a new job, if job_running_limit is reached then the job goes to pending, and if job_pending_limit is reached then the job creation is rejected.
Additional information
No response