ansible / awx-resource-operator


AnsibleJobs fail at scale due to API timeout #152

Open rooftopcellist opened 10 months ago

rooftopcellist commented 10 months ago

Summary

AnsibleJob resources fail when run at scale because the AWX api/v2/job_templates endpoint can become very slow when there are many concurrently running jobs.

Details

I created 200 AnsibleJobs at once and noticed that a number of them failed on this task because of a timeout caused by AWX taking too long to respond:

TASK [job_runner : Register Job result when complete] **************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "There was an unknown error when trying to connect to https://awx-awx.apps.aap-bugbash.ocp4.testing.ansible.com/api/v2/jobs/251/: timeout The read operation timed out"}

I did a couple of GETs to the endpoint in question, and the request times reported by the API ranged from 0.3 to 3 seconds.

X-API-Product-Name: AWX
X-API-Product-Version: 23.4.0
X-API-Time: 2.612s

Analysis & Solutions

Problem 1: The job_templates API endpoint can't keep up

We could adjust the timeout on the awx.awx collection invocation in the runner role (see the sketch below), or we could try to solve this on the AWX API side... but I'm not sure what we could do there to reduce response time.
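
As a rough sketch of the first option, the per-request timeout could be raised where the runner role polls the job. The module invocation and variable names below are assumptions for illustration; they rely on recent awx.awx collection versions where the modules accept a request_timeout option (default 10 seconds), and the real task in the role may look different:

- name: Register Job result when complete
  awx.awx.job_wait:
    job_id: "{{ job_id }}"                        # hypothetical variable names
    controller_host: "{{ tower_host }}"
    controller_oauthtoken: "{{ tower_oauth_token }}"
    # Raise the per-request read timeout so slow /api/v2/jobs/<id>/ responses
    # under heavy load do not fail the task (the module default is 10 seconds).
    request_timeout: 60
  register: job_result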

Problem 2: Operator pod can get OOMKilled

The failing jobs get backed up, which results in more concurrently running AnsibleJob reconciliation loops and pods for the resource operator (see the sketch below for one way to cap that concurrency).
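
One way to bound that concurrency is a sketch like the following, assuming this operator follows the Ansible operator SDK convention of setting per-kind worker count through a WORKER_<KIND>_<GROUP> environment variable on the manager container; the exact variable name and container name here are assumptions:

# Hypothetical excerpt from the resource-operator manager Deployment.
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            # Assumed ansible-operator convention for AnsibleJob in the
            # tower.ansible.com API group; caps concurrent reconciles at 4.
            - name: WORKER_ANSIBLEJOB_TOWER_ANSIBLE_COM
              value: "4"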

Solutions:

Supporting information:

When creating 100 AnsibleJobs at once, I see that about half of them fail with the timeout from the job_templates API endpoint. Also, ~50 AnsibleJob-related containers end up running concurrently (each retries once, which doubles the container count per AnsibleJob).

$ kubectl top pods | wc -l
53
Total CPU Usage: 3485 millicores (m)
Total Memory Usage: 6157 Mi

So the high number of running jobs eats up the underlying node's memory, to the point that it starts needing to kill off pods.

rooftopcellist commented 10 months ago

More information:

For example, here is what I see in my test right now:

To check this, I queried how many pods were in a failed state, and lo and behold...

$ oc get pod | grep "Error" | wc -l
19
rooftopcellist commented 10 months ago

The failed AnsibleJob runs are actually succeeding in launching jobs in AWX; I now have over 600 jobs in AWX, all originating from the 100 AnsibleJob objects I created via the resource-operator.

A quick query to the AWX API shows that there are 155 jobs in the pending state because we have overwhelmed AWX.

https://awx-awx.apps.aap-bugbash.ocp4.testing.ansible.com/api/v2/jobs/?status=pending

...
    "count": 155,
rooftopcellist commented 10 months ago

This is the resource usage of the operator pod when it has 100 AnsibleJobs in queue (with the default of 4 AnsibleJob workers). I think we should set our requests accordingly; see the sketch after the table below.

Monitoring Pod: resource-operator-controller-manager-86f85b765c-st8r7
+------------------+-----------+-----------+-----------+
| Resource         |   Average |   Minimum |   Maximum |
+==================+===========+===========+===========+
| CPU (millicores) |  1033.73  |       449 |      1306 |
+------------------+-----------+-----------+-----------+
| Memory (Mi)      |   485.429 |       211 |       599 |
+------------------+-----------+-----------+-----------+