ansible / awx-resource-operator


AnsibleJobs fail at scale due to API timeout #152

Open rooftopcellist opened 10 months ago

rooftopcellist commented 10 months ago

Summary

AnsibleJob resources fail when run at scale because the AWX api/v2/job_templates endpoint can become very slow when there are many concurrently running jobs.

Details

I created 200 AnsibleJobs at once and noticed that a number of them failed on this task because of a timeout caused by AWX taking too long to respond:

TASK [job_runner : Register Job result when complete] **************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "There was an unknown error when trying to connect to https://awx-awx.apps.aap-bugbash.ocp4.testing.ansible.com/api/v2/jobs/251/: timeout The read operation timed out"}

I did a couple of GETs to the endpoint in question, and the request times reported by the API ranged from 0.3 to 3 seconds.

X-API-Product-Name: AWX
X-API-Product-Version: 23.4.0
X-API-Time: 2.612s

Analysis & Solutions

Problem 1: The job_templates API endpoint can't keep up

We could adjust the timeout on the awx.awx collection invocation in the runner role (see the sketch below), or we could try to solve this on the AWX API side... but I'm not sure what we could do there to reduce response time.
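
As a rough sketch of the first option, the per-request timeout could be raised where the runner role polls the job. The module invocation and variable names below are assumptions for illustration; they rely on recent awx.awx collection versions where the modules accept a request_timeout option (default 10 seconds), and the real task in the role may look different:

- name: Register Job result when complete
  awx.awx.job_wait:
    job_id: "{{ job_id }}"                        # hypothetical variable names
    controller_host: "{{ tower_host }}"
    controller_oauthtoken: "{{ tower_oauth_token }}"
    # Raise the per-request read timeout so slow /api/v2/jobs/<id>/ responses
    # under heavy load do not fail the task (the module default is 10 seconds).
    request_timeout: 60
  register: job_result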

Problem 2: Operator pod can get OOMKilled

The failing jobs get backed up, which results in more concurrently running AnsibleJob reconciliation loops and pods for the resource operator (see the sketch below for one way to cap that concurrency).
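
One way to bound that concurrency is a sketch like the following, assuming this operator follows the Ansible operator SDK convention of setting per-kind worker count through a WORKER_<KIND>_<GROUP> environment variable on the manager container; the exact variable name and container name here are assumptions:

# Hypothetical excerpt from the resource-operator manager Deployment.
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            # Assumed ansible-operator convention for AnsibleJob in the
            # tower.ansible.com API group; caps concurrent reconciles at 4.
            - name: WORKER_ANSIBLEJOB_TOWER_ANSIBLE_COM
              value: "4"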

Solutions:

Supporting information:

When creating 100 AnsibleJobs at once, I see that about half of them fail with the timeout from the job_templates API endpoint. Also, ~50 AnsibleJob-related containers end up running concurrently (each retries once, which doubles the container count per AnsibleJob).

$ kubectl top pods | wc -l
53
Total CPU Usage: 3485 millicores (m)
Total Memory Usage: 6157 Mi

So the high number of running jobs eats up the underlying node's memory, to the point that it starts needing to kill off pods.

rooftopcellist commented 10 months ago

More information:

For example, here is what I see in my test right now:

To check this, I queried how many pods were in a failed state, and lo and behold...

$ oc get pod | grep "Error" | wc -l
19
rooftopcellist commented 10 months ago

The failed AnsibleJob runs are actually succeeding in launching jobs in AWX; I now have over 600 jobs in AWX, all originating from the 100 AnsibleJob objects I created via the resource-operator.

A quick query to the AWX API shows that there are 155 jobs in the pending state because we have overwhelmed AWX.

https://awx-awx.apps.aap-bugbash.ocp4.testing.ansible.com/api/v2/jobs/?status=pending

...
    "count": 155,
rooftopcellist commented 10 months ago

This is the resource usage of the operator pod when it has 100 AnsibleJobs in queue (with the default of 4 AnsibleJob workers). I think we should set our requests accordingly; see the sketch after the table below.

Monitoring Pod: resource-operator-controller-manager-86f85b765c-st8r7
+------------------+-----------+-----------+-----------+
| Resource         |   Average |   Minimum |   Maximum |
+==================+===========+===========+===========+
| CPU (millicores) |  1033.73  |       449 |      1306 |
+------------------+-----------+-----------+-----------+
| Memory (Mi)      |   485.429 |       211 |       599 |
+------------------+-----------+-----------+-----------+