rooftopcellist opened 10 months ago
More information:
For example, here is what I'm seeing in my test right now:
To check this, I queried how many pods were in a failed state and, lo and behold...
$ oc get pod | grep "Error" | wc -l
19
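To dig into one of those, a one-liner like this (purely illustrative) pulls the logs from the first pod reporting Error:
# Illustrative: fetch logs from the first pod in the Error state
$ oc logs "$(oc get pods --no-headers | awk '/Error/ {print $1; exit}')"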
The failed AnsibleJob runs are actually succeeding in launching jobs in AWX; I now have over 600 jobs in AWX, all originating from the 100 AnsibleJob objects I created via the resource-operator.
A quick query to the AWX API shows that there are 155 jobs in the pending state because we have overwhelmed AWX.
https://awx-awx.apps.aap-bugbash.ocp4.testing.ansible.com/api/v2/jobs/?status=pending
...
"count": 155,
This is the resource usage of the operator pod when it has 100 AnsibleJobs in the queue (with the default of 4 ansiblejob workers). I think we should set our resource requests accordingly; a sketch follows the table below.
Monitoring Pod: resource-operator-controller-manager-86f85b765c-st8r7
+------------------+-----------+-----------+-----------+
| Resource | Average | Minimum | Maximum |
+==================+===========+===========+===========+
| CPU (millicores) | 1033.73 | 449 | 1306 |
+------------------+-----------+-----------+-----------+
| Memory (Mi) | 485.429 | 211 | 599 |
+------------------+-----------+-----------+-----------+
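Based on those peaks, something like the following could be a starting point. The values are judgment calls derived from the table above, and the container name is an assumption, so verify both against your deployment:
# Illustrative sketch: set requests/limits on the operator deployment
# (container name "manager" is assumed; adjust values to taste)
$ oc set resources deployment/resource-operator-controller-manager \
    -c manager --requests=cpu=500m,memory=256Mi --limits=memory=768Mi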
Summary
AnsibleJob resources fail when run at scale because the AWX api/v2/job_templates endpoint can get very slow when many jobs are running concurrently.
Details
I created 200 AnsibleJobs at once and noticed that a number of them failed on this task because of a timeout caused by AWX taking too long to respond.
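For anyone reproducing this, a loop along these lines generates the load (ansiblejob-template.yaml is a hypothetical manifest with a NAME placeholder, not a file from this repo):
# Hypothetical reproduction: stamp out 200 AnsibleJob CRs from a template
$ for i in $(seq 1 200); do
    sed "s/NAME/scale-test-$i/" ansiblejob-template.yaml | oc create -f -
  done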
I did a couple of GETs to the endpoint in question, and the response times from the API ranged from 0.3 to 3 seconds.
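curl's timing output is an easy way to confirm those numbers; a sketch, again with a placeholder token:
# Measure total response time for the job_templates endpoint
$ curl -s -o /dev/null -w '%{time_total}\n' \
    -H "Authorization: Bearer $AWX_TOKEN" \
    "https://awx-awx.apps.aap-bugbash.ocp4.testing.ansible.com/api/v2/job_templates/"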
Analysis & Solutions
Problem 1: The job_templates API endpoint can't keep up
We could increase the timeout on the awx.awx collection invocation in the runner role, or we could try to solve this on the AWX API side, but I'm not sure what we could do there to reduce response time.
Problem 2: Operator pod can get OOMKilled
The failing jobs lead to a backlog, which results in more AnsibleJob reconciliation loops and pods running concurrently for the resource operator.
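One knob worth noting for the concurrency side: ansible-operator derives per-kind worker counts from WORKER_&lt;KIND&gt;_&lt;GROUP&gt; environment variables, so lowering the AnsibleJob worker count should cap how many reconciles run at once. A sketch; verify the exact variable name for this operator build:
# Reduce concurrent AnsibleJob reconciles from the default of 4 to 2
# (env var name follows the ansible-operator WORKER_<KIND>_<GROUP> convention)
$ oc set env deployment/resource-operator-controller-manager \
    WORKER_ANSIBLEJOB_TOWER_ANSIBLE_COM=2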
Solutions:
Supporting information:
When creating 100 AnsibleJobs at once, I see that about half of them fail with the timeout from the job_templates API endpoint. Also, ~50 ansiblejob-related containers end up running concurrently (each retries once, which doubles the container count per AnsibleJob).
So the high number of running jobs eats up the underlying node's memory, to the point that the node starts needing to kill off pods.
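A quick way to spot the resulting kills (the jsonpath query is illustrative):
# List pods whose last container termination was an OOM kill
$ oc get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oomkilled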