rooftopcellist opened 1 year ago
I want to do some testing at scale, with and without this change, before merging. I think our original assumption that the number of concurrent jobs was not being throttled may have been wrong, based on this:
The --max-concurrent-reconciles flag can be used to override the default max concurrent reconciles, which by default is the number of CPUs on the node on which the operator is running.
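For context, this flag would normally be passed to the manager container as an argument in the operator Deployment. A minimal sketch, assuming an ansible-operator-style manager container (the container name and the other arg here are assumptions, not this repo's exact manifest); as noted further down, setting it this way did not take effect for this operator:

```yaml
# config/manager/manager.yaml (sketch, not the repo's exact manifest)
spec:
  template:
    spec:
      containers:
        - name: manager                          # assumed container name
          args:
            - "--leader-elect"                   # illustrative existing arg
            - "--max-concurrent-reconciles=2"    # cap concurrent reconciles per controller
```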
@rooftopcellist we can sync up on this another time, but this line concerns me: "default is the number of CPUs on the node on which the operator is running."
I don't think we were seeing that behavior when we tested without max_reconciles set.
@rebeccahhh afaik, it is not basing that on the requests/limits of the resource operator pod; from what I can tell, it is instead basing it on the number of CPUs on the node the resource operator pod is scheduled on.
And that seems to be correct:
$ oc get node aap-dev-8scgk-worker-a-9v8vw -o yaml | grep cpu:
cpu: 3500m
cpu: "4"
We were seeing 4 workers get set for each resource by default. I think it does this because the number of parallel workers you can run is determined more by the number of CPU cores than by the capacity of each core (or, in our case, by the capacity allocated to the pod).
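For contrast, the capacity allocated to the pod is whatever the manager container's requests/limits specify, which can be far less than the node's core count. A sketch with illustrative values, not this repo's actual settings:

```yaml
# Manager container resources block (illustrative values only)
resources:
  requests:
    cpu: 50m
  limits:
    cpu: 500m   # well under one core, yet the default worker count tracked the node's 4 cores
```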
Set the max concurrent reconciles flag in the Dockerfile because setting it in the config/manager/manager.yaml args section didn't work. I set the default low, to 2, but set ansiblejob and ansibleworkflow higher, at 3. This throttling should help us avoid overwhelming the operator container.
Follow up for https://github.com/ansible/awx-resource-operator/pull/150
Users can increase these values by setting new env vars on the Subscription or Deployment for the operator.
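For an OLM install, that override would go on the Subscription's spec.config.env. A sketch, assuming hypothetical names (the channel, catalog source, and env var name below are illustrative, not values defined by this PR):

```yaml
# OLM Subscription override (sketch; channel/source values are assumptions,
# and the env var name is hypothetical, not one defined by this PR)
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: awx-resource-operator
  namespace: awx                               # assumed namespace
spec:
  channel: stable                              # assumed channel
  name: awx-resource-operator
  source: redhat-operators                     # assumed catalog source
  sourceNamespace: openshift-marketplace
  config:
    env:
      - name: ANSIBLEJOB_RECONCILE_WORKERS     # hypothetical variable name
        value: "5"
```

For a plain Deployment install, the same env var would instead be added under the manager container's env in config/manager/manager.yaml.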