Set max concurrent reconciles flag in Dockerfile

ansible / awx-resource-operator

Set max concurrent reconciles flag in Dockerfile #151

Open rooftopcellist opened 10 months ago

rooftopcellist commented 10 months ago

Set the max concurrent reconciles flag in the Dockerfile, because setting it in the args section of config/manager/manager.yaml didn't work. I set the default low, at 2, but set ansiblejob and ansibleworkflow higher, at 3. This throttling should help keep us from overwhelming the operator container.

Set max concurrent ansiblejob to 3
Set max concurrent ansibleworkflow to 3

Follow up for https://github.com/ansible/awx-resource-operator/pull/150

Users can increase these values by setting new env vars on the Subscription or Deployment for the operator.
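
For illustration, here is a minimal sketch of how such an env var name could be derived. It assumes the operator follows the Ansible operator SDK naming pattern MAX_CONCURRENT_RECONCILES_<KIND>_<GROUP> (uppercased, dots replaced with underscores) and that the API group is tower.ansible.com. Neither is stated in this PR, and the helper name reconcile_env_var and the value 5 are hypothetical, so check the Dockerfile for the actual variable names before relying on this.

```python
def reconcile_env_var(kind: str, group: str) -> str:
    """Build a per-resource env var name.

    Assumption: the SDK pattern is MAX_CONCURRENT_RECONCILES_<KIND>_<GROUP>,
    uppercased, with dots in the group replaced by underscores.
    """
    return f"MAX_CONCURRENT_RECONCILES_{kind}_{group}".upper().replace(".", "_")


# Assumed API group for the operator's resources; not confirmed in this PR.
GROUP = "tower.ansible.com"

# Print NAME=VALUE pairs that could be set on the Subscription or Deployment.
# The value 5 is only an example of raising the limit above 3.
for kind in ("AnsibleJob", "AnsibleWorkflow"):
    print(f"{reconcile_env_var(kind, GROUP)}=5")
```

The printed pairs would go in the env section of the operator Deployment, or under spec.config.env of the Subscription.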

rooftopcellist commented 10 months ago

I want to do some testing at scale, with and without this change, before merging. Based on the docs below, I think our original assumption that the number of concurrent jobs was not being throttled may have been wrong:

The --max-concurrent-reconciles flag can be used to override the default max concurrent reconciles, which by default is the number of CPUs on the node on which the operator is running.

https://sdk.operatorframework.io/docs/building-operators/helm/reference/advanced_features/max_concurrent_reconciles/

rebeccahhh commented 10 months ago

@rooftopcellist we can sync up on this another time, but this line concerns me: "by default is the number of CPUs on the node on which the operator is running." I don't think we were seeing that behavior when we tested without max concurrent reconciles set.

rooftopcellist commented 10 months ago

@rebeccahhh afaik, it is not basing that on the requests/limits for the resource operator pod. From what I can tell, it is basing it on the number of CPUs on the node the resource operator pod is scheduled on.

And for that, it seems to be correct:

$ oc get node aap-dev-8scgk-worker-a-9v8vw -o yaml | grep cpu:
    cpu: 3500m
    cpu: "4"

We were seeing 4 workers get set for each resource by default. I think it does this because the number of parallel workers you can have depends more on the number of CPU cores than on the capacity of each core (or in our case, the capacity allocated to the pod).
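
To make the arithmetic concrete, here is a small sketch that parses the two cpu values from the node output above. It assumes the 3500m line is the node's allocatable CPU and the "4" line is its capacity, which is the usual order in node YAML but is not confirmed by the grep output. The parse_cpu helper is hypothetical and only handles plain and millicore quantities.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "3500m" or "4" to cores."""
    if quantity.endswith("m"):
        # Millicores: 1000m equals one core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Values copied from the `oc get node ... | grep cpu:` output.
allocatable = parse_cpu("3500m")  # assumed to be status.allocatable
capacity = parse_cpu("4")         # assumed to be status.capacity

# The observed default of 4 workers matches the whole-core count of the node,
# not the fractional amount that is schedulable for pods.
default_workers = int(capacity)
print(allocatable, capacity, default_workers)
```

Under these assumptions, the observed default of 4 workers lines up with the node's core count rather than with any fractional CPU amount.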