ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

API calls to AWX read operation timed out #14781

Open sylvain-de-fuster opened 8 months ago

sylvain-de-fuster commented 8 months ago

Please confirm the following

Bug Summary

Our environment:

For a few weeks (maybe months) now, we have been hitting intermittent timeouts when contacting the AWX API. It is not systematic, but it has become more frequent (weekly if we are lucky, sometimes daily). It typically happens when a playbook interacts with AWX through the awx.awx collection to create schedules, launch jobs, wait for job completion, and so on.

Typical message : "There was an unknown error when trying to connect to https://awx-faky.fake/api/v2/XXXXX/XXXXX: timeout The read operation timed out"
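
For reference, the failing tasks follow the usual awx.awx pattern. A minimal sketch (the template name, token variable and timeout value are placeholders, and request_timeout only exists in recent awx.awx releases):

- name: Launch a job template and wait for it to finish
  awx.awx.job_launch:
    job_template: "Deploy application"          # placeholder name
    controller_host: https://awx-faky.fake
    controller_oauthtoken: "{{ awx_token }}"    # placeholder variable
    validate_certs: true
    request_timeout: 30                         # raise the client read timeout (recent awx.awx only)
  register: launched_job

- name: Wait for the job to complete
  awx.awx.job_wait:
    job_id: "{{ launched_job.id }}"
    timeout: 1800
    controller_host: https://awx-faky.fake
    controller_oauthtoken: "{{ awx_token }}"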

AWX version

23.3.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

2.14

Operating system

RHEL 8.9

Web browser

No response

Steps to reproduce

I didn't find a way to reproduce it with exactly the same behaviour. On our dev environment, I tried to trigger the API from AWX jobs and from external calls. I managed to reproduce the timeout error, but the AWX master also came under heavy load (load1m), which may itself have caused the timeout in that case, so it wasn't exactly the same behaviour.

Expected results

API calls to AWX don't get timeouts.

Actual results

API calls to AWX can produce timeout errors even when the AWX platform is not running any other jobs.

Additional information

I checked various metrics and logs but without any conclusive findings.

• System: CPU and load1m can rise a bit, but there is no saturation, so this looks more like a consequence than the cause. Memory doesn't change (app, cache, slab and everything but free ≈ 90%). There is no clear process pattern either: awx-manage and uwsgi can use a bit more CPU, and heavy reads and I/O on the awx filesystem can sometimes be observed (nearly the size of the DB), but not specifically when a timeout is reached on the API. I also checked the CPU/memory limits on the pods/containers and none were reached. The logs do not show anything abnormal.

• AWX: the timeout errors can appear with no other jobs running (so theoretically no saturation of the control node, and good capacity available for the control and execution nodes). Web pod: I can see the requests associated with the playbook calls, but no errors or warnings to link them to. Task pod: constant warnings and some errors for the ee container, but not related to our issue.

After reading some documentation about performance (Automation Controller Performance, for example), we need to do some work on our infrastructure (number of masters, forks on jobs, dedicated DB host, log aggregator, etc.). But since our current timeout issue can happen without any particular load on the AWX infrastructure, that doesn't seem to be the solution (and it could potentially hide the real issue).

A few weeks ago, we experienced recurrent AWX platform crashes. We found out that our DB wasn't being correctly purged by the cleanup jobs (related to unpartitioned tables) and kept growing (to nearly 127 GB). After fixing it, the DB shrank to ~70 GB and the crashes stopped. But maybe the DB is still too large, and in some cases API access is slow enough to hit the timeout?
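
For reference, one way we could check which tables still dominate the database is a query like the sketch below, assuming direct access to the awx database and the community.postgresql collection (host and credential names are placeholders):

- name: List the ten largest tables in the awx database
  community.postgresql.postgresql_query:
    db: awx
    login_host: awx-postgres.example           # placeholder host
    login_user: awx                            # placeholder user
    login_password: "{{ awx_db_password }}"    # placeholder variable
    query: >
      SELECT relname,
             pg_size_pretty(pg_total_relation_size(relid)) AS total_size
      FROM pg_catalog.pg_statio_user_tables
      ORDER BY pg_total_relation_size(relid) DESC
      LIMIT 10;
  register: largest_tables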

fosterseth commented 8 months ago

Any trends in the API calls you are making? Which endpoints are involved, and how busy is the system when this happens (does it coincide with high overall traffic)?

Are the uwsgi processes staying healthy when you see timeouts?

sylvain-de-fuster commented 8 months ago

Hello,

It doesn't seem to be a specific/unique endpoint that is impacted. Usually, we use endpoints for these purposes:

The system doesn't have to be heavily loaded to experience these errors.

It's very "precise": for example, a playbook with a task that uses awx.awx.schedule in a loop can:

All of that happens regardless of system usage (a rough sketch of the looping task is shown below).
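
A rough sketch of that looping task (the job template list, token variable and rrule are placeholders):

- name: Create one schedule per job template
  awx.awx.schedule:
    name: "nightly-{{ item }}"
    unified_job_template: "{{ item }}"
    rrule: "DTSTART:20240101T020000Z RRULE:FREQ=DAILY;INTERVAL=1"
    state: present
    controller_host: https://awx-faky.fake
    controller_oauthtoken: "{{ awx_token }}"   # placeholder variable
  loop: "{{ job_template_names }}"             # placeholder list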

For the uwsgi part, are there specific logs or metrics that could help me answer you?

I can see some messages like these in the web container, but I'm not sure they always coincide with the timeouts or are related at all:

• Sometimes lots of WSCONNECTING/WSCONNECT/WSDISCONNECT messages:

127.0.0.1:39582 - - [24/Jan/2024:15:15:11] "WSCONNECTING /websocket/" - -
127.0.0.1:39582 - - [24/Jan/2024:15:15:11] "WSCONNECT /websocket/" - -
127.0.0.1:39596 - - [24/Jan/2024:15:15:15] "WSCONNECTING /websocket/" - -
127.0.0.1:39596 - - [24/Jan/2024:15:15:15] "WSCONNECT /websocket/" - -
127.0.0.1:39816 - - [24/Jan/2024:15:15:23] "WSCONNECTING /websocket/" - -
127.0.0.1:39816 - - [24/Jan/2024:15:15:23] "WSCONNECT /websocket/" - -
127.0.0.1:39826 - - [24/Jan/2024:15:15:23] "WSCONNECTING /websocket/" - -
127.0.0.1:39826 - - [24/Jan/2024:15:15:23] "WSCONNECT /websocket/" - -
127.0.0.1:39826 - - [24/Jan/2024:15:15:30] "WSDISCONNECT /websocket/" - -
127.0.0.1:53770 - - [24/Jan/2024:15:15:34] "WSDISCONNECT /websocket/" - -
127.0.0.1:44636 - - [24/Jan/2024:15:15:42] "WSCONNECTING /websocket/" - -
127.0.0.1:44636 - - [24/Jan/2024:15:15:43] "WSCONNECT /websocket/" - -
127.0.0.1:44644 - - [24/Jan/2024:15:15:45] "WSCONNECTING /websocket/" - -
127.0.0.1:44644 - - [24/Jan/2024:15:15:45] "WSCONNECT /websocket/" - -
127.0.0.1:44654 - - [24/Jan/2024:15:15:45] "WSCONNECTING /websocket/" - -
127.0.0.1:44654 - - [24/Jan/2024:15:15:45] "WSCONNECT /websocket/" - -
127.0.0.1:44654 - - [24/Jan/2024:15:15:57] "WSDISCONNECT /websocket/" - -
127.0.0.1:44336 - - [24/Jan/2024:15:16:19] "WSDISCONNECT /websocket/" - -
127.0.0.1:33904 - - [24/Jan/2024:15:16:21] "WSCONNECTING /websocket/" - -
127.0.0.1:33904 - - [24/Jan/2024:15:16:21] "WSCONNECT /websocket/" - -
127.0.0.1:33912 - - [24/Jan/2024:15:16:21] "WSCONNECTING /websocket/" - -
127.0.0.1:33912 - - [24/Jan/2024:15:16:21] "WSCONNECT /websocket/" - -
127.0.0.1:33912 - - [24/Jan/2024:15:16:23] "WSDISCONNECT /websocket/" - -
127.0.0.1:33926 - - [24/Jan/2024:15:16:23] "WSCONNECTING /websocket/" - -

• Some uWSGI workers killed/respawned:

worker 2 killed successfully (pid: 1476)
Respawned uWSGI worker 2 (new pid: 1501)
mounting awx.wsgi:application on /

If I check all the web logs (from 2024-01-24 15:17:17,684 to 2024-01-25 13:25:34,331 in this case):

# kubectl logs pod/awx-faky-web-XXXXXX -c awx-faky-web | grep 'Respawned uWSGI' | wc -l
48

If you need more information, don't hesitate to ask.

MaxBidlingmaier commented 5 months ago

Hi, we have the same problem. We use AWX to execute jobs created by another automation solution. When we start concurrent jobs (e.g. 200), the API starts lagging and finally stops responding altogether.

Our test works as follows:

With this test, around 45-50% of my workers' jobs report "The read operation timed out" with 2 AWX web pods; the proportion can be reduced a bit by increasing the web pods to 3. It seems the AWX web service does not cope well with an increasing number of requests.

The test system was isolated and only executing the test. No requests other than the test requests were sent to the system during the tests, and there was no UI access either. Does the AWX project have any guidance on the number of API calls the system can sustain, or on known limitations? Any advice on how to investigate this further?

The test used AWX 22.3.0 on Kubernetes, with K8s worker node CPU usage below 50%. The highest CPU usage for the AWX pods is around 1 core while the test is running.
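
To make the test concrete, each worker essentially issues a launch request against the REST API. Expressed as an Ansible task purely for illustration (our real client is not Ansible; the host, template id and token are placeholders):

- name: Launch a job template via the AWX REST API (illustration only)
  ansible.builtin.uri:
    url: https://awx.example/api/v2/job_templates/42/launch/   # placeholder host and id
    method: POST
    headers:
      Authorization: "Bearer {{ awx_token }}"                  # placeholder token
    body_format: json
    body: {}
    status_code: 201
    validate_certs: true
  register: launch_response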

Best regards Max

See here too, it looks like the same issue: https://github.com/ansible/awx-resource-operator/issues/152