ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

API calls to AWX read operation timed out #14781

Open sylvain-de-fuster opened 8 months ago

sylvain-de-fuster commented 8 months ago

Please confirm the following

Bug Summary

Our environment:

For a few weeks (maybe months) now, we have been hitting intermittent timeouts when contacting the AWX API. It is not systematic, but it has become more frequent (weekly if we are lucky, sometimes daily). It typically happens when a playbook interacts with AWX through the awx.awx collection to create schedules, launch jobs, wait for job completion, and so on.

Typical message : "There was an unknown error when trying to connect to https://awx-faky.fake/api/v2/XXXXX/XXXXX: timeout The read operation timed out"
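
For reference, the failing tasks follow the usual awx.awx pattern. A minimal sketch (the template name, token variable and timeout value are placeholders, and request_timeout only exists in recent awx.awx releases):

- name: Launch a job template and wait for it to finish
  awx.awx.job_launch:
    job_template: "Deploy application"          # placeholder name
    controller_host: https://awx-faky.fake
    controller_oauthtoken: "{{ awx_token }}"    # placeholder variable
    validate_certs: true
    request_timeout: 30                         # raise the client read timeout (recent awx.awx only)
  register: launched_job

- name: Wait for the job to complete
  awx.awx.job_wait:
    job_id: "{{ launched_job.id }}"
    timeout: 1800
    controller_host: https://awx-faky.fake
    controller_oauthtoken: "{{ awx_token }}"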

AWX version

23.3.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

2.14

Operating system

RHEL 8.9

Web browser

No response

Steps to reproduce

I didn't find a way to reproduce it with exactly the same behaviour. On our dev environment, I tried to trigger the API from AWX jobs and from external calls. I managed to reproduce the timeout error, but the AWX master also came under heavy load (load1m), which may itself have caused the timeout in that case, so it wasn't exactly the same behaviour.

Expected results

API calls to AWX don't get timeouts.

Actual results

API calls to AWX can produce timeout errors even when the AWX platform is not running any other jobs.

Additional information

I checked various metrics and logs but without any conclusive findings.

• System: CPU and load1m can rise a bit, but there is no saturation, so this looks more like a consequence than the cause. Memory doesn't change (app, cache, slab and everything but free ≈ 90%). There is no clear process pattern either: awx-manage and uwsgi can use a bit more CPU, and heavy reads and I/O on the awx filesystem can sometimes be observed (nearly the size of the DB), but not specifically when a timeout is reached on the API. I also checked the CPU/memory limits on the pods/containers and none were reached. The logs do not show anything abnormal.

• AWX: the timeout errors can appear with no other jobs running (so theoretically no saturation of the control node, and good capacity available for the control and execution nodes). Web pod: I can see the requests associated with the playbook calls, but no errors or warnings to link them to. Task pod: constant warnings and some errors for the ee container, but not related to our issue.

After reading some documentation about performance (Automation Controller Performance, for example), we need to do some work on our infrastructure (number of masters, forks on jobs, dedicated DB host, log aggregator, etc.). But since our current timeout issue can happen without any particular load on the AWX infrastructure, that doesn't seem to be the solution (and it could potentially hide the real issue).

A few weeks ago, we experienced recurrent AWX platform crashes. We found out that our DB wasn't being correctly purged by the cleanup jobs (related to unpartitioned tables) and kept growing (to nearly 127 GB). After fixing it, the DB shrank to ~70 GB and the crashes stopped. But maybe the DB is still too large, and in some cases API access is slow enough to hit the timeout?
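
For reference, one way we could check which tables still dominate the database is a query like the sketch below, assuming direct access to the awx database and the community.postgresql collection (host and credential names are placeholders):

- name: List the ten largest tables in the awx database
  community.postgresql.postgresql_query:
    db: awx
    login_host: awx-postgres.example           # placeholder host
    login_user: awx                            # placeholder user
    login_password: "{{ awx_db_password }}"    # placeholder variable
    query: >
      SELECT relname,
             pg_size_pretty(pg_total_relation_size(relid)) AS total_size
      FROM pg_catalog.pg_statio_user_tables
      ORDER BY pg_total_relation_size(relid) DESC
      LIMIT 10;
  register: largest_tables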

fosterseth commented 8 months ago

Any trends in the API calls you are making? Which endpoints are involved, and how busy is the system when this happens (does it coincide with high overall traffic)?

Are the uwsgi processes staying healthy when you see timeouts?

sylvain-de-fuster commented 8 months ago

Hello,

It doesn't seem to be a specific/unique endpoint that is impacted. Usually, we use endpoints for these purposes:

The system doesn't have to be heavily loaded to experience these errors.

It's very "precise": for example, a playbook with a task that uses awx.awx.schedule in a loop can:

All of that happens regardless of system usage (a rough sketch of the looping task is shown below).
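
A rough sketch of that looping task (the job template list, token variable and rrule are placeholders):

- name: Create one schedule per job template
  awx.awx.schedule:
    name: "nightly-{{ item }}"
    unified_job_template: "{{ item }}"
    rrule: "DTSTART:20240101T020000Z RRULE:FREQ=DAILY;INTERVAL=1"
    state: present
    controller_host: https://awx-faky.fake
    controller_oauthtoken: "{{ awx_token }}"   # placeholder variable
  loop: "{{ job_template_names }}"             # placeholder list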

For the uwsgi part, are there specific logs or metrics that could help me answer you?

I can see some messages like these in the web container, but I'm not sure they always coincide with the timeouts or are related at all:

• Sometimes lots of WSCONNECTING/WSCONNECT/WSDISCONNECT messages:

127.0.0.1:39582 - - [24/Jan/2024:15:15:11] "WSCONNECTING /websocket/" - -
127.0.0.1:39582 - - [24/Jan/2024:15:15:11] "WSCONNECT /websocket/" - -
127.0.0.1:39596 - - [24/Jan/2024:15:15:15] "WSCONNECTING /websocket/" - -
127.0.0.1:39596 - - [24/Jan/2024:15:15:15] "WSCONNECT /websocket/" - -
127.0.0.1:39816 - - [24/Jan/2024:15:15:23] "WSCONNECTING /websocket/" - -
127.0.0.1:39816 - - [24/Jan/2024:15:15:23] "WSCONNECT /websocket/" - -
127.0.0.1:39826 - - [24/Jan/2024:15:15:23] "WSCONNECTING /websocket/" - -
127.0.0.1:39826 - - [24/Jan/2024:15:15:23] "WSCONNECT /websocket/" - -
127.0.0.1:39826 - - [24/Jan/2024:15:15:30] "WSDISCONNECT /websocket/" - -
127.0.0.1:53770 - - [24/Jan/2024:15:15:34] "WSDISCONNECT /websocket/" - -
127.0.0.1:44636 - - [24/Jan/2024:15:15:42] "WSCONNECTING /websocket/" - -
127.0.0.1:44636 - - [24/Jan/2024:15:15:43] "WSCONNECT /websocket/" - -
127.0.0.1:44644 - - [24/Jan/2024:15:15:45] "WSCONNECTING /websocket/" - -
127.0.0.1:44644 - - [24/Jan/2024:15:15:45] "WSCONNECT /websocket/" - -
127.0.0.1:44654 - - [24/Jan/2024:15:15:45] "WSCONNECTING /websocket/" - -
127.0.0.1:44654 - - [24/Jan/2024:15:15:45] "WSCONNECT /websocket/" - -
127.0.0.1:44654 - - [24/Jan/2024:15:15:57] "WSDISCONNECT /websocket/" - -
127.0.0.1:44336 - - [24/Jan/2024:15:16:19] "WSDISCONNECT /websocket/" - -
127.0.0.1:33904 - - [24/Jan/2024:15:16:21] "WSCONNECTING /websocket/" - -
127.0.0.1:33904 - - [24/Jan/2024:15:16:21] "WSCONNECT /websocket/" - -
127.0.0.1:33912 - - [24/Jan/2024:15:16:21] "WSCONNECTING /websocket/" - -
127.0.0.1:33912 - - [24/Jan/2024:15:16:21] "WSCONNECT /websocket/" - -
127.0.0.1:33912 - - [24/Jan/2024:15:16:23] "WSDISCONNECT /websocket/" - -
127.0.0.1:33926 - - [24/Jan/2024:15:16:23] "WSCONNECTING /websocket/" - -

• Some uWSGI workers killed/respawned:

worker 2 killed successfully (pid: 1476)
Respawned uWSGI worker 2 (new pid: 1501)
mounting awx.wsgi:application on /

If I check all the web logs (from 2024-01-24 15:17:17,684 to 2024-01-25 13:25:34,331 in this case):

# kubectl logs pod/awx-faky-web-XXXXXX -c awx-faky-web | grep 'Respawned uWSGI' | wc -l
48

If you need more information, don't hesitate to ask.

MaxBidlingmaier commented 5 months ago

Hi, we have the same problem. We use AWX to execute jobs created by another automation solution. When we start concurrent jobs (e.g. 200), the API starts lagging and finally stops responding altogether.

Our test works as follows:

With this test, around 45-50% of my workers' jobs report "The read operation timed out" with 2 AWX web pods; the proportion can be reduced a bit by increasing the web pods to 3. It seems the AWX web service does not cope well with an increasing number of requests.

The test system was isolated and only executing the test. No requests other than the test requests were sent to the system during the tests, and there was no UI access either. Does the AWX project have any guidance on the number of API calls the system can sustain, or on known limitations? Any advice on how to investigate this further?

The test used AWX 22.3.0 on Kubernetes, with K8s worker node CPU usage below 50%. The highest CPU usage for the AWX pods is around 1 core while the test is running.
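
To make the test concrete, each worker essentially issues a launch request against the REST API. Expressed as an Ansible task purely for illustration (our real client is not Ansible; the host, template id and token are placeholders):

- name: Launch a job template via the AWX REST API (illustration only)
  ansible.builtin.uri:
    url: https://awx.example/api/v2/job_templates/42/launch/   # placeholder host and id
    method: POST
    headers:
      Authorization: "Bearer {{ awx_token }}"                  # placeholder token
    body_format: json
    body: {}
    status_code: 201
    validate_certs: true
  register: launch_response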

Best regards Max

See here too, it looks like the same issue: https://github.com/ansible/awx-resource-operator/issues/152