ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

uWSGI listen queue of socket "127.0.0.1:8050" full / Database deadlock #12580

Open anxstj opened 2 years ago

anxstj commented 2 years ago

Please confirm the following

Bug Summary

In the last couple of days, AWX repeatedly became unreachable (504 gateway error). The container logs showed us:

uWSGI listen queue of socket "127.0.0.1:8050" (fd: 3) full !!! (101/100)

Looking at the database we see a PostgreSQL process fully consuming a CPU core:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                              
2314769 postgres  20   0 3870908   1.5g  44840 R  97.6  20.0 125:14.77 postgres

Which was caused by the following query:

psql awx -c "SELECT pid, age(clock_timestamp(), query_start), usename, query FROM pg_stat_activity 
WHERE query != '<IDLE>' AND state != 'idle' AND query NOT ILIKE '%pg_stat_activity%' ORDER BY query_start desc;"

2314769 | 00:00:08.193428 | sql_awx | UPDATE "main_jobevent" SET "host_id" = NULL WHERE "main_jobevent"."id" IN (277290805, 277290806,  ...

The following advisory locks were held at that moment:

psql awx -c "SELECT * FROM pg_locks WHERE locktype = 'advisory';"

 locktype | database |  classid   |   objid    | objsubid | virtualtransaction |   pid   |     mode      | granted | fastpath         
----------+----------+------------+------------+----------+--------------------+---------+---------------+---------+----------
 advisory |    16402 | 4294967295 | 2479945980 |        1 | 46/36422           | 2355613 | ExclusiveLock | f       | f
 advisory |    16402 |          0 | 1226251610 |        1 | 38/146232          | 2355820 | ExclusiveLock | t       | f
 advisory |    16402 | 4294967295 | 2479945980 |        1 | 27/1068360         | 2314789 | ExclusiveLock | f       | f
 advisory |    16402 | 4294967295 | 2479945980 |        1 | 24/1428056         | 2355617 | ExclusiveLock | f       | f
 advisory |    16402 | 4294967295 | 2479945980 |        1 | 13/1129470         | 2352302 | ExclusiveLock | f       | f
 advisory |    16402 | 4294967295 | 2479945980 |        1 | 19/1688253         | 2314769 | ExclusiveLock | t       | f
(6 rows)

What struck me here is the other advisory lock that is granted:

 2355820 | 01:02:05.922299 | sql_awx | CREATE TABLE IF NOT EXISTS main_jobevent_20220726_12  PARTITION OF main_jobevent FOR VALUES FROM ('2022-07-26 12:00:00+00:00') to ('2022-07-26 13:00:00+00:00');

That's a simple query, and yet it had been running for more than an hour.
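For anyone debugging a similar situation, a query along these lines should show directly which backend each waiting session is blocked on (this is only a diagnostic sketch, not something AWX runs itself; pg_blocking_pids() requires PostgreSQL 9.6 or newer):

-- Diagnostic sketch: list waiting sessions together with the PIDs blocking them
SELECT waiting.pid                                 AS waiting_pid,
       pg_blocking_pids(waiting.pid)               AS blocked_by,
       age(clock_timestamp(), waiting.query_start) AS waiting_for,
       left(waiting.query, 80)                     AS query
FROM pg_stat_activity AS waiting
WHERE cardinality(pg_blocking_pids(waiting.pid)) > 0
ORDER BY waiting.query_start;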

Now I tried to find out which job was causing the UPDATE query (pid 2314769):

$ ps faux | grep 2314769
postgres 2314769 98.0 20.0 3879100 1627452 ?     Rs   11:20 132:42  \_ postgres: 12/main: sql_awx awx 10.61.35.14(37840) UPDATE

This shows the IP and source port. Going to the AWX node and searching for that open port reveals:

$  ss -ntp | grep 37840
ESTAB 0      0         172.18.0.3:37840     10.61.35.42:5432  users:(("awx-manage",pid=1574305,fd=45))
$ ps -fp 1574305
UID          PID    PPID  C STIME TTY          TIME CMD
awx      1574305      87  0 11:18 ?        00:00:40 /var/lib/awx/venv/awx/bin/python3.9 /usr/bin/awx-manage run_dispatcher
$ awx-manage run_dispatcher --status | grep -A1 1574305
.  worker[pid:1574305] sent=9 finished=8 qsize=1 rss=1531.555MB
     - running cb54f0ce-948e-4464-84e0-55756d1a92d1 RunInventoryUpdate(*[1010791])

The job with ID 1010791 is an inventory synchronization job with source "VMware Center". That job had been running for more than an hour, even though older runs took only a couple of seconds. Killing the process resolved the deadlock and the listen queue drained.
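We killed the awx-manage worker on the AWX node; as an alternative sketch, the stuck statement could also be stopped from the database side using the PID from pg_stat_activity (pg_cancel_backend() cancels only the current query, pg_terminate_backend() closes the whole connection and therefore also releases its advisory locks):

-- Cancel just the running statement of the blocking backend (PID taken from pg_stat_activity)
SELECT pg_cancel_backend(2314769);

-- If that is not enough, terminate the whole backend, which also drops its locks
SELECT pg_terminate_backend(2314769);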

I'm wondering whether the UPDATE query caused by the inventory update job and the CREATE TABLE query are blocking each other: both are trying to access the same partition of the main_jobevent table.

Another explanation would be that the inventory update job got stuck for some reason (e.g. a broken network connection) and its advisory lock was never released because the SQL connection was still open, while the CREATE TABLE query was waiting on that lock but holding the advisory lock for the new partition. On the other hand, the UPDATE query was still burning CPU cycles, which doesn't really look like a deadlock.
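One way to check the second theory would be to join the advisory locks with the session state, to see whether the lock holders are still active or just sitting in an open connection (again only a diagnostic sketch):

-- Diagnostic sketch: state of every session holding or waiting for an advisory lock
SELECT l.pid, l.granted, a.state, a.wait_event_type, a.wait_event,
       age(clock_timestamp(), a.query_start) AS query_age,
       left(a.query, 80)                     AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
ORDER BY l.granted DESC, a.query_start;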

We have experienced this situation at least three times, always with the same pattern.

AWX version

21.2.0

Select the relevant components

Installation method

docker development environment

Modifications

yes

Ansible version

2.12.2

Operating system

Debian 11

Web browser

Firefox

Steps to reproduce

I don't know yet.

Expected results

The listen queue doesn't pile up.

Actual results

The listen queue piles up.

Additional information

The development environment uses an external database.

anxstj commented 2 years ago

We found out that our inventory update job had removed one host from the inventory. The dispatcher then updates every job event the host was involved in with the following query:

UPDATE "main_jobevent" SET "host_id" = NULL WHERE "main_jobevent"."id" IN ($LIST_OF_EVENT_IDS);

Each query contains no more than 100 ids in $LIST_OF_EVENT_IDS, so the updates handle at most 100 events at a time and run serially. Our main_jobevent table is approx. 140 GB in size and contains 90 days of data. Each UPDATE query takes about 10 seconds, and we have at least 1,200 events per day per host, which over 90 days adds up to 108,000 events the host was involved in. At 100 events per 10-second query the updates process 36,000 events per hour, so the dispatcher needs more than 3 hours to work through the table.
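For reference, the amount of work a single removed host causes can be estimated directly in the database; the 42 below is just a placeholder for the removed host's id:

-- Sketch: count how many job events still reference the removed host (42 is a placeholder id)
SELECT count(*) FROM main_jobevent WHERE host_id = 42;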

Unfortunately, the CREATE TABLE IF NOT EXISTS main_jobevent_20220726_12 PARTITION OF main_jobevent ... queries are blocked by the process executing the UPDATE statements, which takes those 3 hours. That means that for three hours no new events can be added to the database, because the next partition of the main_jobevent table has not been created yet.
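If this happens again, the relation-level side of the conflict should also be visible in pg_locks, since CREATE TABLE ... PARTITION OF presumably needs a lock on the parent table that conflicts with the one held by the long-running UPDATE (diagnostic sketch only):

-- Diagnostic sketch: who holds or waits for locks on the parent main_jobevent table
SELECT l.pid, l.mode, l.granted, left(a.query, 80) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'main_jobevent'::regclass
ORDER BY l.granted DESC, l.pid;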

And finally, that's causing the uwsgi listen queue to fill up.

So, the dispatcher's UPDATE process shouldn't block the CREATE TABLE statements.

fosterseth commented 2 years ago

this PR might be addressing the same / similar issue https://github.com/ansible/awx/pull/12447

lutskevich commented 1 year ago

I have the same problem: 504 errors and uWSGI listen queue of socket "127.0.0.1:8050" (fd: 3) full !!! (101/100). How can I fix it? AWX 22.4.0

fosterseth commented 1 year ago

@kdelee do you know if your PR would address this? https://github.com/ansible/awx-operator/pull/1487

kdelee commented 1 year ago

I'm not 100% sure how to connect the dots between the uWSGI listen queue filling up and the other issues, other than that the fully consumed CPU may be slowing everything down in general, causing requests to be served more slowly, which in turn lets the API/UI serve fewer clients before the listen queue gets overwhelmed (higher latency for each client, lower throughput per second).

So using https://github.com/ansible/awx-operator/pull/1487 to increase the uWSGI listen queue may give the API/UI a bit more room to be slow before it starts throwing 502/504 errors.