Open anxstj opened 2 years ago
We found that our inventory update job removed one host from the inventory. The dispatcher then updates every job event the host was involved in with the following query:
```sql
UPDATE "main_jobevent" SET "host_id" = NULL WHERE "main_jobevent"."id" IN ($LIST_OF_EVENT_IDS);
```
Each batch contains no more than 100 ids in $LIST_OF_EVENT_IDS, so these UPDATE queries handle at most 100 events at a time and run serially. Our main_jobevent table is approx. 140 GB in size and holds 90 days of data. Each UPDATE query takes about 10 seconds, and we see at least 1,200 events per day for this host, which adds up to 108,000 events over the retention window. At 100 events per 10 seconds, the updates process 36,000 events per hour, so the dispatcher needs more than 3 hours to work through the table.
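As a rough sanity check on those numbers, a count along these lines (assuming direct access to the AWX database; the host id 12345 is a placeholder) shows how many job events still reference the affected host:

```sql
-- Placeholder host id (12345); substitute the id of the removed host.
-- Counts the rows the dispatcher will have to NULL out in batches.
SELECT count(*)
FROM main_jobevent
WHERE host_id = 12345;
```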
Unfortunately, the `CREATE TABLE IF NOT EXISTS main_jobevent_20220726_12 PARTITION OF main_jobevent ...`
queries are blocked by the process executing those UPDATE statements, which runs for the full 3 hours. That means that for three hours no new events can be written to the database, because the next partition of the main_jobevent table has not been created yet.
And finally, that causes the uwsgi listen queue to fill up.
So the dispatcher's UPDATE process shouldn't block the CREATE TABLE statements.
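For reference, while such a hang is in progress, a generic PostgreSQL lock-inspection query along these lines (not AWX-specific; assumes PostgreSQL 9.6 or newer) should show the CREATE TABLE backend waiting on the pid of the UPDATE backend:

```sql
-- List backends that are waiting on a lock, the pids blocking them,
-- and the query each blocked backend is trying to run.
SELECT a.pid,
       pg_blocking_pids(a.pid) AS blocked_by,
       a.wait_event_type,
       a.state,
       left(a.query, 80) AS query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0;
```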
This PR might be addressing the same or a similar issue: https://github.com/ansible/awx/pull/12447
I have the same problem: 504 errors and `uWSGI listen queue of socket "127.0.0.1:8050" (fd: 3) full !!! (101/100)`. How can I fix it? AWX 22.4.0
@kdelee do you know if your PR would address this? https://github.com/ansible/awx-operator/pull/1487
I'm not 100% sure how to connect the dots between the uwsgi listen queue filling up and the other issues, other than that the CPU being 100% consumed may be slowing everything down in general, causing requests to be served more slowly, which in turn may mean the API/UI can serve fewer clients before the listen queue is overwhelmed (higher latency for each client, lower throughput per second).
So using https://github.com/ansible/awx-operator/pull/1487 to increase the uwsgi listen queue may give the API/UI a bit more room to be slow before it starts throwing 504/502 errors.
Bug Summary
In the last couple of days we experienced an unreachable AWX (504 gateway errors). The container logs showed us:
Looking at the database, we saw a PostgreSQL process fully consuming a CPU core:
Which was caused by the following query:
The following advisory locks were held at that moment:
What struck me here is the other advisory lock that is granted:
That's a simple query, and yet it had been running for more than an hour.
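(For anyone retracing this: the advisory locks can be listed with a standard pg_locks query, for example the sketch below.)

```sql
-- Show all advisory locks, which backend holds or awaits each one,
-- and the query that backend is currently running.
SELECT l.pid, l.classid, l.objid, l.granted, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory';
```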
Now I tried to find out which job was causing the UPDATE query (pid 2314769):
This shows the IP and source port. Going to the AWX node and searching for that open port reveals:
The job with ID 1010791 is an inventory synchronization job with source "VMware Center", and it had been running for more than an hour, even though older runs took only a couple of seconds. Killing that process resolved the deadlock and the listen queue was freed up.
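In case it helps anyone retrace these steps: the client address and source port for a given backend pid can be read from pg_stat_activity (a generic PostgreSQL query, using the pid from above) and then matched against ss or lsof output on the AWX node:

```sql
-- Map a PostgreSQL backend pid to the client address and source port
-- of the connection that issued the query (pid taken from above).
SELECT pid, client_addr, client_port, state, query
FROM pg_stat_activity
WHERE pid = 2314769;
```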
I'm wondering whether the UPDATE query issued by the inventory update job and the CREATE TABLE query are blocking each other? Both are trying to access the same partition of the main_jobevent table.
Another explanation would be that the inventory update job got stuck for some reason (e.g. a broken network connection) and the advisory lock wasn't released because the SQL connection was still open, while the CREATE TABLE query was waiting on the other lock but holding the advisory lock for the new partition. On the other hand, the UPDATE query was still burning CPU cycles, which doesn't really look like a deadlock.
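One way to distinguish between the two explanations next time this happens (a sketch of a generic pg_stat_activity check, not something we ran during this incident): if wait_event_type shows 'Lock' the backend is blocked on a lock, whereas an active backend with no wait event is genuinely consuming CPU:

```sql
-- Non-idle backends ordered by runtime; a 'Lock' wait_event_type means
-- blocked, while state = 'active' with no wait event means busy on CPU.
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```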
We experienced this situation at least three times, always with the same pattern.
AWX version
21.2.0
Select the relevant components
Installation method
docker development environment
Modifications
yes
Ansible version
2.12.2
Operating system
Debian 11
Web browser
Firefox
Steps to reproduce
I don't know yet.
Expected results
The listen queue doesn't pile up.
Actual results
The listen queue is piling up.
Additional information
The development environment uses an external database.