+1 We are facing the same problem, I was just about to create a new issue when I saw this one.
I've been digging for a few hours and the TL;DR is that I traced the problem to jobs with a lot of output causing the uWSGI thread to crash. Increasing the memory for the awx-web container to 2GB fixed the job overview not loading correctly but we need to further increase it to be able to load the job output.
We saw the uWSGI thread get killed with a signal 9, which indicates an OOM scenario. Once we isolated it to a specific job ID and could reproduce the issue reliably, we turned on query logging in PostgreSQL. This got us the exact offending query, almost immediately followed by an EOF error. Running the query on the PostgreSQL server directly and sending the output to a file resulted in a ~700MB file (in human-readable form, so not what is actually sent to AWX).
As I stated at the start, increasing the request/limit for the awx-web container fixed the job overview and job details pages but the job output still won't load.
I'm hesitant to increase the memory even further; that just feels like masking the underlying problem. This only started happening after upgrading from 20.0.1 to 21.3.0.
> This got us the exact offending query,
would you be able to get us that query? @fust
also wondering if the same issue can be replicated by hitting the api/v2/jobs/<jobid>/job_events/children_summary/ endpoint -- this endpoint will attempt to load in some of the data for each event produced by that job. There is currently no cap on how much data it will attempt to load, which might be the cause of the crashing uwsgi worker
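(If anyone wants to gauge how heavy that endpoint is on their own instance, here is a minimal sketch -- the host, token and job id below are placeholders, not values from this thread -- that requests it and reports how large the response body is:)

import requests  # assumes the requests library is installed

AWX_HOST = "https://awx.example.com"  # placeholder
TOKEN = "YOUR_OAUTH2_TOKEN"           # placeholder personal access token
JOB_ID = 12                           # placeholder job id

resp = requests.get(
    f"{AWX_HOST}/api/v2/jobs/{JOB_ID}/job_events/children_summary/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=300,
)
print(resp.status_code, len(resp.content), "bytes")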
doing some local testing against the api/v2/jobs/12/job_events/children_summary/ endpoint, I can see the uwsgi process spike up to 1.9 GB of memory usage.
So if you have a large number of events for a job, and multiple users (or even a single user refreshing the job output page) trying to view the stdout for that job, you can easily breach 2 GB for the entire container.
@fust @2and3makes23 do you get the 502 if you append ?x= when landing on the job output page, e.g. #/jobs/playbook/10/output?x= ? This tricks the browser into avoiding loading all events at once for that job.
We're seeing this same behavior with 21.2.0 as well, and also only on jobs that enter an "ERROR" state. Other jobs, even with large outputs/tasks, display correctly, albeit slowly.
However, a user viewing the output of a job that is in an "ERROR" state crashes the awx-p-web container; I've seen 5 GB+ be requested before it OOMs.
We're running in GKE, and we're using Cloud SQL. In the pgsql logs (we have not turned on logging for all queries, though) one oddity that does come up is that it appears to be doing a DB query for the pods in an ERROR state, but we can't see corresponding DB queries for pods that are in a successful or failed state. I do not know if this is relevant or not, but it is rather confusing that we only see queries against the database when we try to view an ERROR'd job.
The query that's being run is:
db=awxp,user=awx STATEMENT: SELECT "main_unifiedjob"."id", "main_unifiedjob"."polymorphic_ctype_id", "main_unifiedjob"."modified", "main_unifiedjob"."description", "main_unifiedjob"."created_by_id", "main_unifiedjob"."modified_by_id", "main_unifiedjob"."name", "main_unifiedjob"."execution_environment_id", "main_unifiedjob"."old_pk", "main_unifiedjob"."emitted_events", "main_unifiedjob"."unified_job_template_id", "main_unifiedjob"."created", "main_unifiedjob"."launch_type", "main_unifiedjob"."schedule_id", "main_unifiedjob"."execution_node", "main_unifiedjob"."controller_node", "main_unifiedjob"."cancel_flag", "main_unifiedjob"."status", "main_unifiedjob"."failed", "main_unifiedjob"."started", "main_unifiedjob"."dependencies_processed", "main_unifiedjob"."finished", "main_unifiedjob"."canceled_on", "main_unifiedjob"."elapsed", "main_unifiedjob"."job_args", "main_unifiedjob"."job_cwd", "main_unifiedjob"."job_env", "main_unifiedjob"."job_explanation", "main_unifiedjob"."start_args", "main_unifiedjob"."result_traceback", "main_unifiedjob"."celery_task_id", "main_unifiedjob"."instance_group_id", "main_unifiedjob"."organization_id", "main_unifiedjob"."installed_collections", "main_unifiedjob"."ansible_version", "main_unifiedjob"."host_status_counts", "main_unifiedjob"."work_unit_id", "main_job"."unifiedjob_ptr_id", "main_job"."survey_passwords", "main_job"."custom_virtualenv", "main_job"."webhook_service", "main_job"."webhook_credential_id", "main_job"."webhook_guid", "main_job"."diff_mode", "main_job"."job_type", "main_job"."inventory_id", "main_job"."project_id", "main_job"."playbook", "main_job"."scm_branch", "main_job"."forks", "main_job"."limit", "main_job"."verbosity", "main_job"."extra_vars", "main_job"."job_tags", "main_job"."force_handlers", "main_job"."skip_tags", "main_job"."start_at_task", "main_job"."become_enabled", "main_job"."allow_simultaneous", "main_job"."timeout", "main_job"."use_fact_cache", "main_job"."job_template_id", "main_job"."artifacts", "main_job"."scm_revision", "main_job"."project_update_id", "main_job"."job_slice_number", "main_job"."job_slice_count" FROM "main_job" INNER JOIN "main_unifiedjob" ON ("main_job"."unifiedjob_ptr_id" = "main_unifiedjob"."id") WHERE "main_job"."unifiedjob_ptr_id" = 7997 LIMIT 21
Running the query on the database produces an ~11MB file. Also, we can see the query we ran directly against the database in the logs as well, supporting the earlier question of why only the ERROR'd jobs show up as queries against the DB.
When we run this same query against an identical playbook run, but one from a week prior that did not enter an error'd state, we see a 5.6KB file instead of 11MB. That is a striking difference in and of itself.
We also tried to search through the code to find where this comes from, and could not locate a "LIMIT 21"; we also can't find any inner joins with the exception of those against the RBAC tables.
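(Side note: the LIMIT 21 likely isn't anywhere in AWX's code because Django adds it itself -- QuerySet.get() caps the query at MAX_GET_RESULTS (21) rows so it can detect MultipleObjectsReturned, so any .get(pk=...) on the job shows up in the PostgreSQL log with LIMIT 21. A rough illustration, assuming a Django shell inside the awx-web container, e.g. via awx-manage shell:)

from awx.main.models import Job

job = Job.objects.get(pk=7997)          # Django itself emits the SELECT ... LIMIT 21 seen above
print(len(job.result_traceback or ""))  # size of the result_traceback text for this job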
Lastly, hitting /api/v2/jobs/ returns:
{"children_summary":{},"meta_event_nested_uuid":{},"event_processing_finished":true,"is_tree":false}
You just tested this as I was typing, but I want to point out that our issues surround jobs that enter an "ERROR" state; for successful jobs we see memory spikes, but nothing crashes.
(A second lastly, since you commented!):
Hitting /#/jobs/playbook/
Thanks for looking into this.
I've had to search for a bit to find the query; we haven't seen any of these issues as of late. That might be because I've been hunting down and fixing all sorts of things to minimize the number of failing jobs. As @wtmthethird stated, it seems to be triggered on failed jobs alone.
The relevant part of the PostgreSQL log is this:
2022-08-11 09:12:25.675 CEST 62f4abd9.17207 00000 11 LOG: statement: BEGIN
2022-08-11 09:12:25.676 CEST 62f4abd9.17207 00000 12 LOG: statement: SELECT COUNT(*) AS "__count" FROM "main_unifiedjob" LEFT OUTER JOIN "main_workflowapproval" ON ("main_unifiedjob"."id" = "main_workflowapproval"."unifiedjob_ptr_id") WHERE "main_workflowapproval"."unifiedjob_ptr_id" IS NULL
2022-08-11 09:12:25.679 CEST 62f4abd9.17207 00000 13 LOG: statement: SELECT "main_unifiedjob"."id", "main_unifiedjob"."polymorphic_ctype_id", "main_unifiedjob"."modified", "main_unifiedjob"."description", "main_unifiedjob"."created_by_id", "main_unifiedjob"."modified_by_id", "main_unifiedjob"."name", "main_unifiedjob"."execution_environment_id", "main_unifiedjob"."old_pk", "main_unifiedjob"."emitted_events", "main_unifiedjob"."unified_job_template_id", "main_unifiedjob"."created", "main_unifiedjob"."launch_type", "main_unifiedjob"."schedule_id", "main_unifiedjob"."execution_node", "main_unifiedjob"."controller_node", "main_unifiedjob"."cancel_flag", "main_unifiedjob"."status", "main_unifiedjob"."failed", "main_unifiedjob"."started", "main_unifiedjob"."dependencies_processed", "main_unifiedjob"."finished", "main_unifiedjob"."canceled_on", "main_unifiedjob"."elapsed", "main_unifiedjob"."job_args", "main_unifiedjob"."job_cwd", "main_unifiedjob"."job_env", "main_unifiedjob"."job_explanation", "main_unifiedjob"."start_args", "main_unifiedjob"."result_traceback", "main_unifiedjob"."celery_task_id", "main_unifiedjob"."instance_group_id", "main_unifiedjob"."organization_id", "main_unifiedjob"."installed_collections", "main_unifiedjob"."ansible_version", "main_unifiedjob"."host_status_counts", "main_unifiedjob"."work_unit_id" FROM "main_unifiedjob" LEFT OUTER JOIN "main_workflowapproval" ON ("main_unifiedjob"."id" = "main_workflowapproval"."unifiedjob_ptr_id") WHERE "main_workflowapproval"."unifiedjob_ptr_id" IS NULL ORDER BY "main_unifiedjob"."finished" DESC LIMIT 10 OFFSET 20
2022-08-11 09:12:27.050 CEST 62f4abd9.17207 08006 14 LOG: unexpected EOF on client connection with an open transaction
As you can see, the uWSGI process crashes during retrieval of the data, leaving the PostgreSQL server with an unexpected EOF.
As I said, we haven't seen these issues lately, which reinforces the theory that it is related to failed jobs. Unfortunately this means I can't easily reproduce this.
@fosterseth
> @fust @2and3makes23 do you get the 502 if you append ?x= when landing on the job output page, e.g. #/jobs/playbook/10/output?x= ?
We will try that as soon as we hit another error-state job. Or, if you could shed some light on how to provoke a large number of events in a playbook, we are happy to try that as well.
@fosterseth
We had a job triggering this error today so I checked.
Unfortunately, adding ?x= to the job output URL still gives us the 502 error.
We are running AWX version 21.1.0 and we see the memory footprint increasing. We are running all of AWX on a single-node k3s cluster and 16GB of memory was not enough. Looking forward to a fix or workaround for this.
@fosterseth We have two Error-state jobs again (IDs: 161866, 161867), they derive from the same job template.
Like you, I tried querying .../api/v2/jobs/161867/job_events/children_summary (same behavior for both jobs this morning):
2022/08/30 09:43:18 [error] 32#32: *45695 upstream prematurely closed connection while reading response header from upstream, client: <IP>, server: _, request: "GET /api/v2/jobs/161867/job_events/children_summary/ HTTP/1.1", upstream: "uwsgi://127.0.0.1:8050", host: "<awx_hostname>"
<IP> - - [30/Aug/2022:09:43:18 +0000] "GET /api/v2/jobs/161867/job_events/children_summary/ HTTP/1.1" 502 150 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0" "<IP>, <IP>"
DAMN ! worker 2 (pid: 182) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 185)
mounting awx.wsgi:application on /
WSGI app 0 (mountpoint='/') ready in 3 seconds on interpreter 0x10e1620 pid: 185 (default app)
Update: In the afternoon the above query worked for both jobs, while viewing the respective job details at .../api/v2/jobs/161867 still produced a 502.
And I tried your idea with ?x= appended (/#/jobs/playbook/161866/output?x=), which produces a 502 and the same uWSGI messages as above.
For what it's worth, it looks like at least some of our jobs that are entering ERROR'd state are a result of this: https://github.com/ansible/awx/issues/9961 -> https://github.com/ansible/awx/issues/11338. And then, when we try to view the log output, that's when the AWX web container starts to run away with all the memory and crash. So that might be one way to replicate this.
@wtmthethird At least for us it's unlikely that's the case. We've increased the maximum log size to 500MB and we're not seeing the behaviour from #11338 on any job after that. Why some jobs fail seems to be random, as is why only failed jobs with large logs seem to produce this issue. They mostly chug along just fine, but once in a while this problem occurs.
This makes it very hard to do any debugging on the uWSGI application, which is further complicated by the fact that it only happens in our production environment. I'd love to hear from @fosterseth if there is an easy way to trigger a failed job with a large log; I'd be happy to spend some time digging through the code to find the root cause.
@fust you could trigger a failed job by calling receptorctl work release on the work unit id:
kubectl exec -it deploy/awx -c awx-ee -- bash
and then run receptorctl --socket /var/run/awx-receptor/receptor.sock work release work_unit_id
You could also probably just do a kubectl delete pod on the job pod to trigger the failure.
Maybe wait until a lot of stdout is produced before running that command.
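(If you'd rather script that, here is a minimal sketch of the same idea, assuming kubectl access to the AWX namespace; the work unit id is a placeholder you would read from the job's /api/v2/jobs/<id>/ record:)

import subprocess

work_unit_id = "AbCdEfGh"  # placeholder; take it from the job's work_unit_id field
subprocess.run(
    [
        "kubectl", "exec", "deploy/awx", "-c", "awx-ee", "--",
        "receptorctl", "--socket", "/var/run/awx-receptor/receptor.sock",
        "work", "release", work_unit_id,
    ],
    check=True,  # raise if the release command fails
)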
Maybe open the browser dev tools, go to the network tab and navigate to the job detail page for that job. Which API request yields the 502?
@wtmthethird
> Running the query on the database produces an ~11MB file. Also, we can see the query we ran directly against the database in the logs as well, supporting the earlier question of why only the ERROR'd jobs show up as queries against the DB.
wondering what the contents of the output is -- my guess is the result_traceback is a massive text blob? That would explain why you only see this in failed jobs.
@wtmthethird 's hint about the db tipped me off -- the sql they posted was essentially just a unified job record from the DB. The only field that could be arbitrarily big is result_traceback, since it is just a text blob. That would also explain why successful jobs were small, as the result_traceback would be empty.
@fust and @wtmthethird also mentioned issue #11338 as a source of this problem, so I started there
I got a reproducer by doing the following:
start minikube
ssh into minikube and add this to the docker daemon config at /etc/docker/daemon.json:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "15k",
"max-file": "3",
"labels": "production_status",
"env": "os,customer"
}
}
This limits the size of the stdout of containers to 15 KB.
Run service docker restart, then let the pods come back online.
start a chatty playbook that will quickly generate stdout, and the job should fail pretty quickly.
navigate to the job endpoint api/v2/jobs/id/. What you'll find is that result_traceback contains the entire stdout of that job. So that explains why API calls to this endpoint would OOM uwsgi if this field gets too large.
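(If you want to check whether an existing install already has jobs in this state, here is a rough sketch to run from a Django shell inside the awx-web container, e.g. awx-manage shell; the 10 MB cutoff is an arbitrary assumption:)

from django.db.models.functions import Length
from awx.main.models import UnifiedJob

big = (
    UnifiedJob.objects
    .annotate(tb_len=Length("result_traceback"))
    .filter(tb_len__gt=10 * 1024 * 1024)   # arbitrary 10 MB threshold
    .values_list("id", "status", "tb_len")
)
print(list(big))  # jobs whose result_traceback is suspiciously large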
Hi,
...we hit a similar issue that might have the same root cause. It was initially reported on Google Groups (see https://groups.google.com/g/awx-project/c/BTQ51PblMaI for the full thread); I was asked to supply some output from the AWX deployment in this GitHub issue.
-> what does "/api/v2/jobs/id/job_events/children_summary" return for the failed job?
result:
{ "children_summary": { "2": { "rowNumber": 1, "numChildren": 282 }, "3": { "rowNumber": 2, "numChildren": 281 }, "4": { "rowNumber": 3, "numChildren": 2 }, "7": { "rowNumber": 6, "numChildren": 277 } }, "meta_event_nested_uuid": {}, "event_processing_finished": true, "is_tree": true }
-> what does that same endpoint return for the same job, but when successful?
Result:
{"detail":"Nicht gefunden."} (which means "not found")
-> Do you get 502 errors when viewing the job detail page in the UI? open the browser dev tools > network tab. When you go the job output page in the UI, but don't scroll down. Do any requests give a 502 or take a long time to respond?
Result: No - no 502s and a reasonable amount of time to respond (5s).
But - if I pick up the scrollbar with my mouse and move it halfway down, I can see a lot of requests being sent to the backend and then there is a 502... Additionally, I can see the response size is between 1 and 7 MB - that seems a bit large for loading the events. Here's a screenshot, after I moved the scrollbar down:
Once more - thanks for looking into this!
Best regards,
Andreas
@andreasbourges thanks for this detailed information!
my current running hypothesis is that uwsgi blows up because the result_traceback on the failed/error job is enormous.
Can you verify that that job has a result_traceback with tons of text?
Hmmm... how would I verify this?
@andreasbourges sorry, I should have said: this is a field on the API endpoint of the job, so just navigate to api/v2/jobs/job_id and look for "result_traceback" (it probably goes off the right side of the screen, which you can scroll to view).
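(If the raw API page is too unwieldy to scroll through, a small sketch -- host, token and job id are placeholders -- that only reports the size of the field instead of rendering it:)

import requests

AWX_HOST = "https://awx.example.com"  # placeholder
TOKEN = "YOUR_OAUTH2_TOKEN"           # placeholder
JOB_ID = 1234                         # placeholder job id

job = requests.get(
    f"{AWX_HOST}/api/v2/jobs/{JOB_ID}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=300,
).json()
print(len(job.get("result_traceback") or ""), "characters of result_traceback")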
Hi,
...your assumption seems to be correct - the result_traceback has a size of 15 MB (but is this enough to make uwsgi go nuts?). But what makes the job fail without further notice? Is it the sheer size of the data? (It looks like I receive the whole device configuration from Nautobot for each of the ~400 devices - that might be the problem.)
Will have a look at this.
Thanks,
Andreas
@andreasbourges the underlying issue seems to be container stdout log rotation, as described here: https://github.com/ansible/awx/issues/11338. You need to make sure your environment has the container max log size set to a value that is sufficiently large; it should be larger than the expected stdout of your ansible job.
Just to let you know - reducing the data received from the inventory and tuning container log rotation made our jobs run stable again!
Thanks for your support!
we are attempting to address this issue here https://github.com/ansible/receptor/pull/683
Bug Summary
When browsing Jobs or other job-related lists via the UI or API, nginx produces 502 Bad Gateway errors when certain jobs are part of the result to be shown to the user.
Those jobs have in common that they ended in an error state.
Log entries that might be of interest:
...
[10/Aug/2022:08:12:02 +0000] "GET /websocket/broadcast/ HTTP/1.1" 502 150 "-" "Python/3.9 aiohttp/3.7.4" "-"
2022/08/10 08:12:02 [error] 31#31: *1 connect() failed (111: Connection refused) while connecting to upstream, client:, server: _, request: "GET /websocket/broadcast/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/broadcast/", host: ""
2022/08/10 08:12:02 [error] 31#31: *1 connect() failed (111: Connection refused) while connecting to upstream, client:, server: _, request: "GET /websocket/broadcast/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/broadcast/", host: ":8052"