+1 We are facing the same problem, I was just about to create a new issue when I saw this one.
I've been digging for a few hours and the TL;DR is that I traced the problem to jobs with a lot of output causing the uWSGI thread to crash. Increasing the memory for the awx-web container to 2GB fixed the job overview not loading correctly but we need to further increase it to be able to load the job output.
We saw the uWSGI thread get killed with a signal 9, which indicates an OOM scenario. Once we isolated it to a specific job ID and could reproduce the issue reliably, we turned on query logging in PostgreSQL. This got us the exact offending query, almost immediately followed by an EOF error. Running the query on the PostgreSQL server directly and sending the output to a file resulted in a ~700MB file (in human-readable form, so not what is actually sent to AWX).
As I stated at the start, increasing the request/limit for the awx-web container fixed the job overview and job details pages but the job output still won't load.
I'm hesitant to increase the memory even further; that just feels like masking the underlying problem. This only started happening after upgrading from 20.0.1 to 21.3.0.
> This got us the exact offending query,
would you be able to get us that query? @fust
also wondering if the same issue can be replicated by hitting the api/v2/jobs/<jobid>/job_events/children_summary/ endpoint -- this endpoint will attempt to load in some of the data for each event produced by that job. There is currently no cap on how much data it will attempt to load, which might be the cause of the crashing uwsgi worker
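(If anyone wants to gauge how heavy that endpoint is on their own instance, here is a minimal sketch -- the host, token and job id below are placeholders, not values from this thread -- that requests it and reports how large the response body is:)

import requests  # assumes the requests library is installed

AWX_HOST = "https://awx.example.com"  # placeholder
TOKEN = "YOUR_OAUTH2_TOKEN"           # placeholder personal access token
JOB_ID = 12                           # placeholder job id

resp = requests.get(
    f"{AWX_HOST}/api/v2/jobs/{JOB_ID}/job_events/children_summary/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=300,
)
print(resp.status_code, len(resp.content), "bytes")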
doing some local testing against the api/v2/jobs/12/job_events/children_summary/ endpoint, I can see the uwsgi process spike up to 1.9 GB of memory usage.
So if you have a large number of events for a job, and multiple users (or even a single user refreshing the job output page) trying to view the stdout for that job, you can easily breach 2 GB for the entire container.
@fust @2and3makes23 do you get the 502 if you append ?x= when landing on the job output page, e.g. #/jobs/playbook/10/output?x= ? This tricks the browser into avoiding loading all events at once for that job.
We're seeing this same behavior with 21.2.0 as well, and also only on jobs that enter an "ERROR" state. Other jobs, even with large outputs/tasks, display correctly, albeit slowly.
However, a user viewing the output of a job that is in an "ERROR" state crashes the awx-p-web container; I've seen 5 GB+ be requested before it OOMs.
We're running in GKE, and we're using Cloud SQL. In the pgsql logs (we have not turned on logging for all queries, though) one oddity that does come up is that it appears to be doing a DB query for the pods in an ERROR state, but we can't see corresponding DB queries for pods that are in a successful or failed state. I do not know if this is relevant or not, but it is rather confusing that we only see queries against the database when we try to view an ERROR'd job.
The query that's being run is:
db=awxp,user=awx STATEMENT: SELECT "main_unifiedjob"."id", "main_unifiedjob"."polymorphic_ctype_id", "main_unifiedjob"."modified", "main_unifiedjob"."description", "main_unifiedjob"."created_by_id", "main_unifiedjob"."modified_by_id", "main_unifiedjob"."name", "main_unifiedjob"."execution_environment_id", "main_unifiedjob"."old_pk", "main_unifiedjob"."emitted_events", "main_unifiedjob"."unified_job_template_id", "main_unifiedjob"."created", "main_unifiedjob"."launch_type", "main_unifiedjob"."schedule_id", "main_unifiedjob"."execution_node", "main_unifiedjob"."controller_node", "main_unifiedjob"."cancel_flag", "main_unifiedjob"."status", "main_unifiedjob"."failed", "main_unifiedjob"."started", "main_unifiedjob"."dependencies_processed", "main_unifiedjob"."finished", "main_unifiedjob"."canceled_on", "main_unifiedjob"."elapsed", "main_unifiedjob"."job_args", "main_unifiedjob"."job_cwd", "main_unifiedjob"."job_env", "main_unifiedjob"."job_explanation", "main_unifiedjob"."start_args", "main_unifiedjob"."result_traceback", "main_unifiedjob"."celery_task_id", "main_unifiedjob"."instance_group_id", "main_unifiedjob"."organization_id", "main_unifiedjob"."installed_collections", "main_unifiedjob"."ansible_version", "main_unifiedjob"."host_status_counts", "main_unifiedjob"."work_unit_id", "main_job"."unifiedjob_ptr_id", "main_job"."survey_passwords", "main_job"."custom_virtualenv", "main_job"."webhook_service", "main_job"."webhook_credential_id", "main_job"."webhook_guid", "main_job"."diff_mode", "main_job"."job_type", "main_job"."inventory_id", "main_job"."project_id", "main_job"."playbook", "main_job"."scm_branch", "main_job"."forks", "main_job"."limit", "main_job"."verbosity", "main_job"."extra_vars", "main_job"."job_tags", "main_job"."force_handlers", "main_job"."skip_tags", "main_job"."start_at_task", "main_job"."become_enabled", "main_job"."allow_simultaneous", "main_job"."timeout", "main_job"."use_fact_cache", "main_job"."job_template_id", "main_job"."artifacts", "main_job"."scm_revision", "main_job"."project_update_id", "main_job"."job_slice_number", "main_job"."job_slice_count" FROM "main_job" INNER JOIN "main_unifiedjob" ON ("main_job"."unifiedjob_ptr_id" = "main_unifiedjob"."id") WHERE "main_job"."unifiedjob_ptr_id" = 7997 LIMIT 21
Running the query on the database produces an ~11MB file. Also, we can see the query we ran directly against the database in the logs as well, supporting the earlier question of why only the ERROR'd jobs show up as queries against the DB.
When we run this same query against an identical playbook run, but one from a week prior that did not enter an error'd state, we see a 5.6KB file instead of 11MB. That is a striking difference in and of itself.
We also tried to search through the code to find where this comes from, and could not locate a "LIMIT 21"; we also can't find any inner joins with the exception of those against the RBAC tables.
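(Side note: the LIMIT 21 likely isn't anywhere in AWX's code because Django adds it itself -- QuerySet.get() caps the query at MAX_GET_RESULTS (21) rows so it can detect MultipleObjectsReturned, so any .get(pk=...) on the job shows up in the PostgreSQL log with LIMIT 21. A rough illustration, assuming a Django shell inside the awx-web container, e.g. via awx-manage shell:)

from awx.main.models import Job

job = Job.objects.get(pk=7997)          # Django itself emits the SELECT ... LIMIT 21 seen above
print(len(job.result_traceback or ""))  # size of the result_traceback text for this job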
Lastly, hitting /api/v2/jobs/ returns:
{"children_summary":{},"meta_event_nested_uuid":{},"event_processing_finished":true,"is_tree":false}
You just tested this as I was typing, but I want to point out that our issues surround jobs that enter an "ERROR" state; for successful jobs we see memory spikes, but nothing crashes.
(A second lastly, since you commented!):
Hitting /#/jobs/playbook/
Thanks for looking into this.
I've had to search for a bit to find the query; we haven't seen any of these issues as of late. That might be because I've been hunting down and fixing all sorts of things to minimize the number of failing jobs. As @wtmthethird stated, it seems to be triggered on failed jobs alone.
The relevant part of the PostgreSQL log is this:
2022-08-11 09:12:25.675 CEST 62f4abd9.17207 00000 11 LOG: statement: BEGIN
2022-08-11 09:12:25.676 CEST 62f4abd9.17207 00000 12 LOG: statement: SELECT COUNT(*) AS "__count" FROM "main_unifiedjob" LEFT OUTER JOIN "main_workflowapproval" ON ("main_unifiedjob"."id" = "main_workflowapproval"."unifiedjob_ptr_id") WHERE "main_workflowapproval"."unifiedjob_ptr_id" IS NULL
2022-08-11 09:12:25.679 CEST 62f4abd9.17207 00000 13 LOG: statement: SELECT "main_unifiedjob"."id", "main_unifiedjob"."polymorphic_ctype_id", "main_unifiedjob"."modified", "main_unifiedjob"."description", "main_unifiedjob"."created_by_id", "main_unifiedjob"."modified_by_id", "main_unifiedjob"."name", "main_unifiedjob"."execution_environment_id", "main_unifiedjob"."old_pk", "main_unifiedjob"."emitted_events", "main_unifiedjob"."unified_job_template_id", "main_unifiedjob"."created", "main_unifiedjob"."launch_type", "main_unifiedjob"."schedule_id", "main_unifiedjob"."execution_node", "main_unifiedjob"."controller_node", "main_unifiedjob"."cancel_flag", "main_unifiedjob"."status", "main_unifiedjob"."failed", "main_unifiedjob"."started", "main_unifiedjob"."dependencies_processed", "main_unifiedjob"."finished", "main_unifiedjob"."canceled_on", "main_unifiedjob"."elapsed", "main_unifiedjob"."job_args", "main_unifiedjob"."job_cwd", "main_unifiedjob"."job_env", "main_unifiedjob"."job_explanation", "main_unifiedjob"."start_args", "main_unifiedjob"."result_traceback", "main_unifiedjob"."celery_task_id", "main_unifiedjob"."instance_group_id", "main_unifiedjob"."organization_id", "main_unifiedjob"."installed_collections", "main_unifiedjob"."ansible_version", "main_unifiedjob"."host_status_counts", "main_unifiedjob"."work_unit_id" FROM "main_unifiedjob" LEFT OUTER JOIN "main_workflowapproval" ON ("main_unifiedjob"."id" = "main_workflowapproval"."unifiedjob_ptr_id") WHERE "main_workflowapproval"."unifiedjob_ptr_id" IS NULL ORDER BY "main_unifiedjob"."finished" DESC LIMIT 10 OFFSET 20
2022-08-11 09:12:27.050 CEST 62f4abd9.17207 08006 14 LOG: unexpected EOF on client connection with an open transaction
As you can see, the uWSGI process crashes during retrieval of the data, leaving the PostgreSQL server with an unexpected EOF.
As I said, we haven't seen these issues lately, which reinforces the theory that it is related to failed jobs. Unfortunately this means I can't easily reproduce this.
@fosterseth
> @fust @2and3makes23 do you get the 502 if you append ?x= when landing on the job output page, e.g. #/jobs/playbook/10/output?x= ?
We will try that as soon as we hit another error-state job. Or, if you could shed some light on how to provoke a large number of events in a playbook, we are happy to try that as well.
@fosterseth
We had a job triggering this error today so I checked.
Unfortunately, adding ?x= to the job output URL still gives us the 502 error.
We are running AWX version 21.1.0 and we see the memory footprint increasing. We are running all of AWX on a single-node k3s cluster and 16GB of memory was not enough. Looking forward to a fix or workaround for this.
@fosterseth We have two Error-state jobs again (IDs: 161866, 161867), they derive from the same job template.
Like you, I tried querying .../api/v2/jobs/161867/job_events/children_summary (same behavior for both jobs this morning):
2022/08/30 09:43:18 [error] 32#32: *45695 upstream prematurely closed connection while reading response header from upstream, client: <IP>, server: _, request: "GET /api/v2/jobs/161867/job_events/children_summary/ HTTP/1.1", upstream: "uwsgi://127.0.0.1:8050", host: "<awx_hostname>"
<IP> - - [30/Aug/2022:09:43:18 +0000] "GET /api/v2/jobs/161867/job_events/children_summary/ HTTP/1.1" 502 150 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0" "<IP>, <IP>"
DAMN ! worker 2 (pid: 182) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 185)
mounting awx.wsgi:application on /
WSGI app 0 (mountpoint='/') ready in 3 seconds on interpreter 0x10e1620 pid: 185 (default app)
Update: In the afternoon the above query worked for both jobs, while viewing the respective job details at .../api/v2/jobs/161867 still produced a 502.
And I tried your idea with ?x= appended (/#/jobs/playbook/161866/output?x=), which produces a 502 and the same uWSGI messages as above.
For what it's worth, it looks like at least some of our jobs that are entering ERROR'd state are a result of this: https://github.com/ansible/awx/issues/9961 -> https://github.com/ansible/awx/issues/11338. And then, when we try to view the log output, that's when the AWX web container starts to run away with all the memory and crash. So that might be one way to replicate this.
@wtmthethird At least for us it's unlikely that's the case. We've increased the maximum log size to 500MB and we're not seeing the behaviour from #11338 on any job after that. Why some jobs fail seems to be random, as is why only failed jobs with large logs seem to produce this issue. They mostly chug along just fine, but once in a while this problem occurs.
This makes it very hard to do any debugging on the uWSGI application, which is further complicated by the fact that it only happens in our production environment. I'd love to hear from @fosterseth if there is an easy way to trigger a failed job with a large log; I'd be happy to spend some time digging through the code to find the root cause.
@fust you could trigger a failed job by calling receptorctl work release on the work unit id:
kubectl exec -it deploy/awx -c awx-ee -- bash
and then run receptorctl --socket /var/run/awx-receptor/receptor.sock work release work_unit_id
You could also probably just do a kubectl delete pod on the job pod to trigger the failure.
Maybe wait until a lot of stdout is produced before running that command.
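(If you'd rather script that, here is a minimal sketch of the same idea, assuming kubectl access to the AWX namespace; the work unit id is a placeholder you would read from the job's /api/v2/jobs/<id>/ record:)

import subprocess

work_unit_id = "AbCdEfGh"  # placeholder; take it from the job's work_unit_id field
subprocess.run(
    [
        "kubectl", "exec", "deploy/awx", "-c", "awx-ee", "--",
        "receptorctl", "--socket", "/var/run/awx-receptor/receptor.sock",
        "work", "release", work_unit_id,
    ],
    check=True,  # raise if the release command fails
)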
Maybe open the browser dev tools, go to the network tab and navigate to the job detail page for that job. Which API request yields the 502?
@wtmthethird
> Running the query on the database produces an ~11MB file. Also, we can see the query we ran directly against the database in the logs as well, supporting the earlier question of why only the ERROR'd jobs show up as queries against the DB.
wondering what the contents of the output is -- my guess is the result_traceback is a massive text blob? That would explain why you only see this in failed jobs.
@wtmthethird 's hint about the db tipped me off -- the sql they posted was essentially just a unified job record from the DB. The only field that could be arbitrarily big is result_traceback, since it is just a text blob. That would also explain why successful jobs were small, as the result_traceback would be empty.
@fust and @wtmthethird also mentioned issue #11338 as a source of this problem, so I started there
I got a reproducer by doing the following:
start minikube
ssh into minikube and add this to the docker daemon config at /etc/docker/daemon.json:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "15k",
"max-file": "3",
"labels": "production_status",
"env": "os,customer"
}
}
This limits the size of the stdout of containers to 15 KB.
Run service docker restart, then let the pods come back online.
start a chatty playbook that will quickly generate stdout, and the job should fail pretty quickly.
navigate to the job endpoint api/v2/jobs/id/. What you'll find is that result_traceback contains the entire stdout of that job. So that explains why API calls to this endpoint would OOM uwsgi if this field gets too large.
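(If you want to check whether an existing install already has jobs in this state, here is a rough sketch to run from a Django shell inside the awx-web container, e.g. awx-manage shell; the 10 MB cutoff is an arbitrary assumption:)

from django.db.models.functions import Length
from awx.main.models import UnifiedJob

big = (
    UnifiedJob.objects
    .annotate(tb_len=Length("result_traceback"))
    .filter(tb_len__gt=10 * 1024 * 1024)   # arbitrary 10 MB threshold
    .values_list("id", "status", "tb_len")
)
print(list(big))  # jobs whose result_traceback is suspiciously large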
Hi,
...we hit a similar issue that might have the same root cause. It was initially reported on Google Groups (see https://groups.google.com/g/awx-project/c/BTQ51PblMaI for the full thread); I was asked to supply some output from the AWX deployment in this GitHub issue.
-> what does "/api/v2/jobs/id/job_events/children_summary" return for the failed job?
result:
{ "children_summary": { "2": { "rowNumber": 1, "numChildren": 282 }, "3": { "rowNumber": 2, "numChildren": 281 }, "4": { "rowNumber": 3, "numChildren": 2 }, "7": { "rowNumber": 6, "numChildren": 277 } }, "meta_event_nested_uuid": {}, "event_processing_finished": true, "is_tree": true }
-> what does that same endpoint return for the same job, but when successful?
Result:
{"detail":"Nicht gefunden."} (which means "not found")
-> Do you get 502 errors when viewing the job detail page in the UI? open the browser dev tools > network tab. When you go the job output page in the UI, but don't scroll down. Do any requests give a 502 or take a long time to respond?
Result: No - no 502s and a reasonable amount of time to respond (5s).
But - if I pick up the scrollbar with my mouse and move it halfway down, I can see a lot of requests being sent to the backend and then there is a 502... Additionally, I can see the response size is between 1 and 7 MB - that seems a bit large for loading the events. Here's a screenshot, after I moved the scrollbar down:
Once more - thanks for looking into this!
Best regards,
Andreas
@andreasbourges thanks for this detailed information!
my current running hypothesis is that uwsgi blows up because the result_traceback on the failed/error job is enormous.
Can you verify that that job has a result_traceback with tons of text?
Hmmm... how would I verify this?
@andreasbourges sorry, I should have said: this is a field on the API endpoint of the job, so just navigate to api/v2/jobs/job_id and look for "result_traceback" (it probably goes off the right side of the screen, which you can scroll to view).
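(If the raw API page is too unwieldy to scroll through, a small sketch -- host, token and job id are placeholders -- that only reports the size of the field instead of rendering it:)

import requests

AWX_HOST = "https://awx.example.com"  # placeholder
TOKEN = "YOUR_OAUTH2_TOKEN"           # placeholder
JOB_ID = 1234                         # placeholder job id

job = requests.get(
    f"{AWX_HOST}/api/v2/jobs/{JOB_ID}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=300,
).json()
print(len(job.get("result_traceback") or ""), "characters of result_traceback")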
Hi,
...your assumption seems to be correct - the result_traceback has a size of 15 MB (but is this enough to make uwsgi go nuts?). But what makes the job fail without further notice? Is it the sheer size of the data? (It looks like I receive the whole device configuration from Nautobot for each of the ~400 devices - that might be the problem.)
Will have a look at this.
Thanks,
Andreas
@andreasbourges the underlying issue seems to be container stdout log rotation, as described here: https://github.com/ansible/awx/issues/11338. You need to make sure your environment has the container max log size set to a value that is sufficiently large; it should be larger than the expected stdout of your ansible job.
Just to let you know - reducing the data received from the inventory and tuning container log rotation made our jobs run stable again!
Thanks for your support!
we are attempting to address this issue here https://github.com/ansible/receptor/pull/683
Bug Summary
When browsing Jobs or other job-related lists via the UI or API, nginx produces 502 Bad Gateway errors when certain jobs are part of the result to be shown to the user.
Those jobs have in common that they ended in an error state.
Log entries that might be of interest:
...
[10/Aug/2022:08:12:02 +0000] "GET /websocket/broadcast/ HTTP/1.1" 502 150 "-" "Python/3.9 aiohttp/3.7.4" "-"
2022/08/10 08:12:02 [error] 31#31: *1 connect() failed (111: Connection refused) while connecting to upstream, client:, server: _, request: "GET /websocket/broadcast/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/broadcast/", host: ""
2022/08/10 08:12:02 [error] 31#31: *1 connect() failed (111: Connection refused) while connecting to upstream, client:, server: _, request: "GET /websocket/broadcast/ HTTP/1.1", upstream: "http://127.0.0.1:8051/websocket/broadcast/", host: ":8052"