AWX RuntimeError resulting in loss of worker task logs and true outcome

duntonr commented 3 years ago

ISSUE TYPE

Bug Report

SUMMARY

AWX UI reporting stalls eventually errors out with unhelpful

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)

In reality, though, the job has actually continued and either completed successfully OR failed due to a normal (helpful) job/task/play error. This is found by looking at the spawned worker container logs.

The UI error IS raised at the same time as when the JOB completes.

Downloading logs from the UI results in the INCOMPLETE log set, eg not the apparent zip file the worker container uploads back

AWX reports the job as a failure in the job list, even though it succeeded from the actual container logs. This is VERY confusing.

ENVIRONMENT

AWX version: 19.0.0
AWX install method: AWX Operator 0.8.0 via microk8s 1.4.4
Ansible version: 2.11.0b4
Operating System: Ubuntu 20.04
Web Browser: Brave 1.23.71 / Chromium 90.0.4430.72

STEPS TO REPRODUCE

Use standard AWX EE 0.1.1 or derived custom EE container for a long-running/verbose job
Track job progress from both AWX UI as well as container logs from k8s (dashboard, kubectl, etc)
After some point, the UI stops tracking
Continue to monitor the container output of the job
Compare results from job container output to AWX UI (note you need to adjust the cleanup timing in AWX or be on top of tailing / dumping the container logs as the container is cleaned up once the job exits

EXPECTED RESULTS

AWX UI Job output continues to stay synced with actual job output and status match

ACTUAL RESULTS

AWX UI logging falls out of sync from execution and reports job failed with .../tasks.py error, despite actual job outcome. If there was an actual Job error, this is not displayed (so could not tell WHY a job failed without worker pod logs).

ADDITIONAL INFORMATION

After continuously tailing the worker pod logs, I was able to see this:

...
{"uuid": "cfed869a-3ecb-4919-9301-5479b816bec1", "counter": 10218, "stdout": "\r\nPLAY RECAP *********************************************************************\r\n\u001b[0;33mconsul01.internal.therack.io\u001b[0m : \u001b[0;32mok=109 \u001b[0m \u001b[0;33mchanged=39  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=37  \u001b[0m rescued=0    \u001b[1;35mignored=4   \u001b[0m\r\n\u001b[0;33mconsul02.internal.therack.io\u001b[0m : \u001b[0;32mok=97  \u001b[0m \u001b[0;33mchanged=27  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=49  \u001b[0m rescued=0    \u001b[1;35mignored=1   \u001b[0m\r\n\u001b[0;33mconsul03.internal.therack.io\u001b[0m : \u001b[0;32mok=97  \u001b[0m \u001b[0;33mchanged=27  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=49  \u001b[0m rescued=0    \u001b[1;35mignored=1   \u001b[0m\r\n\u001b[0;33mmaas.internal.therack.io\u001b[0m   : \u001b[0;32mok=63  \u001b[0m \u001b[0;33mchanged=21  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=22  \u001b[0m rescued=0    \u001b[1;35mignored=3   \u001b[0m\r\n\u001b[0;33mmarathon.internal.therack.io\u001b[0m : \u001b[0;32mok=64  \u001b[0m \u001b[0;33mchanged=22  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=21  \u001b[0m rescued=0    \u001b[1;35mignored=3   \u001b[0m\r\n", "start_line": 30927, "end_line": 30935, "runner_ident": "209", "event": "playbook_on_stats", "job_id": 209, "pid": 20, "created": "2021-04-20T17:51:16.379887", "parent_uuid": "87632333-0e50-47fe-8754-3fe63b1b5216", "event_data": {"playbook": "playbooks/controlDeck/common/configureServices.yml", "playbook_uuid": "87632333-0e50-47fe-8754-3fe63b1b5216", "changed": {"consul01.internal.therack.io": 39, "consul02.internal.therack.io": 27, "consul03.internal.therack.io": 27, "marathon.internal.therack.io": 22, "maas.internal.therack.io": 21}, "dark": {}, "failures": {}, "ignored": {"consul01.internal.therack.io": 4, "consul02.internal.therack.io": 1, "consul03.internal.therack.io": 1, "marathon.internal.therack.io": 3, "maas.internal.therack.io": 3}, "ok": {"consul01.internal.therack.io": 109, "consul02.internal.therack.io": 97, "consul03.internal.therack.io": 97, "marathon.internal.therack.io": 64, "maas.internal.therack.io": 63}, "processed": {"consul01.internal.therack.io": 1, "consul02.internal.therack.io": 1, "consul03.internal.therack.io": 1, "marathon.internal.therack.io": 1, "maas.internal.therack.io": 1}, "rescued": {}, "skipped": {"consul01.internal.therack.io": 37, "consul02.internal.therack.io": 49, "consul03.internal.therack.io": 49, "marathon.internal.therack.io": 21, "maas.internal.therack.io": 22}, "artifact_data": {}, "uuid": "cfed869a-3ecb-4919-9301-5479b816bec1"}}
{"status": "successful", "runner_ident": "209"}
{"zipfile": 675955}
P)h>@6aWAK2mrv0lu{6R!x41^000*a000L7003ieZEaz0WUW?PZ=....<more binary dump, I'm assuming its a zip file>

As you can see, job 209 finished successfully.

However

I also inspected the container logs for the redis, awx-web, awx-task, and awx-ee in the AWX pod but didn't see anything immediately apparently around the time AWX UI stopped tracking

shanemcd commented 3 years ago

Interesting. Some questions:

Does this happen consistently?
What can you tell us about the workload itself?
- How many tasks?
- Are you producing a lot of stdout?
- Are you making heavy use of ansible facts?

duntonr commented 3 years ago

Hi @shanemcd

Does this happen consistently?
- So far, it seems intermittent or at least is not triggered if the job naturally hits an error earlier in the run. I have a limited amount of data so far (decided to rebuild my AWX environment from scratch due to issues upgrading from pre-18) and am working through configuring (and improving) my playbooks and templates along the way.
Workload
- It is a rather heavy set of plays. It does things like configure partitions/fs/exports, install docker, gusterfs, consul, nomad, vault, do system updates, etc. This is across 5 hosts.
- The UI seemed to only capture 23 plays and 445 tasks. Based on the pod log, its more like 80 plays
- It's currently sent to verbose (level 3) output and if I download the incomplete file from UI, its ~2.5M. When downloading the worker pod logs, it was getting close to 10mb (but the formatting is different there)
- It does make use of facts throughout the plays, eg to determine if a host is a gusterfs server or client, nomad server or client, etc

When running with AWX 17, via local Docker Compose, I did not have these issues for a very similar amount of tasks/stdio

I have some additional templates/playbooks to bring over, including a larger one that will run over ~36 hosts and will keep an eye out there for the same behavior

shanemcd commented 3 years ago

@duntonr Thanks for the information. We did indeed make some pretty radical changes to the architecture of AWX in version 18, so that explains why you weren't seeing this issue before.

I will spend some time tomorrow trying to reproduce this, or at very least improve the error handling.

shanemcd commented 3 years ago

@duntonr One more question: is this error the only thing you see in the stdout? Or was there output before the traceback?

duntonr commented 3 years ago

Thanks @shanemcd !

For what its worth, I came across https://github.com/ansible/awx/issues/9917 and https://groups.google.com/g/awx-project/c/MACNtPrGpV8 as those are the only google hits for File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run

Separately, I had similar issues as reported in 9917 but was able to eventually get a custom EE container builtt and working (to include Galaxy collections) by using the workarounds mentioned there. I needed to use your suggested --build-arg ANSIBLE_RUNNER_IMAGE=quay.io/ansible/awx-ee:0.1.1

If though, I tried the (very new) --build-arg ANSIBLE_RUNNER_IMAGE=quay.io/ansible/awx-ee:0.2.0, I would get

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Pod Running

I would get that same error if I tried the out-of-the-box 0.2.0 quay image as well. I can report the error over there, but mentioning it here in case it helps shed any light on the AWX Runtime vs worker pod relationships

duntonr commented 3 years ago

@duntonr One more question: is this error the only thing you see in the stdout? Or was there output before the traceback?

That's an interesting thing as well... when the job starts, things are normal, eg the log starts with

Enter passphrase for /runner/artifacts/209/ssh_key_data: 
Identity added: /runner/artifacts/209/ssh_key_data (duntonr@gmail.com)
ansible-playbook [core 2.11.0b4] 
...

I can keep scrolling though the logs via AWX UI (double down carrots) until it stalls. At that point, the flashy green indicator keeps flashing to show the job is still running but no new logs.

When the job does end, the indicator will turn red, but again, no new logs/errors/etc at the bottom of the logging window

If I then refresh the Job Output screen, that's when the Traceback error shows up, at the TOP of the log

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished
Enter passphrase for /runner/artifacts/209/ssh_key_data: 
Identity added: /runner/artifacts/209/ssh_key_data (duntonr@gmail.com)
ansible-playbook [core 2.11.0b4] 
...

duntonr commented 3 years ago

Just by way of update, I have been able to run a different playbook across different "stacks" of 4 hosts/stack WITHOUT issue so far.

I've made 5 runs of a template that consists of:

4 hosts
4 plays
~528 tasks
~ 37minutes
L3 logging, resulting in download size of ~71KB log txt file when finished
uses facts

This playbook/template is a slightly lighter version of the one I was having issues with. That said it's mostly the same software and actually imports a lot of the same task files....just these are worker nodes, vs the control/server hosts where the issue was.

This is a home lab, so there aren't any crazy vlaning/security issues running between the two (same subnet, etc) host groups. The only major difference I could think of is the problematic playbook runs against x86-64 hosts, whereas this "worker" playbook runs against arm64v8 hosts. Should not make a difference for this it seems, but figured I would mention it.

Once this rollout is finished, I will run a playbook that touches all ~35 hosts and see whats happens

DrackThor commented 3 years ago

@duntonr I'm experiencing quite the same issues - I often get this output:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

I've recently got my custom EE working on my test cluster (#9917 ), where I also dont get this error. On my second cluster however I keep getting this RuntimeError quite frequently.

ENVIRONMENT AWX version: 19.0.0 AWX install method: AWX Operator 0.8.0 via kubeadm 1.19.10 Ansible version - awx-ee: core 2.12.0.dev0 Operating System: Ubuntu 18.04 Web Browser: Chromium 90.0.4430.93

nlvw commented 3 years ago

I was receiving the same error, and behavior in 19.0.0. In 19.1.0 I no longer receive the error but the behavior remains. I cannot run any job with more then 3 hosts without the job failing mysteriously with cutoff output and no errors.

DrackThor commented 3 years ago

I was receiving the same error, and behavior in 19.0.0. In 19.1.0 I no longer receive the error but the behavior remains. I cannot run any job with more then 3 hosts without the job failing mysteriously with cutoff output and no errors.

I was able to get my EE running like I mentioned in #9917, maybe this also helps you 😄

nlvw commented 3 years ago

@DrackThor I'm using the builtin EE and not a custom one. I've read through the issue you mentioned but failed to see how it address the problem mentioned here.

duntonr commented 3 years ago

Thanks @DrackThor . I'm using a custom EE too but the issue seems to happen with a custom EE (with Galaxy stuff installed in the image) or the standard EE (with Galaxy stuff cobbled right into the repo that hold my playbooks)

I did run a larger set of plays... 28 hosts, ~ 3.5hrs run time, etc... issue still occurs

It's somewhat interesting that error is injected at the TOP of the output in awx.

Also, the error seems to occur quicker on my play against the 5 x86 hosts vs the play against 28 arm64v8 hosts

shanemcd commented 3 years ago

@duntonr I tried to improve the error handling in 19.1.0. Can you try again and paste the new error?

shanemcd commented 3 years ago

I think we've finally gotten to the bottom of this. Testing patch at https://github.com/ansible/receptor/pull/319

rysolv-bot commented 3 years ago

travelingladybug has contributed $50.00 to this issue on Rysolv.

The total bounty is now $50.00. Solve this issue on Rysolv to earn this bounty.

duntonr commented 3 years ago

@shanemcd - Sorry for the delay but 19.1 did NOT solve the issue. The same issue/behavior remains with:

Operator 0.9
AWX 19.1
Custom awx-ee image, based off quay.io/ansible/awx-ee:0.2.0 (just using Dockerfile FROM, not using ansible-builder anymore)

I was excited by ansible/receptor#319, until I read that issues updates :( . It does "feel" kinda like a lock or contention type issue though

shanemcd commented 3 years ago

In terms of what it means to test this: we shouldn't see this error anymore.

This was fixed in the latest alpha release of Receptor, which is going out in AWX 19.2.0 sometime today or tomorrow.

duntonr commented 2 years ago

Thanks @shanemcd ! I will upgrade operator to 0.10 / awx to 19.2 this weekend and rerun plays and report back

duntonr commented 2 years ago

unfortunately, the issue seems to persist with 19.2. after the first attempt. Playbook reporting hung at basically the same spot as before.

I will try again and see if I can snag better logs out of the pods

shanemcd commented 2 years ago

Hi @duntonr - sorry to hear that…

can you please confirm that you are using awx-ee 0.3.0, or have built on top of it?

hurdigurdi commented 2 years ago

Hi @shanemcd I'm running AWX 19.2.0 installed via Operator 0.10.0 and AWX-EE 0.3.0. Unfortunately, I've also run into

RuntimeError: Pod Running

using a job with 10 slices a 10 forks.

Have not been able to get more logs from the pods yet, but its consistently happening when running against a big set of servers with multiple slices.

duntonr commented 2 years ago

I'm still testing but I actually had success with a custom AWX-EE 0.2.0 image (NOT 0.3.0) with AWX 19.2 for a long run (4.5 hours / 28 hosts).

I'm trying different combinations of AWX-EE 0.2.0 and 0.3.0 with the two main offending playbooks to see if I can provide better info

kmtenhouse commented 2 years ago

@duntonr Might you be able to share what you customized in your 0.2.0 image for this? I'm facing a similar issue in 19.0.0 for a very long-running playbook (~4.5 hours, 1000+ hosts).

I am also planning to move to 19.2.1 + newer EE to see if that helps; wondering if I should go with 0.2.0 or jump to 0.4.0.

kladiv commented 2 years ago

Hi all, same issue RuntimeError: Pod Running using:

AWX 19.0.0
EE 0.2.0 or 0.3.0 or 0.4.0
Instance Group: tower (container group)
Ubuntu 20.04
Kubernetes OS: k3s v1.20.7+k3s1

Tested on playbook with single target node (so no massive run)

kazigk commented 2 years ago

I encountered similar issue. The difference is I'm not getting any RuntimeError nor play recap in pod logs.

I'm using:

AWX 19.2.0
AWX EE 0.4.0 (also tested with 0.2.0 and 0.3.0)
AWX Operator 0.10.0
Ubuntu 20.04 running microk8s v1.20.7

It happens to me while running a verbose job (about 1700 lines of logs) with strategy: free on 100+ hosts. Less verbose jobs and jobs with linear strategy aren't causing any issues. Running the same job on a smaller part of inventory also helps.

More information: ansible/awx-ee#75

kdelee commented 2 years ago

I tested a fix @shanemcd has on a branch.....that fixes this error message handling...all we need now is for that to get merged to awx@devel

kdelee commented 2 years ago

should be fixed by 1ed170fff096dfdee01c6d6c449632c6ac72cff4

shanemcd commented 2 years ago

To be clear, the unhelpful RuntimeError: Pod Running error should be fixed by the aforementioned commit.

kazigk commented 2 years ago

1ed170f didn't fix it for me. I'm currently running:

AWX Operator 0.12.0
AWX 19.2.2 (which includes said commit)
AWX EE 0.5.0

Update: I tried once again on the same playbook and it finished without any error. 115 hosts, 2500 lines of logs (stdout). Sadly I can't get logs from the previous run as pod is already deleted. I will post an update if I get any issues in near future.

kazigk commented 2 years ago

I'm still facing the issue, but in my case I'm not getting any error nor play recap in automation-job pod logs. The job finishes like it should, but the output is incomplete, so AWX can't even mark hosts as failed, which is very unhelpful. Running the same job on smaller part of inventory (or just splicing it) does solve the issue, but it's less readable (a few different stdout logs to look at isn't ideal).

Maybe I didn't understand what the commit mentioned above should fix or I'm encountering a different issue that just fits the description?

daenney commented 2 years ago

I'm observing this issue too. We initially had a run that ended with the following traceback and a bunch of log output truncated.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

It was really slow, so we ran it again with bigger instances powering it, and now we've ended up with:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Pod Running

I was following the logs with kubectl, and it ends with:

rpc error: code = Unknown desc = Error: No such container: 51b1416e42f635758859cfd2890a9f1021e48d56ff5f2006835c52bcb8483fd8

The logs right before it where just normal play output. Like the container just terminated.

Doing the same job in smaller batches seems to do the trick for us too.

daenney commented 2 years ago

We've also had a case where running with a larger batch it makes it to the end of the run, but seems to crash when flushing the recap.

The last of the logs in the pod show:

{"status": "failed", "runner_ident": "29178"}
{"zipfile": 28832}
P)h>@6a [..] some output that looks like it's straight out of The Matrix [..] 3SpWb400{"eof": true}

The traceback was:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

davilmb commented 2 years ago

Hi @shanemcd and everyone!

I tested the fix mentioned in version 19.2.0, but without success, the same happened when upgrading to 19.2.2 and the new one release 19.3.0.

For the tests I used the awx-ee default (0.5.0) and also the 0.6.0 even the latest tag, always the same problem, after 4 hours of execution the Job is marked with "Error", even if it's terminate successfully in container log.

As an example below:

Note: I also ran the test on a new installation with version 18.0.0 (where there was no report of the problem), but I had the same problem and this didn't happen in the version that I was using previously (14.1.0).

Is there any palliative measure that we can use in this cases ?

Awaiting further comments regarding this issue.

nlvw commented 2 years ago

I've only found 2 solutions to this problem. Either have your templates split your job using a high number of slices or revert back to an earlier version of AWX before all of this Kubernetes only refactor stuff began. I really hope this issue gets more attention as it has made the product unusable.

If you deploy AWX 16 (before the incomplete UI redesign for stability) you can update ansible within the docker container to 10.x relatively safely. This will give you the git collection downloads from the requirements.yml file. The steps are relatively simple just anter a bash shell in the ansible container, yum uninstall ansible, and then pip install the desired ansible version.

nicovs commented 2 years ago

See: https://github.com/ansible/awx/issues/10366#issuecomment-944060894

kladiv commented 2 years ago

Hi @nicovs , i tested the suggested changes (in #10366 comment) in our infrastructure (k3s) but the job still failing after 4:02 hours.

AWX 19.3.0
AWX EE (latest) built-in installed by AWX installation

Job is a Workflow Job, and one playbook runs a task like below:

  raw: "ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete"
  register: quiesce_ps
  failed_when: false
  until: quiesce_ps.rc == 1
  retries: "{{ quiesce_wait_max_retries }}"
  delay: "{{ quiesce_wait_retries_delay }}"
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

or like below (i tried a change to check if got same error):

  shell: >-
    RC=0;
    while [ $RC -eq 0 ]; do
      sleep 60;
      ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete;
      RC=$?;
    done
  register: quiesce_status
  async: 10800 # 3 hrs
  poll: 60 # 1 min
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

Both the until/retries/delay task and async/poll task fails the Job without any error after about 4hrs. Every time it runs, it fails after 4 hrs.

Another playbook task (it makes XenServer big VM export via command module) fails the Job after about 14hrs without any error:

Below the logs i see in _/var/log/pods/default_awx-777854cdfb-z2bs4b24bc0a5-ca74-4561-89ed-378ddbed4d08/awx-task/1.log

2021-10-17T07:43:48.929344109+08:00 stderr F 2021-10-16 23:43:48,928 INFO     [ba8cf785179d40f19847119a0a9dfef2] awx.main.commands.run_callback_receiver Event processing is finished for Job 1090, sending notifications
2021-10-17T07:43:48.929927511+08:00 stderr F 2021-10-16 23:43:48,928 INFO     [ba8cf785179d40f19847119a0a9dfef2] awx.main.commands.run_callback_receiver Event processing is finished for Job 1090, sending notifications
2021-10-17T07:43:49.42528382+08:00 stderr F 2021-10-16 23:43:49,409 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.tasks job 1090 (running) finished running, producing 389 events.
2021-10-17T07:43:49.425323136+08:00 stderr F 2021-10-16 23:43:49,421 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.analytics.job_lifecycle job-1090 post run
2021-10-17T07:43:49.612110438+08:00 stderr F 2021-10-16 23:43:49,611 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.analytics.job_lifecycle job-1090 finalize run
2021-10-17T07:43:49.640764078+08:00 stderr F 2021-10-16 23:43:49,639 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task d2b391fd-10bc-488d-9269-c4d6c18fbfed starting awx.main.tasks.update_inventory_computed_fields(*[2])
2021-10-17T07:43:49.683846539+08:00 stderr F 2021-10-16 23:43:49,683 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.models.inventory Going to update inventory computed fields, pk=2
2021-10-17T07:43:49.714765297+08:00 stderr F 2021-10-16 23:43:49,714 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.models.inventory Finished updating inventory computed fields, pk=2, in 0.029 seconds
2021-10-17T07:43:50.735156488+08:00 stderr F 2021-10-16 23:43:50,733 WARNING  [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch job 1090 (error) encountered an error (rc=None), please see task stdout for details.
2021-10-17T07:43:50.736988429+08:00 stderr F 2021-10-16 23:43:50,736 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b starting awx.main.tasks.handle_work_error(*['2b73b673-6d40-4c65-b1a1-7588fd1a9c5b'])
2021-10-17T07:43:50.738166086+08:00 stderr F 2021-10-16 23:43:50,737 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.tasks Executing error task id 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b, subtasks: [{'type': 'job', 'id': 1090}]
2021-10-17T07:43:50.775480607+08:00 stderr F 2021-10-16 23:43:50,770 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b starting awx.main.tasks.handle_work_success(*[])

mw-0 commented 2 years ago

@smullenrga have you tried the same but change your log sizes. They are 16k default.

smullenrga commented 2 years ago

@mw-0 I'm looking into the options for log sizes, I'm simply a consumer of our k8s cluster so will possibly have to work with the cluster admins to make any global changes. Will update once I've had a chance to change log settings. Thanks for the reply.

smullenrga commented 2 years ago

Cleaning up/deleting several earlier comments.

We're running Kubernetes and using the GELF driver to log to an ELK stack (configured via /etc/docker/daemon.json). AWX changed somewhere between 15 (worked fine in our environment) and 19.4 to create separate STDOUT lines for each event. In 19.4, very long STDOUT lines (>16K) are being generated which are being split up and then improperly reassembled somewhere. Per https://github.com/moby/moby/pull/22982 a 16K limit / split was put on STDOUT lines in Docker 1.13.

In our environment, after upgrading from AWX 15 to AWX 19.4, AWX Jobs break when a STDOUT line >16K is encountered because whatever is re-assembling these docker-split long lines is failing to put a carriage return on the reassembled line that goes back to AWX. As a result, you get the 16+K JSON object rebuilt as expected however, you also end up with the following JSON even log line appended to the end of the prior long line and the JSON parsing breaks.

As a result of the failed log parsing, the jobs are marked failed regardless of true status. 01-GoodEvents 02-BadEvents 03-ParsedFormattedStart 04-Truncated end of output

shanemcd commented 2 years ago

Can y'all take a look at https://github.com/ansible/awx/issues/11338#issuecomment-972708525 ? Wondering if this is the same thing.

daenney commented 2 years ago

@shanemcd Sure looks like it from my point of view. Unfortunately tweaking kubelet args isn't always an option (it's possible but very annoying with managed node pools etc. in certain clouds) so it would be nice to find a way around it. Ideally whatever is tailing/streaming the log needs to handle file rotation transparently.

smullenrga commented 2 years ago

@shanemcd This does not look like my issue, perhaps I should open a new issue. I get all of the logs from kubectl, and I get the full JSON output at the top of the job output after the job's errored out. My overall log is only like 700K and that's only because I'm intentionally generating event data's over 16K (which happens naturally with a windows gather facts when the event body contais all of the host vars on at least some of our systems). My issue really seems to be about the log lines being split over 16K and whatever's reassembling them not putting a carriage return on the line. In the "stack trace" at the top of the failed job in AWX, I see the full event items in JSON format, each on its own line UNLESS the event is over 16K, as soon as it crosses 16K for the event JSON object, the next event's JSON object is appended to the end of the 16+K line and that's the point at which AWX lists it as failed and the pretty/formatted output stops.

bschaeffer commented 2 years ago

From #11511:

FWIW, this does seem to be related to container runtime. In GKE, we recently upgraded to 1.20 which defaults to containerd at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but it will be EOL for us in about 5 months.

kladiv commented 2 years ago

Hi, interesting. Containerd is the default CRI in most Kubernetes distributions so i hope this issue will be solved soon in AWX side 'cause it could affect all users in the next future. Thank you

Best, Claudio

Il giorno mar 18 gen 2022 alle ore 15:21 Braden Schaeffer < @.***> ha scritto:

From #11511 https://github.com/ansible/awx/issues/11511:

FWIW, this does seem to be related to container runtime. In GKE, we recently upgraded to 1.20 which defaults to containerd at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but it will be EOL for us in about 5 months.

— Reply to this email directly, view it on GitHub https://github.com/ansible/awx/issues/9961#issuecomment-1015458743, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANYNFVYP2RTERBNKRW3NELUWVZPDANCNFSM43IX6QDQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

--

Ciao, Claudio

mail @ @.***

smullenrga commented 2 years ago

I think I've found my root cause - the GELF log driver in docker is not updated to handle the 16K log limit. Our enterprise cluster that AWX is on is configured to use the GELF driver to send logs to ELK and fails as noted above.

AWX on Docker Desktop with default configs works fine, no failure.

As soon as I change Docker Desktop to the GELF driver, I get the same failure.

Looking at the docker (moby) source code for the default jsonfilelog driver, it's updated to read PLogMetaData (which contains the partial log message info) and concatenate lines as needed.

The fluentd driver reads the metadata and passes it on to the log recipient.

The GELF driver has no processing of metadata or line concatenation logic from what I can see and therefore passes the bad partial messages through without any metadata needed for reassembly.

I don't know if AWX is written to handle the docker split log line metadata / reassembly itself or if it is expecting to receive the log lines already reassembled. I'm working on testing the fluentd driver to see if it breaks AWX as well. As far as I can tell, using the jsonfilelog log driver in docker will fix my issue but results in the problem of not being able to send logs to our logging systems as I'm required to do.

shanemcd commented 2 years ago

Going to close in favor of https://github.com/ansible/awx/issues/11338 since it pinpoints the underlying issue.

ansible / awx