ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

K8S json parse error #14889

Open marianskrzypekk opened 7 months ago

marianskrzypekk commented 7 months ago

Bug Summary

[screenshot: JSON parse error shown in the AWX web UI] The error started happening around 3 days ago; I tested it on AWX 23.7.0, 23.8.0, and 23.8.1.

  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: disabled

This also doesn't help. Based on other similar bug reports I also tested the watcher limit and the max open file limit. The problem happens only on k8s hosts; after setting receptor_kube_support_reconnect to disabled the job finishes successfully, but the error in the web UI still persists. Based on some comments I also tried awx-ee in most of the available versions, unfortunately without success. The automation_job pod logs are large, which might be the problem, so as a next step I tried increasing container-log-max-size, but that didn't help either. The verbosity level on the job also doesn't change anything.
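
For reference, ee_extra_env is a top-level field of the AWX custom resource spec managed by awx-operator; a minimal sketch of where the snippet above sits (the resource name awx-demo is a placeholder):

  apiVersion: awx.ansible.com/v1beta1
  kind: AWX
  metadata:
    name: awx-demo                # placeholder resource name
  spec:
    # ee_extra_env is a literal YAML string that gets injected into
    # the env section of the execution environment container
    ee_extra_env: |
      - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
        value: disabled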

AWX version

AWX 23.7.0 & 23.8.0 & 23.8.1 (tried on all 3)

Installation method

kubernetes

Modifications

no

Ansible version

ansible 2.9.6 (on the host)

Operating system

Debian

Web browser

Firefox, Chrome, Safari, Edge

Steps to reproduce

Run a playbook on a large k8s cluster (I tried 3 different k8s clusters).

Expected results

The job finishes successfully, including in the web UI.

Actual results

The job finishes, but the error in the web UI persists, as in the screenshot above.

Additional information

No response

AdityaVishwekar commented 7 months ago

Experiencing the same issue, but with inventory sync:

  kubectl get nodes
  NAME    STATUS   ROLES           AGE    VERSION
  Node1   Ready    control-plane   543d   v1.25.0
  Node2   Ready    <none>          543d   v1.25.0
  Node3   Ready    <none>          543d   v1.25.0
  Node4   Ready    <none>          543d   v1.25.0

  [root@astdc-k8sawx01p ~]# kubelet --version
  Kubernetes v1.25.0

The AWX UI errors out.

fosterseth commented 7 months ago

For these jobs that are ending in Error: are your automation job pods completing successfully? You can disable pod cleanup by adding

  extra_settings:
  - setting: RECEPTOR_RELEASE_WORK
    value: "False"

to your AWX spec file (note: only do this for debugging purposes!).

Are the pods in a Completed status? If you tail the logs of the job pod, do you see the zipfile contents?
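
For context, a sketch of where that setting sits in the full AWX custom resource (the resource name awx-demo is a placeholder):

  apiVersion: awx.ansible.com/v1beta1
  kind: AWX
  metadata:
    name: awx-demo                # placeholder resource name
  spec:
    # with RECEPTOR_RELEASE_WORK set to "False", receptor keeps
    # finished work units (and their automation job pods) around
    # for inspection instead of cleaning them up
    extra_settings:
      - setting: RECEPTOR_RELEASE_WORK
        value: "False"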

marianskrzypekk commented 7 months ago

> For these jobs that are ending in Error: are your automation job pods completing successfully? Are the pods in a Completed status?

Yes, they end successfully. Based on kubectl logs on the automation_job pods I see the full output, and yes, they have Completed status.

> If you tail the logs of the job pod, do you see the zipfile contents?

Also yes, the log ends with the zipfile contents, and the jobs also run correctly on the machines.

TheRealHaoLiu commented 3 months ago

@marianskrzypekk based on the error message in your screenshot (next time please copy and paste the text), it seems like the data stream was cut mid-"line", causing a malformed message

Please go to /api/v2/jobs/<job_id>/ and provide us with the result_traceback for further debugging.

Since this issue has been open since February: if the problem has been resolved and/or is no longer reproducible, please close it.