ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Job gets stuck randomly on a random task. #12396

Open d3al opened 2 years ago

d3al commented 2 years ago

Please confirm the following

Bug Summary

I did a clean install of AWX (21.1.0) 2 weeks ago via the operator on a DigitalOcean VM (4 CPU, 8 GB RAM).

The install went fine. I created the templates and inventories manually, copying them from an AWX install (v13) that is still in production. Project files sync via git and I'm able to run jobs. But I'm hitting a problem so often that I don't trust a job to complete; it may require multiple attempts.

I have many templates, but three of them are on a daily schedule. The issue isn't limited to those three, and it happens whether the job runs from the schedule or manually. The jobs seem to get stuck at random: sometimes on the same task, sometimes on a different one. Sometimes the job completes; sometimes it gets stuck partway through a pass over the inventory. These jobs run 3 machines at a time, out of say 20 in the inventory. It's common for the first few passes to have no issues and then the job gets stuck on one machine on one task. It seems random, but there is a good chance the job will get stuck.

If I let the job run, it sits in a waiting state for several hours and eventually cancels itself. The task it stops on doesn't seem to be anything interesting. Jobs also seem to take longer to complete than on the existing AWX deployment, so I'm not sure if there is a timeout problem or if AWX is waiting for a response that never comes. All of these machines are remote, yet I've never had this problem on the v13 deployment.

I'm not sure what to look for so I'm just raising the issue to get direction.

AWX version

21.1.0

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Ubuntu 22.04

Web browser

No response

Steps to reproduce

N/A

Expected results

Job completes without pausing.

Actual results

The job gets stuck on a random task, but it's not consistent. It could be a different task in the job, or the job may complete just fine.

Additional information

No response

d3al commented 2 years ago

Poking around a bit.

kubectl exec -ti deploy/awx -c awx-task -- /bin/bash

I look at the log /var/log/tower/dispatcher.log
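(For anyone following along, the same log can be tailed without opening an interactive shell; this assumes the operator's default deployment name awx and the awx-task container as in the command above.)

kubectl exec deploy/awx -c awx-task -- tail -n 200 -f /var/log/tower/dispatcher.log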

I see two errors.

1.

2022-06-10 17:42:11,356 ERROR    [9118a6b4cb024462860fa974127d7020] awx.main.dispatch.periodic PID:44 encountered an error while scheduling periodic tasks
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/periodic.py", line 37, in run
    self.run_pending()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/schedule/__init__.py", line 94, in run_pending
    self._run_job(job)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/schedule/__init__.py", line 147, in _run_job
    ret = job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/schedule/__init__.py", line 466, in run
    ret = self.job_func()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/publish.py", line 86, in apply_async
    with pg_bus_conn() as conn:
  File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/__init__.py", line 50, in pg_bus_conn
    conn = psycopg2.connect(dbname=conf['NAME'], host=conf['HOST'], user=conf['USER'], password=conf['PASSWORD'], port=conf['PORT'], **conf.get("OPTIONS", {}))
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not translate host name "awx-postgres" to address: Name or service not known

2022-06-10 17:52:03,407 WARNING  [5bfc05f82254461185ccb17274b8a02d] awx.main.dispatch PID:143 job 67 (canceled) was canceled (rc=1)
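The first error is a plain DNS failure from the dispatcher: the awx-postgres service name stopped resolving inside the cluster. A quick way to check resolution and confirm the Postgres service exists from the same pod is something like the following; the awx namespace is a placeholder, and this assumes getent is available in the awx-task image.

kubectl exec deploy/awx -c awx-task -- getent hosts awx-postgres
kubectl get svc -n awx | grep -i postgres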

2.

2022-06-17 13:17:36,647 WARNING  [5316cdd5244c4fbdb12adefbcb93a960] awx.main.dispatch PID:4942 job 222 (error) encountered an error (rc=None), please see task stdout for details.
2022-06-17 17:48:49,311 ERROR    [763b1bebc6904b8ca3a2dd0578cbf79f] awx.main.dispatch PID:5742 Worker failed to run task awx.main.tasks.system.purge_old_stdout_files(*[], **{}
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 90, in perform_work
    result = self.run_callable(body)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 65, in run_callable
    return _call(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 326, in purge_old_stdout_files
    for f in os.listdir(settings.JOBOUTPUT_ROOT):
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/awx/job_status'
2022-06-17 20:32:31,900 WARNING  [20450ed92f3b4f119f8cfe66517489c4] awx.main.dispatch PID:5742 job 226 (canceled) was canceled (rc=1)

I've checked /var/lib/awx/ and there is no job_status directory or file. I created the directory and ran the job again. I'll check this log if the job gets stuck again; I have a feeling the job fails because this folder doesn't exist. I will update soon.
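(If anyone wants to do the same non-interactively, the directory can be created with something like the command below, assuming the same deployment/container names as above. Note that a directory created this way would not survive a pod restart unless the path is backed by a persistent volume.)

kubectl exec deploy/awx -c awx-task -- mkdir -p /var/lib/awx/job_status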

d3al commented 2 years ago

I'm getting the following series of logs on the host machine, timed exactly to when a job fails.

I've also increased the job slicing from 1 to 3.

When I get these logs, I notice at least one of the sliced jobs fails while the other 2 continue to run. I've only checked this twice, but more jobs might fail. I've updated this server and rebooted into the latest kernel.
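(For anyone trying to reproduce: job slicing is a per-template setting and can be inspected or changed through the REST API as well as the UI. The host, token, and template ID below are placeholders.)

curl -s -H "Authorization: Bearer $AWX_TOKEN" -H "Content-Type: application/json" \
     -X PATCH https://awx.example.com/api/v2/job_templates/42/ \
     -d '{"job_slice_count": 3}'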

Jun 18 19:03:15 cms kernel: [30145.232828] IPv6: ADDRCONF(NETDEV_CHANGE): veth1872d1a1: link becomes ready
Jun 18 19:03:15 cms kernel: [30145.232887] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jun 18 19:03:15 cms systemd-networkd[579]: veth1872d1a1: Link UP
Jun 18 19:03:15 cms systemd-networkd[579]: veth1872d1a1: Gained carrier
Jun 18 19:03:15 cms kernel: [30145.239809] cni0: port 9(veth1872d1a1) entered blocking state
Jun 18 19:03:15 cms kernel: [30145.239813] cni0: port 9(veth1872d1a1) entered disabled state
Jun 18 19:03:15 cms kernel: [30145.240258] device veth1872d1a1 entered promiscuous mode
Jun 18 19:03:15 cms kernel: [30145.240535] cni0: port 9(veth1872d1a1) entered blocking state
Jun 18 19:03:15 cms kernel: [30145.240538] cni0: port 9(veth1872d1a1) entered forwarding state
Jun 18 19:03:15 cms systemd[1]: run-containerd-runc-k8s.io-ae86762ce1d0f7862628ba02619f540fbf5ff97543cfdb9d6d8b20ba90a66830-runc.KoGMEH.mount: Deactivated successfully.
Jun 18 19:03:16 cms k3s[752]: I0618 19:03:16.284416     752 topology_manager.go:200] "Topology Admit Handler"
Jun 18 19:03:16 cms networkd-dispatcher[692]: WARNING:Unknown index 15 seen, reloading interface list
Jun 18 19:03:16 cms systemd-networkd[579]: vethae97a41a: Link UP
Jun 18 19:03:16 cms systemd-networkd[579]: vethae97a41a: Gained carrier
Jun 18 19:03:16 cms systemd-udevd[27304]: Using default interface naming scheme 'v249'.
Jun 18 19:03:16 cms kernel: [30145.728340] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jun 18 19:03:16 cms kernel: [30145.737471] cni0: port 10(vethae97a41a) entered blocking state
Jun 18 19:03:16 cms kernel: [30145.737477] cni0: port 10(vethae97a41a) entered disabled state
Jun 18 19:03:16 cms kernel: [30145.742672] device vethae97a41a entered promiscuous mode
Jun 18 19:03:16 cms kernel: [30145.743290] cni0: port 10(vethae97a41a) entered blocking state
Jun 18 19:03:16 cms kernel: [30145.743294] cni0: port 10(vethae97a41a) entered forwarding state
Jun 18 19:03:16 cms k3s[752]: I0618 19:03:16.696269     752 topology_manager.go:200] "Topology Admit Handler"
Jun 18 19:03:16 cms networkd-dispatcher[692]: WARNING:Unknown index 16 seen, reloading interface list
Jun 18 19:03:16 cms systemd-networkd[579]: veth613d9f42: Link UP
Jun 18 19:03:16 cms systemd-networkd[579]: veth613d9f42: Gained carrier
Jun 18 19:03:16 cms kernel: [30146.166464] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jun 18 19:03:16 cms kernel: [30146.166727] cni0: port 11(veth613d9f42) entered blocking state
Jun 18 19:03:16 cms kernel: [30146.166731] cni0: port 11(veth613d9f42) entered disabled state
Jun 18 19:03:16 cms kernel: [30146.166802] device veth613d9f42 entered promiscuous mode
Jun 18 19:03:16 cms kernel: [30146.166845] cni0: port 11(veth613d9f42) entered blocking state
Jun 18 19:03:16 cms kernel: [30146.166847] cni0: port 11(veth613d9f42) entered forwarding state
Jun 18 19:03:17 cms systemd-networkd[579]: veth1872d1a1: Gained IPv6LL
Jun 18 19:03:17 cms systemd[1]: run-containerd-runc-k8s.io-e1395273997a3000c09a71e2c66c9bc06ca7f50556030b1a9f5a54ac38142879-runc.BCNgDM.mount: Deactivated successfully.
Jun 18 19:03:17 cms systemd-networkd[579]: vethae97a41a: Gained IPv6LL
Jun 18 19:03:18 cms systemd-networkd[579]: veth613d9f42: Gained IPv6LL

d3al commented 2 years ago

I've updated k3s to 1.24.1, which is the latest as of today, and I'm still getting stuck jobs.

I ran some jobs, and as I watched a job run through its tasks I also watched the syslog on the host. When a job got stuck on a task and didn't progress, there was no syslog output. It seems like a timeout is occurring. Is there a log I should be watching? Any way to enable debug logging?
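(One thing worth trying while waiting for direction: with the default container group in this version, each job runs in its own automation-job-<job id>-* pod, so its stdout can be followed directly while it looks stuck. The awx namespace and the pod name below are placeholders. Raising the job template's verbosity in the UI, e.g. to 3 (Debug), should also make the task output more detailed.)

kubectl get pods -n awx | grep automation-job
kubectl logs -f -n awx automation-job-226-abcde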

r0b1ndot commented 1 year ago

Hi there, I also did some high-load tests and ran a total of 3000 jobs in 2 hours, with a mixture of short- and long-running tasks.

I also noticed those kinds of DNS issues sometimes.

Is there a reason there is no retry? DNS resolution can sometimes flap, so what I would really appreciate is if AWX could retry when DNS resolution is flaky. Is this possible?

So please have a look and suggest a way to mitigate this issue. Thanks!
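(Until there's a proper answer on retries, one way to at least quantify how often resolution of awx-postgres flaps is a simple probe loop from inside the awx-task container; the deployment/container names are assumed to match the earlier comments, and getent is assumed to be present in the image.)

kubectl exec deploy/awx -c awx-task -- /bin/bash -c 'while true; do getent hosts awx-postgres >/dev/null || echo "$(date -Is) awx-postgres failed to resolve"; sleep 5; done'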