Playbooks terminated unexpectedly after 4 hours

spireob commented 2 years ago

Please confirm the following

[X] I agree to follow this project's code of conduct.
[X] I have checked the current issues for duplicates.
[X] I understand that AWX is open source software provided for free and that I might not receive a timely response.

Summary

Playbooks running longer than 4 hours are terminated unexpectedly. The Jobs finish with error state in GUI. The exit code "137" Same issue : closed without resolution. Tested on versions: AWX 19.4.0 and AWX 20.0.0

AWX version

20.0.0

Select the relevant components

[ ] UI
[x] API
[ ] Docs

Installation method

kubernetes

Modifications

yes

Ansible version

core 2.13.0.dev0

Operating system

CentOS8

Web browser

Chrome

Steps to reproduce

hosts: all gather_facts: no

tasks:
- name: Run Job shell: | while(1){ Write-Output "." start-sleep -seconds 1800 } args: executable: /usr/bin/pwsh async: 43200 poll: 900 register: pwsh_output_job ignore_errors: true

Expected results

Playbook completes successfully

Actual results

Container running the job is terminated after running for 4 hours

Additional information

exitCode: 137

spireob commented 2 years ago

Anything new in this topic?

meis4h commented 2 years ago

Hi, we are also seeing this issue on K3s using AWX 19.5.0 and 21.0.0. A few things we observed looking at this:

Jobs fail after the last task over the 4 hour mark has completed
- for example a wait_for task with a timeout of 8 hours causes the job to fail with error after 8 hours, a 5 hour task fails after 5 hours
The limit seems to be pretty much exactly 4 hours as jobs running 3h 50min complete successfully
Jobs are not continued in the background as the Pod is deleted instantly after the job errors
Also happens with job timeout set in the job template

kladiv commented 2 years ago

@spireob maybe you can check below

https://github.com/ansible/awx/issues/9961
https://github.com/ansible/awx/issues/11338 it could be related to that issues.

I've the same issue of 4hrs jobs end with errors (in k3s)

d-rupp commented 2 years ago

We also encounter an issue like this regularly. It seems awx-task just decides that the job is done and kills the pod.

This is what i find in the awx-task log:

2022-05-19 13:20:17,503 INFO     [3abf5855276042c595518de57f670161] awx.main.commands.run_callback_receiver Event processing is finished for Job 15161, sending notifications
2022-05-19 13:20:17,503 INFO     [3abf5855276042c595518de57f670161] awx.main.commands.run_callback_receiver Event processing is finished for Job 15161, sending notifications
2022-05-19 13:20:18,107 DEBUG    [3abf5855276042c595518de57f670161] awx.main.tasks.jobs job 15161 (running) finished running, producing 382 events.
2022-05-19 13:20:18,107 DEBUG    [3abf5855276042c595518de57f670161] awx.main.dispatch task c950003b-4c05-49d1-9b45-43e671098931 starting awx.main.tasks.system.handle_success_and_failure_notifications(*[15161])
2022-05-19 13:20:18,109 DEBUG    [3abf5855276042c595518de57f670161] awx.analytics.job_lifecycle job-15161 post run
2022-05-19 13:20:18,238 DEBUG    [3abf5855276042c595518de57f670161] awx.analytics.job_lifecycle job-15161 finalize run
2022-05-19 13:20:18,342 WARNING  [3abf5855276042c595518de57f670161] awx.main.dispatch job 15161 (error) encountered an error (rc=None), please see task stdout for details.
2022-05-19 13:20:18,345 DEBUG    [3abf5855276042c595518de57f670161] awx.main.tasks.system Executing error task id ecbb37f9-809d-4317-9d01-af93846de8d6, subtasks: [{'type': 'job', 'id': 15161}]

All the while the task output just says "canceled". If there is anything i can help analyse this please tell me what to do.

It is not related to the linked issues above.

*edit: sorry, missing data about the system

AWX: 21.0.0 running on K3S v1.23.6

adpavlov commented 2 years ago

exactly same issue

meis4h commented 2 years ago

can confirm that this also happens on RedHat Ansible Automation Platform 2.1 on OpenShift 4.8

adpavlov commented 2 years ago

can confirm that this also happens on RedHat Ansible Automation Platform 2.1 on OpenShift 4.8

Have you opened a case to RedHat?

meis4h commented 2 years ago

can confirm that this also happens on RedHat Ansible Automation Platform 2.1 on OpenShift 4.8

Have you opened a case to RedHat?

yes but no news yet

adpavlov commented 2 years ago

can confirm that this also happens on RedHat Ansible Automation Platform 2.1 on OpenShift 4.8

Have you opened a case to RedHat?

yes but no news yet

Okay, could you please keep us posted on this case status? Also there should be SLA for paid subscription. Issue is quite critical as for me.

meis4h commented 2 years ago

Okay, could you please keep us posted on this case status? Also there should be SLA for paid subscription. Issue is quite critical as for me.

Will do. In the meantime we could largely work around the issue by splitting the job into multiple separate jobs connected via workflow.

kiril18 commented 2 years ago

I got a similar problem today, after four hours the task fell.

adpavlov commented 2 years ago

@spireob maybe you can check below

https://github.com/ansible/awx/issues/9961

https://github.com/ansible/awx/issues/11338

it could be related to that issues.

I've the same issue of 4hrs jobs end with errors (in k3s)

For my installation I don't believe it's k3s related as I have a 500 Mb limit for logs. More than that I don't even see log files created under /var/log/pods/, just empty folders.

Also I'm using custom EE built with ansible 2.9 as suggested in some @AlanCoding repo, so I believe the issue is not related to ansible-runner, but related to awx-task that seems like have some timeout for waiting output from a task.

cmatsis commented 1 year ago

Same issue on AWX: 21.0.0 running on K3S v1.23.6 :( any workaround to this problem?

stefanpinter commented 1 year ago

same problem with awx 21.1.0 & k3s v1.21.7+k3s1 for now, where I "know" that the last task ends like it should, I re-run the playbook with the remaining tags only

well, I can only assume that the last task ended without error, as I don't see an "ok", "changed" or "failed"....

adpavlov commented 1 year ago

Okay, could you please keep us posted on this case status? Also there should be SLA for paid subscription. Issue is quite critical as for me.

Will do. In the meantime we could largely work around the issue by splitting the job into multiple separate jobs connected via workflow.

@meis4h Is there any news from support?

cmatsis commented 1 year ago

Does this problem occur in the paid version and there is no solution? There are few people working more than 4 hours?

3zAlb commented 1 year ago

We are also having this issue running the latest AWX, k3s, and docker backend. Container log size is set to 500mb and allowed to have 4 files. (Single log file is generated and gets nowhere near 500mb)

This is a pretty big show stopper for long running maintenance playbooks.

Can we get an update on this? This issue has been open since February and i've seen numerous closed issues with the same problem.

NadavShani commented 1 year ago

We are also having this issue running the latest AWX, k3s, and docker backend. Container log size is set to 500mb and allowed to have 4 files. (Single log file is generated and gets nowhere near 500mb)

This is a pretty big show stopper for long running maintenance playbooks.

Can we get an update on this? This issue has been open since February and i've seen numerous closed issues with the same problem.

same here

StefanSpecht commented 1 year ago

We have exactly the same issue.

adpavlov commented 1 year ago

Okay, could you please keep us posted on this case status? Also there should be SLA for paid subscription. Issue is quite critical as for me.

Will do. In the meantime we could largely work around the issue by splitting the job into multiple separate jobs connected via workflow.

@meis4h Is there any news from support?

@meis4h could you please update?

Also lets probably call active developers like @AlanCoding 😅

meis4h commented 1 year ago

@adpavlov sadly there is nothing new to report 😕

lals1 commented 1 year ago

exactly same issue

auracz commented 1 year ago

exactly same issue

backaf commented 1 year ago

Same issue here. Upgraded to the latest AWX version this morning but it still occurs. Use case is running restic prune commands which take a long time to complete.

Env:

on-prem k8s 1.23.5
awx operator 0.28.0
awx 21.5.0

bartowl commented 1 year ago

Same here. Does the apparent 4h limit affect the entire job template, or just a single task? Maybe using async tasks would be a way around here? What is bad is that the automation pod gets removed with all information what has happened, and the task container just throws EOF handler like with normal job end:

2022-09-07 17:14:27,661 INFO     [3c69d3938c974b689cc1cdb49acadf04] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 2076
2022-09-07 17:14:27,661 INFO     [3c69d3938c974b689cc1cdb49acadf04] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 2076
2022-09-07 17:14:28,591 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.tasks.jobs job 2076 (running) finished running, producing 80 events.
2022-09-07 17:14:28,595 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.analytics.job_lifecycle job-2076 post run
2022-09-07 17:14:28,937 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.analytics.job_lifecycle job-2076 finalize run
2022-09-07 17:14:28,957 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.dispatch task f1c62c03-fdca-45d8-8363-a66710f19910 starting awx.main.tasks.system.update_inventory_computed_fields(*[3])
2022-09-07 17:14:28,977 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.models.inventory Going to update inventory computed fields, pk=3
2022-09-07 17:14:28,999 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.models.inventory Finished updating inventory computed fields, pk=3, in 0.022 seconds
2022-09-07 17:14:29,144 WARNING  [3c69d3938c974b689cc1cdb49acadf04] awx.main.dispatch job 2076 (error) encountered an error (rc=None), please see task stdout for details.
2022-09-07 17:14:29,200 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.dispatch task 4671db40-b07f-47e5-8810-bfb76ee45d8d starting awx.main.tasks.system.handle_work_error(*['4671db40-b07f-47e5-8810-bfb76ee45d8d'])
2022-09-07 17:14:29,201 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.tasks.system Executing error task id 4671db40-b07f-47e5-8810-bfb76ee45d8d, subtasks: [{'type': 'job', 'id': 2076}]
2022-09-07 17:14:29,224 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.dispatch task 4671db40-b07f-47e5-8810-bfb76ee45d8d starting awx.main.tasks.system.handle_work_success(*[])
2022-09-07 17:14:29,224 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.dispatch task aefe7820-cf95-4baf-9cfe-f25cf5d5cde4 starting awx.main.scheduler.tasks.run_task_manager(*[])
2022-09-07 17:14:29,224 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.scheduler Running task manager.
2022-09-07 17:14:29,238 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.scheduler Starting Scheduler
2022-09-07 17:14:29,248 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.dispatch task 83ec5dcb-f7c1-4fdc-829e-2bfb1c72e34b starting awx.main.scheduler.tasks.run_task_manager(*[])
2022-09-07 17:14:29,248 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.scheduler Running task manager.
2022-09-07 17:14:29,262 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.scheduler Not running scheduler, another task holds lock
2022-09-07 17:14:29,272 DEBUG    [3c69d3938c974b689cc1cdb49acadf04] awx.main.scheduler Finishing Scheduler

This does not differ from normal job termination due to task error. The thing is, that last task that took so long has no json nor output available. What is weird though, the long running task performed its action and AWX could just run next task. So startint with start-at-task would allow to continue here. I wonder where does the time limit hit.

bartowl commented 1 year ago

Well, async tasks did not help at all... 4:01 and job template fails. Seems that the automation container gets just terminated, but I would not blame k8s for that, since there are many different distributions (i use k3d) and all fail precisely after 4h. My last job with async_status poll every 5 minutes terminated at exact 4:01:22 after 7 retries - could not perform next retry.... I will try to trace that from inside automation pod as well as from k8s perspective... check also source code for that... this is really weird.

bartowl commented 1 year ago

Hmm - I quess I found the reason - kubelet by default terminates streaming connections after 4h. Streaming automation pod logs is a streaming connection. See https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/ and search for "--streaming-connection-idle-timeout". I once again checked all messages in this Issue, and everyone has a variant of k3s / k3d. This options seems to only work for docker backend. K3d uses containerd-shim-runc-v2 and I have found an Issue for containerd exactly regarding this https://github.com/containerd/cri/issues/1057 One might try to increase this timeout somehow... maybe this will be a workaround... But a permanent solution would be to reimplement AWX part that uses those streaming connections to re-try closed ones instead of assuming that the connection got closed == container finished its work. I cannot track that part in the source code :/

adpavlov commented 1 year ago

I would call you a genius:) However, I set this timeout of k8s to 0 (meaning no timeout) in my k3s installation and pod survived 4 hours 27 minutes! But still resulted in Error...

bartowl commented 1 year ago

Sadly yes, as k3s with containerio backend seems to ignore this parameter and you would need to switch to docker backend somehow. I would also if I only knew how, and adding --docker to kubelet arguments what should do the trick does not give any usefull effect besides log entries that this is legacy and will be ignored (used --kubelet-arg=docker@server:0 actually) See also https://github.com/k3s-io/k3s/issues/1936 ... But as already written - the problem is not that kubelet/k3s do not support removing this timeout, the main Problem is, that AWX does not try to re-connect to logs output after the stream gets closed. There should be for sure a retry mechanism and proper verification if the pod just disconnected or is really done and closed. I hope this will be fixed with this Issue. All other options are just vague workarounds... Interestingly this seems also to be the problem with the commercial AWX version, or can it handle jobs running longer than 4h on k3s/k3d with crio backend?

adpavlov commented 1 year ago

I'm using docker backend actually and passed --kubelet-arg=streaming-connection-idle-timeout=0 to k3s, but still no luck

bartowl commented 1 year ago

@adpavlov I hope you do not confuse running k3s under docker as k3d, and docker backend mode for k3s. From what I have seen, when k3s is running in docker mode, all pods are created as docker containers and can be directly seen in docker ps output. I could not manage to get it to that point with k3d even after mounting docker socket in the k3d server container :/ But will keep trying :)

adpavlov commented 1 year ago

for sure I'm using k3s with --docker flag.

kubectl get no -o wide | awk '{print $5, $NF}'
VERSION CONTAINER-RUNTIME
v1.24.4+k3s1 docker://20.10.12
v1.24.4+k3s1 docker://20.10.12

And snippet from k3s.service:

ExecStart=/usr/local/bin/k3s \
    server \
    '--docker' \
    '--disable=traefik' \
    '--write-kubeconfig-mode' \
    '--kubelet-arg=eviction-hard=imagefs.available<1%,nodefs.available<1%' \
    '--kubelet-arg=eviction-minimum-reclaim=imagefs.available=1%,nodefs.available=1%' \
    '--kubelet-arg=image-gc-high-threshold=97' \
    '--kubelet-arg=image-gc-low-threshold=95' \
    '--kubelet-arg=image-gc-low-threshold=95' \
    '--kubelet-arg=container-log-max-size=100Mi' \
    '--kubelet-arg=container-log-max-files=5' \
    '--kubelet-arg=streaming-connection-idle-timeout=0' \

bartowl commented 1 year ago

that is bad, i hoped this will help as in https://github.com/containerd/cri/issues/1057#issue-414329803 but this post references actually kubernetes from version v1.10.13 which is looong time not valid anymore. The entire mentioned section seems to got rid, as in current version the kubelet.go looks totally different. Will look for other possible reasons.

bartowl commented 1 year ago

kowever in the current version i find mentions of streamingConnectionIdleTimeout at many places, so this timeout should be configurable and not only docker-backend-specific, as this block vanished... (see https://github.com/kubernetes/kubernetes/blob/release-1.25/pkg/kubelet/kubelet.go#L499) and search for other occurences... Will try instead of disabling it to double it. running now with streaming-connection-idle-timeout=28800s (without unit it thrown errors:

WARN[0013] warning: encountered fatal log from node k3d-awx-server-0 (retrying 0/10): ▒time="2022-09-13T21:07:14Z" level=fatal msg="kubelet exited: failed to parse kubelet flag: invalid argument \"28800\" for \"--streaming-connection-idle-timeout\" flag: time: missing unit in duration \"28800\""

will see in 4h...

bartowl commented 1 year ago

Setting the timeout higher might however have a chance to work, as setting it to 0 causes it to be set to default 4h again: see https://github.com/kubernetes/kubernetes/blob/release-1.25/pkg/kubelet/apis/config/v1beta1/defaults.go#L116

kladiv commented 1 year ago

@bartowl thanks for you tests.
Please keep us informed if --kubelet-arg=streaming-connection-idle-timeout=<VERY_HIGH_VALUE>s works

adpavlov commented 1 year ago

I have weird results with --kubelet-arg=streaming-connection-idle-timeout=0

Following task checks date every 30 minutes and playbook resulted in error right after 4 hours

    - name: Test
      ansible.builtin.shell: |
        date
      retries: 48
      delay: 1800

However, just wait_for task with 5 hours timeout was running exactly 5 hours, but again exited with error state

    - name:Test
      ansible.builtin.wait_for:
        timeout: 18000

bartowl commented 1 year ago

extending this to 28800s did not help, first interaction after 4h breaks the job. The 5 hours is just the first that happened after 4h...

kurokobo commented 1 year ago

I have no idea if this is helpful since this issue can't be reproduced on my side and have not digged into deeper, but the containerd also has stream_idle_timeout with 4 hours as default: https://github.com/containerd/containerd/blob/main/docs/cri/config.md#full-configuration

  # stream_idle_timeout is the maximum time a streaming connection can be
  # idle before the connection is automatically closed.
  # The string is in the golang duration format, see:
  #   https://golang.org/pkg/time/#ParseDuration
  stream_idle_timeout = "4h"

If you're on K3s, you can see this:

$ sudo $(which k3s) crictl info | jq '.config.streamIdleTimeout'
"4h0m0s"

You can modify this value by:

# Copy existing config.toml as config.toml.tmpl
sudo cp /var/lib/rancher/k3s/agent/etc/containerd/{config.toml,config.toml.tmpl}

# Append 'stream_idle_timeout = "<value>"' under '[plugins.cri]'
sudo vi /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
# [plugins.cri]
#   stream_server_address = "127.0.0.1"
#   stream_server_port = "10010"
#   stream_idle_timeout = "10h"     👈👈👈

# Restart K3s
sudo systemctl restart k3s

$ sudo $(which k3s) crictl info | jq '.config.streamIdleTimeout'
"10h"

F.Y.I., In not K3s but AKS, the default idle timeout is extremely short (4min), so I made a dirty workaround: https://github.com/ansible/awx/issues/12530#issuecomment-1192616101

If this issue occurs with long idling for log stream (means no logs sent from ee for a long time), the same workaround could be used I guess. I think this issue should be solved by the Ansible Runner or Receptor.

bartowl commented 1 year ago

Thanks @kurokobo for sharing this information. In my setup indeed, despite running k3s server with --kubelet-arg=streaming-connection-idle-timeout=28800s, this value also is set to 4h:

/ # crictl info | grep Idle
    "streamIdleTimeout": "4h0m0s",

But as far as I understand, this is a timeout, when no new line in pod logs is generated over a certain time. My playbook uses either async_status to periodically poll for progress (therefore generates log lines) or pause: minutes=10 that is just repeated with with_sequence: count=300... So I guess this is not the case. Nevertheless I will try to increase this one as well and see when something changes...

What makes me unhappy is the comment in this like: https://github.com/kubernetes/kubernetes/blob/v1.25.0/cmd/kubelet/app/options/options.go#L451 saying:

Maximum time a streaming connection can be idle before the connection is automatically closed. 0 indicates no timeout. Example: '5m'. Note: All connections to the kubelet server have a maximum duration of 4 hours.

Does this mean, that despite setting of this parameter there is another hardcoded maximum duration of 4h?

I guess it is time to focus on the AWX part and implementing a retry mechanism instead of increasing timeouts, which is from security perspective also not recommended...

bartowl commented 1 year ago

Ok, I have bad news - the 4h timeout is really hardcoded in kubelet. There is no way around it: See: https://github.com/kubernetes/kubernetes/blob/v1.25.1/pkg/kubelet/server/server.go#L161 and https://github.com/kubernetes/kubernetes/issues/104595 and https://github.com/kubernetes/kubernetes/pull/104735/files

The only way to get over 4h is to implement retry mechanism in AWX. My python knowledge is way too little to do that :( I Tried to find the spot in code where it spawns the container and how it fetches its output, but got lost with the receptor, callbacks and stuff. This might be not so easy to implement, but I cannot imagine that such product as AWX does not allow a job to run over 4 hours.

nicolasbouchard-ubi commented 1 year ago

Just to weight in, we have the exact same problem. We have a long-running task that poll an API to fetch the status of builds that can take more than 4 hours to complete.

This is a big disapointment for us and makes AWX unusable for our use cases. Any workaround on how to implement such task would be greatly appreciated. I would be willing to contribute to solve the problem in AWX, but would need help to start off.

adpavlov commented 1 year ago

so I made a dirty workaround: #12530 (comment)

This didn't help in k3s...

sylvain-de-fuster commented 1 year ago

Hello,

As many here, we have the same behaviour on our side. (AWX 21.5.0 with k3s) The informations given before don't give a good perspective for a quick resolution.

We have some issues in our migration tests but this one is on top. We don't have a lot of long duration jobs but they are very important.

Is there anybody with a workaround for long duration tasks ? How do you proceed in the meantime ?

Thank you all.

bartowl commented 1 year ago

The only workaround that worked for me was to create a workflow, and split the task in multiple jobs. It does not even require multiple job templates if you work smart with tags, marking some tasks with tag like step1, next with step2 and so on, and than include the same job template in workflow multiple times, each time with different tag. Passing variables between different steps can be realised with ansible.builtin.set_stats, yet this is still cumbersome and problematic with single task that might run longer than 4h. For such a single task you have to use poll: 0, async: xxx, pass registered variable via set_stats and from optionally next step in workflow query the progress with async_status...

This is doable, but the only way to get around this is to redesign the part where the automation container is started. Instead reading its output as it is now, it has to be read in a kind of while true loop, until the container really finishes. Now, the container gets aborted when the https connection to kubectl gets disconnected.

sylvain-de-fuster commented 1 year ago

The only workaround that worked for me was to create a workflow, and split the task in multiple jobs. It does not even require multiple job templates if you work smart with tags, marking some tasks with tag like step1, next with step2 and so on, and than include the same job template in workflow multiple times, each time with different tag. Passing variables between different steps can be realised with ansible.builtin.set_stats, yet this is still cumbersome and problematic with single task that might run longer than 4h. For such a single task you have to use poll: 0, async: xxx, pass registered variable via set_stats and from optionally next step in workflow query the progress with async_status...

This is doable, but the only way to get around this is to redesign the part where the automation container is started. Instead reading its output as it is now, it has to be read in a kind of while true loop, until the container really finishes. Now, the container gets aborted when the https connection to kubectl gets disconnected.

Thanks for your reply. This is a very interesting workaround. We will experiment with that way and see if this is compatible and not too painful for our users's workcases.

arcsurf commented 1 year ago

@adpavlov sadly there is nothing new to report 😕

Hi adpavlov, before all, thank you for sharing your experience. I was wondering if maybe you get any workaround or answer. I keep trying but I can't find the solution. Now I'm using CRIO-O as container runtime, I see on other post that somebody tried with Docker runtime too and it didn't work. Thank you.

adpavlov commented 1 year ago

Unfortunately not. All proposed workarounds simply have no positive effect. Probably @meis4h got some response from redhat support? I bet that SLA already violated:)

arcsurf commented 1 year ago

I'm sorry @adpavlov, Yes I was trying to ask to @meis4h. Maybe @meis4h got some from Redhat.

Thank you everyone.

bartowl commented 1 year ago

after looking more deeply into this issue, AWX uses receptor to handle running k8s pods. It might be, that the fix that we need is either by calling receptor in a different way, or even the issue should be reported against receptor. In particular, receptor is being called from run_until_complete method, which submits the work request to receptor, and queries for status. It also allocates some sockets for bidirectional communication in _run_internal and this is what I'm afraid is running into timeout.

So basically, AWX uses external projects like receptor or kubernetes, kubernetes declared static 4h hardcoded timeouts, and receptor seems to be breaking off after this time. Now the big question is - who should fix this issue and in which part of code? Is it receptor that needs fixing, or maybe the way AWX uses it?

One has to consider, that AWX needs bi-directional communication with running automation pod for example in order to interactively pass passwords and so on. On the other hand, it should implement some re-attach mechanism after the connection breakes, the pod has to run even if the connection brakes. This addresses the reaper not to reap such detached pods. Also, once AWX re-connects to running pod, it needs to figure out till when it read the output previously, and continue from that moment in order to update web pages looking at the progress without skipping/doubling tasks... This is way beyond trivial task, at least for what it looks to me.

ansible / awx