ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Job continuously fails on additional replica/instance of AWX in Kubernetes Cluster #5566

Closed vibinm closed 4 years ago

vibinm commented 4 years ago
ISSUE TYPE

Job continuously fails on additional replica/instance of AWX in Kubernetes Cluster

SUMMARY

Jobs that are scheduled or run manually on the second AWX instance (the awx-1 replica) hosted on the Kubernetes cluster fail continuously.

ENVIRONMENT
STEPS TO REPRODUCE

Deploy AWX on a Kubernetes cluster and scale it to two or more replicas.

Now try scheduling jobs on one of the scaled instances, for example awx-1.

EXPECTED RESULTS

Jobs are expected to be scheduled and to run successfully.

ACTUAL RESULTS
ADDITIONAL INFORMATION

Jobs fail with this message in the web UI:

Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.

And in the task container logs, these are the errors:

2019-12-23 12:06:49,341 DEBUG awx.main.dispatch publish awx.main.tasks.cluster_node_heartbeat(5bc7544e-d5c7-44ac-9df0-335bc508b3b8, queue=awx-1)
[2019-12-23 12:06:49,341: DEBUG/Process-1] publish awx.main.tasks.cluster_node_heartbeat(5bc7544e-d5c7-44ac-9df0-335bc508b3b8, queue=awx-1)
2019-12-23 12:06:49,501 DEBUG awx.main.models.mixins No credential configured to post back webhook status, skipping.
2019-12-23 12:06:49,501 ERROR awx.main.dispatch job 946 (failed) is no longer running; reaping
2019-12-23 12:06:49,503 DEBUG awx.main.dispatch delivered 5bc7544e-d5c7-44ac-9df0-335bc508b3b8 to worker[182] qsize 0
2019-12-23 12:06:49,504 DEBUG awx.main.dispatch task 5bc7544e-d5c7-44ac-9df0-335bc508b3b8 starting awx.main.tasks.cluster_node_heartbeat([])
2019-12-23 12:06:49,505 DEBUG awx.main.tasks Cluster node heartbeat task.
2019-12-23 12:06:49,523 DEBUG awx.main.dispatch publish awx.main.tasks.awx_k8s_reaper(e3da2008-852d-4c61-ae0e-5cf7f9212f10, queue=awx-1)
[2019-12-23 12:06:49,523: DEBUG/Process-1] publish awx.main.tasks.awx_k8s_reaper(e3da2008-852d-4c61-ae0e-5cf7f9212f10, queue=awx-1)
2019-12-23 12:06:49,537 DEBUG awx.main.dispatch task 5bc7544e-d5c7-44ac-9df0-335bc508b3b8 is finished
2019-12-23 12:06:49,538 DEBUG awx.main.dispatch delivered e3da2008-852d-4c61-ae0e-5cf7f9212f10 to worker[183] qsize 0
2019-12-23 12:06:49,540 DEBUG awx.main.dispatch task e3da2008-852d-4c61-ae0e-5cf7f9212f10 starting awx.main.tasks.awx_k8s_reaper([])
2019-12-23 12:06:49,562 DEBUG awx.main.dispatch task e3da2008-852d-4c61-ae0e-5cf7f9212f10 is finished
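For anyone triaging a similar log, the only ERROR entry above is the dispatcher reaping job 946. A minimal Python sketch for pulling such entries out of a task container log might look like the following. The function name `find_reaped_jobs` and the `SAMPLE` string are hypothetical, and the pattern is inferred only from the message format shown in this log, not from the AWX source, so it may need adjusting for other AWX versions.

```python
import re

# Pattern inferred from the log line in this issue (an assumption, not an AWX API):
# "<timestamp> ERROR awx.main.dispatch job <id> (<status>) is no longer running; reaping"
REAPED = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) ERROR awx\.main\.dispatch "
    r"job (?P<job_id>\d+) \((?P<status>\w+)\) is no longer running; reaping"
)


def find_reaped_jobs(log_text):
    """Return a (timestamp, job_id, status) tuple for every reaped-job entry."""
    # finditer scans the whole text, so it works whether the log has one
    # entry per line or has been flattened into a single line.
    return [
        (m["ts"], int(m["job_id"]), m["status"])
        for m in REAPED.finditer(log_text)
    ]


# Three entries copied from the log in this issue.
SAMPLE = (
    "2019-12-23 12:06:49,501 DEBUG awx.main.models.mixins No credential "
    "configured to post back webhook status, skipping.\n"
    "2019-12-23 12:06:49,501 ERROR awx.main.dispatch job 946 (failed) "
    "is no longer running; reaping\n"
    "2019-12-23 12:06:49,505 DEBUG awx.main.tasks Cluster node heartbeat task.\n"
)

if __name__ == "__main__":
    for ts, job_id, status in find_reaped_jobs(SAMPLE):
        print(f"{ts}: job {job_id} ({status}) was reaped")
```

Reaped jobs that keep appearing for a single replica's queue (here awx-1) would point at that node rather than at the job itself, which matches the behaviour described in this issue.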

Please let me know if you need further details to help identify or fix the issue.

Regards, Vibin

wenottingham commented 4 years ago

How did you scale to additional replicas? Stateful sets or something else?

vibinm commented 4 years ago

Hi,

I scaled it as a StatefulSet, and I'm happy to say that the issue is fixed.

The issue was caused by RabbitMQ (RMQ) clustering failing because of SELinux restrictions on the container.

I will close this issue.