ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

awx-ee container returns false when sending service advertisement #13349

Open ktomaszx opened 1 year ago

ktomaszx commented 1 year ago

Bug Summary

Hi, in a freshly deployed AWX instance the awx-ee container returns the following debug log:

DEBUG 2022/12/19 09:25:59 Client connected to control service @
DEBUG 2022/12/19 09:25:59 Control service closed
DEBUG 2022/12/19 09:25:59 Client disconnected from control service @
DEBUG 2022/12/19 09:26:14 Sending service advertisement: &{awx-fsdf34s3a-fns26 control 2022-12-19 09:26:14.60991136 +0000 UTC m=+1085.122714070 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]

Do you have any idea what the reason for the false status is?
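For readers trying to interpret the advertisement line, here is a minimal Python sketch that pulls out the trailing `{name flag}` pairs. It assumes those pairs are the advertised work types, each followed by a boolean. As far as can be told from Receptor's advertisement format, that boolean is a per-work-type setting (whether submitted work must be signed) rather than a health status, but that reading is an assumption and is not confirmed in this thread. The function name `advertised_work_types` is made up for illustration.

```python
import re

# Matches the trailing "{name true|false}" pairs in the advertisement line.
# Assumption: each pair is a work type plus a boolean setting, not a health check.
PAIR = re.compile(r"\{([\w-]+) (true|false)\}")


def advertised_work_types(log_line):
    """Return {work_type: flag} parsed from a 'Sending service advertisement' line."""
    return {name: flag == "true" for name, flag in PAIR.findall(log_line)}


# The advertisement line from the report above, split only for readability.
line = (
    "DEBUG 2022/12/19 09:26:14 Sending service advertisement: "
    "&{awx-fsdf34s3a-fns26 control 2022-12-19 09:26:14.60991136 +0000 UTC "
    "m=+1085.122714070 1 map[type:Control Service] "
    "[{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]"
)

for name, flag in advertised_work_types(line).items():
    print(f"{name}: {flag}")
```

If that reading is right, three work types are being advertised normally, and the word "false" on its own does not indicate a failure.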

AWX version

21.9.0

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

The AWX instance is deployed with helm:

helm repo add awx-operator https://ansible.github.io/awx-operator/
helm upgrade --install test-awx-operator operator /awx-operator -n awx --create-namespace -f helm/values-prod.yaml

The values file contains only postgres_storage_class and ingress configuration.

Expected results

Execution Environments is working correctly and accepts projects to be run on it

Actual results

No Execution Environments in AWX. EE containers return false when sending service advertisement

Additional information

No response

shanemcd commented 1 year ago

Hello - is there a reason you think the word "false" in this log message is the source of the problem you are seeing? I think this might be a red herring.

Execution Environments is working correctly and accepts projects to be run on it

Are you saying that you are unable to run projects? What behavior are you seeing? Are they stuck in Pending? Failing? Please expound.

No Execution Environments in AWX.

Can you take a look at the logs for the operator? It handles registering the default EEs.

I'll also note that the control plane ee (where project updates run) is an "always-on" container inside of the control plane pod. It is inherently different from the one-off pods spun up to run jobs. It sounds like you might be experiencing multiple problems here - or a more fundamental problem unrelated to the type of job.

ktomaszx commented 1 year ago

Hi @shanemcd, you may be right that the word "false" is a red herring, but to be honest I do not have any other lead.

Are you saying that you are unable to run projects? What behavior are you seeing? Are they stuck in Pending? Failing?

Yes, right after saving the project, syncing starts and fails with: Execution Environment Missing resource

or in logs

RuntimeError: The project could not sync because there is no Execution Environment.

Can you take a look at the logs for the operator? It handles registering the default EEs.

This could be it! I found an error in the task "Register default execution environments": "Error while proxying request","error":"error dialing backend: x509: certificate has expired or is not yet valid:

I assume that this is a k8s node side error. Right? Thanks a lot for guiding me!
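The x509 message means the current time on the verifying side falls outside the certificate's validity window, which can come from an expired certificate or from clock skew on a node. Below is a minimal Python sketch of that check. It assumes the two date strings were copied from a certificate dump in the usual OpenSSL text format (for example `Dec  1 00:00:00 2022 GMT`); the function name `cert_validity` and the sample dates are made up for illustration and do not come from this cluster.

```python
import ssl
from datetime import datetime, timezone


def cert_validity(not_before, not_after, now=None):
    """Classify a certificate window as 'not yet valid', 'expired' or 'valid'.

    not_before / not_after use the OpenSSL text format, e.g.
    'Dec  1 00:00:00 2022 GMT'. `now` must be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    now_ts = now.timestamp()
    if now_ts < ssl.cert_time_to_seconds(not_before):
        return "not yet valid"
    if now_ts > ssl.cert_time_to_seconds(not_after):
        return "expired"
    return "valid"


# Hypothetical dates, checked against the time of the log lines in this issue.
when = datetime(2022, 12, 19, 9, 26, tzinfo=timezone.utc)
print(cert_validity("Dec  1 00:00:00 2021 GMT", "Dec  1 00:00:00 2022 GMT", when))
```

A "not yet valid" result with an otherwise fresh certificate usually points at the node clock rather than at the certificate itself.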

fosterseth commented 1 year ago

| I assume that this is a k8s node side error. Right?

Yes, this sounds like it is on the k8s side. Do you get this error consistently after retrying the deployment?

nabheet commented 1 year ago

Interestingly, I have been getting this error consistently while trying to run a workflow job with various steps, a few of which have many job slices (like 30). I am not sure what information to provide to help debug this error further. I do not know if this is because I am initiating too many slices or something else. Any advice would be appreciated. I can also open a new issue; I just don't know how, or what information to collect and provide.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 435, in _run_internal
    lines = resultfile.readlines()
OSError: read() should have returned a bytes object, not 'NoneType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 604, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 317, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 444, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Error streaming stdin to pod awx/automation-job-14930-sclwd. Error: error dialing backend: remote error: tls: internal error

fosterseth commented 1 year ago

@nabheet can you provide awx-ee container logs (in the awx-task pod)? any errors?

nabheet commented 1 year ago

So I ended up adding the following annotation to the AWX-EE Pod Spec which seems to have prevented this issue for now:

annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

BTW, I think I forgot to mention that I am using the container instance group thingamajigs.

yuanyuefeng commented 1 year ago

@nabheet, can you share the steps for adding the annotation there?
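The thread does not record the exact steps. One plausible route, given the mention of container instance groups, is the custom pod spec override on the container group. The Python sketch below only shows where the annotation sits in a pod spec: it adds it under `metadata.annotations` and prints JSON that can be pasted into a pod spec field. The `base_pod_spec` contents (namespace, image, args) are illustrative assumptions, not nabheet's actual configuration, and `with_safe_to_evict_false` is a made-up helper name.

```python
import copy
import json

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def with_safe_to_evict_false(pod_spec):
    """Return a copy of pod_spec with the safe-to-evict annotation set to "false"."""
    spec = copy.deepcopy(pod_spec)
    annotations = spec.setdefault("metadata", {}).setdefault("annotations", {})
    # Annotation values must be strings, hence "false" rather than False.
    annotations[SAFE_TO_EVICT] = "false"
    return spec


# Illustrative starting point only; replace with the pod spec your group uses.
base_pod_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"namespace": "awx"},
    "spec": {
        "containers": [
            {
                "name": "worker",
                "image": "quay.io/ansible/awx-ee:latest",
                "args": ["ansible-runner", "worker", "--private-data-dir=/runner"],
            }
        ]
    },
}

print(json.dumps(with_safe_to_evict_false(base_pod_spec), indent=2))
```

The annotation tells the cluster autoscaler not to evict the job pod during scale-down, which matches nabheet's observation that the streaming error stopped once it was set.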