Open ktomaszx opened 1 year ago
Hello - is there a reason you think the word "false" in this log message is the source of the problem you are seeing? I think this might be a red herring.
| Execution Environments are working correctly and accept projects to be run on them
Are you saying that you are unable to run projects? What behavior are you seeing? Are they stuck in Pending? Failing? Please expound.
| No Execution Environments in AWX.
Can you take a look at the logs for the operator? It handles registering the default EEs.
I'll also note that the control plane ee (where project updates run) is an "always-on" container inside of the control plane pod. It is inherently different from the one-off pods spun up to run jobs. It sounds like you might be experiencing multiple problems here - or a more fundamental problem unrelated to the type of job.
Hi @shanemcd, you may be right that the word "false" is a red herring, but to be honest I do not have any other lead.
| Are you saying that you are unable to run projects? What behavior are you seeing? Are they stuck in Pending? Failing?
Yes. Right after I save a project, syncing starts and fails with:
Execution Environment Missing resource
or, in the logs:
RuntimeError: The project could not sync because there is no Execution Environment.
| Can you take a look at the logs for the operator? It handles registering the default EEs.
This could be it!
I found an error in the task "Register default execution environments":
"Error while proxying request","error":"error dialing backend: x509: certificate has expired or is not yet valid:
I assume that this is a k8s node-side error, right? Thanks a lot for guiding me!
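The x509 message above covers two cases: the certificate's validity window has ended, or it has not started yet (which can also point to clock skew on a node). A minimal sketch of that check in Python, assuming the two validity dates have already been read from the certificate; the dates below are hypothetical and not taken from this deployment:

```python
import ssl
from datetime import datetime, timezone


def cert_window_status(not_before: str, not_after: str, now: datetime) -> str:
    """Classify `now` against a certificate validity window.

    The date strings use the format Python's ssl module reports,
    e.g. "Dec 19 00:00:00 2022 GMT".
    """
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    current = now.timestamp()
    if current < start:
        return "not yet valid"
    if current > end:
        return "expired"
    return "valid"


# Hypothetical validity window, for illustration only.
NOT_BEFORE = "Dec 19 00:00:00 2021 GMT"
NOT_AFTER = "Dec 19 00:00:00 2022 GMT"

status = cert_window_status(
    NOT_BEFORE, NOT_AFTER, datetime(2022, 12, 19, 9, 26, tzinfo=timezone.utc)
)
print(status)
```

Passing the node's own clock reading as `now` separates a genuinely expired certificate from a node whose time is wrong.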
| I assume that this is a k8s node side error. Right?
Yes, this sounds like a k8s-side problem. Do you get this error consistently after retrying the deployment?
Interestingly, I have been getting this error consistently while trying to run a workflow job with various steps, a few of which have many job slices (around 30). I am not sure what information to provide to help debug it further, or whether it happens because I am initiating too many slices or for some other reason. Any advice would be appreciated. I can also open a new issue; I just don't know what information to collect and provide.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 435, in _run_internal
    lines = resultfile.readlines()
OSError: read() should have returned a bytes object, not 'NoneType'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/jobs.py", line 604, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 317, in run
    res = self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/receptor.py", line 444, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Error streaming stdin to pod awx/automation-job-14930-sclwd. Error: error dialing backend: remote error: tls: internal error
@nabheet, can you provide the awx-ee container logs (in the awx-task pod)? Are there any errors?
So I ended up adding the following annotation to the AWX-EE Pod Spec which seems to have prevented this issue for now:
annotations:
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
BTW, I think I forgot to mention that I am using the container instance group thingamajigs.
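A rough sketch of how such an annotation could be merged into a pod spec before pasting it into a container group's custom pod spec. Whether this matches the exact steps used above is an assumption; the base spec and the `with_annotation` helper are hypothetical and only illustrate where the annotation sits in the Pod manifest:

```python
import copy
import json


def with_annotation(pod_spec: dict, key: str, value: str) -> dict:
    """Return a copy of pod_spec with metadata.annotations[key] set to value."""
    patched = copy.deepcopy(pod_spec)
    metadata = patched.setdefault("metadata", {})
    annotations = metadata.setdefault("annotations", {})
    annotations[key] = value  # Kubernetes annotation values must be strings
    return patched


# Hypothetical minimal pod spec; a real one carries more fields.
base_spec = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"namespace": "awx"},
    "spec": {
        "containers": [
            {"name": "worker", "image": "quay.io/ansible/awx-ee:latest"}
        ]
    },
}

patched_spec = with_annotation(
    base_spec, "cluster-autoscaler.kubernetes.io/safe-to-evict", "false"
)
print(json.dumps(patched_spec, indent=2))
```

The annotation belongs under `metadata`, not under `spec`, and its value is the string "false" rather than a boolean.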
@nabheet, can you share the steps for adding the annotation there?
Bug Summary
Hi, in a freshly deployed AWX instance, the awx-ee container emits the following debug log:
DEBUG 2022/12/19 09:25:59 Client connected to control service @
DEBUG 2022/12/19 09:25:59 Control service closed
DEBUG 2022/12/19 09:25:59 Client disconnected from control service @
DEBUG 2022/12/19 09:26:14 Sending service advertisement: &{awx-fsdf34s3a-fns26 control 2022-12-19 09:26:14.60991136 +0000 UTC m=+1085.122714070 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]
Do you have any idea what causes the false status?
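For reference, the name/flag pairs at the end of the advertisement can be pulled out with a short Python sketch. It only extracts the pairs as logged; it does not interpret what the boolean means, and the `parse_work_types` helper name is hypothetical:

```python
import re


def parse_work_types(advertisement: str) -> dict:
    """Extract {name: flag} pairs from the trailing [{name bool} ...] list."""
    pairs = re.findall(r"\{([\w-]+) (true|false)\}", advertisement)
    return {name: flag == "true" for name, flag in pairs}


# The advertisement line exactly as it appears in the log above.
log_line = (
    "DEBUG 2022/12/19 09:26:14 Sending service advertisement: "
    "&{awx-fsdf34s3a-fns26 control 2022-12-19 09:26:14.60991136 +0000 UTC "
    "m=+1085.122714070 1 map[type:Control Service] "
    "[{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]"
)

work_types = parse_work_types(log_line)
print(work_types)
```

All three entries carry the same flag, which suggests it is a per-entry attribute rather than an error indicator for one of them.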
AWX version
21.9.0
Select the relevant components
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
The AWX instance is deployed with helm:
helm repo add awx-operator https://ansible.github.io/awx-operator/
helm upgrade --install test-awx-operator operator /awx-operator -n awx --create-namespace -f helm/values-prod.yaml
The values file contains only postgres_storage_class and ingress configuration.
Expected results
Execution Environments are working correctly and accept projects to be run on them
Actual results
No Execution Environments in AWX. The EE container reports false for every entry when sending the service advertisement
Additional information
No response