Open sgreinerCNS opened 3 weeks ago
Encountering the same issue. It can be reproduced by draining the node the pods are running on: the failure occurs on first boot on the new node. Recreating the pod on the new node restores functionality.
k8s info:
Client Version: v1.28.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.10
AWX Resource Details
Labels: app.kubernetes.io/component=awx
        app.kubernetes.io/managed-by=awx-operator
        app.kubernetes.io/operator-version=2.12.2
        app.kubernetes.io/part-of=awx-prod
Annotations: <none>
API Version: awx.ansible.com/v1beta1
Kind: AWX
Metadata:
  Creation Timestamp: 2024-04-08T18:06:54Z
  Generation: 2
  Resource Version: 79405586
Some extra configuration that might be relevant:
web_extra_env:
  - name: LDAPTLS_CACERT
    value: /etc/pki/ca-trust/source/anchors/bundle-ca.crt
The file above, inside the container, is the CA for a local LDAP domain.
Status:
  Admin Password Secret: <redact>
  Admin User: <redact>
  Broadcast Websocket Secret: <redact>
  Conditions:
    Last Transition Time: 2024-10-14T12:51:29Z
    Reason:
    Status: False
    Type: Failure
    Last Transition Time: 2024-10-14T12:50:18Z
    Reason: Successful
    Status: True
    Type: Running
    Last Transition Time: 2024-10-14T13:16:05Z
    Reason: Successful
    Status: True
    Type: Successful
  Image: quay.io/ansible/awx:23.9.0
  Postgres Configuration Secret: <redact>
  Secret Key Secret: <redact>
  Version: 23.9.0
Our situation was similar, also involving an LDAPS CA and a CA bundle (required because of TLS deep inspection by security appliances).
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: cns-awx
  namespace: awx
spec:
  image_pull_policy: Always
  control_plane_ee_image: quay.io/ansible/awx-ee:23.3.0
  init_container_image: quay.io/ansible/awx-ee
  init_container_image_version: 24.6.1
  ingress_type: Ingress
  hostname: <redact>
  ingress_annotations: ""
  ingress_tls_secret: <redact>
  admin_user: <redact>
  admin_email: <redact>
  admin_password_secret: <redact>
  web_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  task_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  ldap_cacert_secret: <redact>
  bundle_cacert_secret: <redact>
  secret_key_secret: <redact>
  projects_persistence: true
  projects_existing_claim: cns-awx-storage-projects-claim
  postgres_storage_requirements:
    requests:
      storage: 4Gi
  postgres_storage_class: postgres
The ldap_cacert_secret setting gets the "file" ldap-ca.crt and bundle_cacert_secret gets the "file" bundle-ca.crt, each supplied via a secret.
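For readers unfamiliar with how those two "files" are supplied, here is a minimal sketch of such a Secret. It assumes both spec fields point at the same Secret, which may not match the setup above since the names are redacted. The Secret name `cns-awx-custom-certs` is hypothetical, and the PEM bodies are placeholders. Only the key names `ldap-ca.crt` and `bundle-ca.crt` come from this thread.

```yaml
# Sketch only: the Secret name is hypothetical and the certificate bodies are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: cns-awx-custom-certs   # hypothetical; referenced by ldap_cacert_secret / bundle_cacert_secret
  namespace: awx
type: Opaque
stringData:
  # CA that signed the LDAPS server certificate
  ldap-ca.crt: |
    -----BEGIN CERTIFICATE-----
    (PEM body of the LDAP CA)
    -----END CERTIFICATE-----
  # Additional CA bundle, e.g. for a TLS-inspecting appliance
  bundle-ca.crt: |
    -----BEGIN CERTIFICATE-----
    (PEM body of the bundle CA)
    -----END CERTIFICATE-----
```

Using `stringData` avoids base64-encoding the PEM by hand; the API server stores it under `data` in encoded form.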
By setting init_container_image and pinning init_container_image_version to 24.6.1, I was able to avoid the buggy awx-ee:latest, which for some reason cannot set ca-certificates.crt.
The awx-web and awx-task Kubernetes pods stop working with Init:CrashLoopBackOff.
The cause was the init container's image, quay.io/ansible/awx-ee:latest:
ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied
I manually edited the deployments to use quay.io/ansible/awx-ee:24.6.1 instead, and the pods came up again. Unfortunately, the awx-operator wants to change it back to the broken latest tag.
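Because the operator reconciles the Deployments from the AWX resource, a pin that should survive reconciliation probably has to live in the AWX spec rather than in the Deployments. A minimal sketch of that fragment follows. It reuses the resource name, namespace, and the two field names from the manifest posted in this thread, and assumes the installed operator version honours both fields.

```yaml
# Sketch only: name and namespace are taken from the manifest in this thread; adjust to your instance.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: cns-awx
  namespace: awx
spec:
  # Pin the init container image instead of relying on the latest tag
  init_container_image: quay.io/ansible/awx-ee
  init_container_image_version: 24.6.1
```

Once the operator has reconciled the change, the awx-web and awx-task Deployments should reference the pinned tag without further manual edits.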