Open sgreinerCNS opened 2 months ago
Encountering the same issue. It can be reproduced by draining the node the pods are running on: the failure occurs on first boot on the new node. Recreating the pod on the new node restores functionality.
k8s info:
Client Version: v1.28.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.10
AWX Resource Details
Labels: app.kubernetes.io/component=awx
app.kubernetes.io/managed-by=awx-operator
app.kubernetes.io/operator-version=2.12.2
app.kubernetes.io/part-of=awx-prod
Annotations: <none>
API Version: awx.ansible.com/v1beta1
Kind: AWX
Metadata:
Creation Timestamp: 2024-04-08T18:06:54Z
Generation: 2
Resource Version: 79405586
Some extra configuration that might be relevant:
web_extra_env:
  - name: LDAPTLS_CACERT
    value: /etc/pki/ca-trust/source/anchors/bundle-ca.crt
The file above, inside the container, is the CA for a local LDAP domain.
Status:
  Admin Password Secret: <redact>
  Admin User: <redact>
  Broadcast Websocket Secret: <redact>
  Conditions:
    Last Transition Time: 2024-10-14T12:51:29Z
    Reason:
    Status: False
    Type: Failure
    Last Transition Time: 2024-10-14T12:50:18Z
    Reason: Successful
    Status: True
    Type: Running
    Last Transition Time: 2024-10-14T13:16:05Z
    Reason: Successful
    Status: True
    Type: Successful
  Image: quay.io/ansible/awx:23.9.0
  Postgres Configuration Secret: <redact>
  Secret Key Secret: <redact>
  Version: 23.9.0
Our situation was similar, also involving an LDAPS CA and a CA bundle (required because of TLS deep inspection by security appliances).
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: cns-awx
  namespace: awx
spec:
  image_pull_policy: Always
  control_plane_ee_image: quay.io/ansible/awx-ee:23.3.0
  init_container_image: quay.io/ansible/awx-ee
  init_container_image_version: 24.6.1
  ingress_type: Ingress
  hostname: <redact>
  ingress_annotations: ""
  ingress_tls_secret: <redact>
  admin_user: <redact>
  admin_email: <redact>
  admin_password_secret: <redact>
  web_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  task_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  ldap_cacert_secret: <redact>
  bundle_cacert_secret: <redact>
  secret_key_secret: <redact>
  projects_persistence: true
  projects_existing_claim: cns-awx-storage-projects-claim
  postgres_storage_requirements:
    requests:
      storage: 4Gi
  postgres_storage_class: postgres
The ldap_cacert_secret gets the "file" ldap-ca.crt and the bundle_cacert_secret gets the "file" bundle-ca.crt via a secret.
By setting init_container_image and pinning init_container_image_version to 24.6.1, I was able to avoid the buggy awx-ee:latest, which cannot set ca-certificates.crt for some reason.
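Since the pin has to live in the AWX resource rather than in the generated deployments, one way to apply it to an existing installation is a merge patch. This is a minimal sketch, assuming the resource is named cns-awx in namespace awx as in the spec above and that kubectl has access to the cluster. The variable names are illustrative.

```shell
# Names taken from the spec quoted in this thread; adjust to your deployment.
awx_name=cns-awx
awx_namespace=awx

# Merge patch that pins the init container image to the last known-good tag.
patch='{"spec":{"init_container_image":"quay.io/ansible/awx-ee","init_container_image_version":"24.6.1"}}'

# Only talk to the cluster when kubectl is actually available.
if command -v kubectl >/dev/null 2>&1; then
    kubectl patch awx "$awx_name" -n "$awx_namespace" \
        --type merge --request-timeout=10s -p "$patch" \
        || echo "kubectl patch failed; check cluster access" >&2
else
    echo "kubectl not found; the patch would be: $patch"
fi
```

The operator should then roll out awx-web and awx-task with the pinned init container on its next reconcile.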
After digging around for a while, as I've been facing the same problem in a custom EE built from Rocky Linux 9, I found out that the issue is related to changes in the ca-certificates system package. The version 24.6.1 that still works has 2023.2.60_v7.0.306 installed, whereas the current latest is running 2024.2.69_v8.0.303.
After going through the RPM changelog, I noticed that not only have CA certificates been updated, but the update-ca-trust script itself has been greatly changed, as can be seen in the commit history: https://gitlab.com/redhat/centos-stream/rpms/ca-certificates/-/commits/c9s/update-ca-trust
The old script, which is also part of 24.6.1, is very trivial and can still be found here.
The new script, on the other hand, which was introduced here and whose latest version can be found here, is much more complex and does more than the old script.
One key change is that, in addition to simply calling /usr/bin/trust extract a couple of times, it now also tries to execute /usr/bin/ln to create symlinks, specifically those in the directory-hash directory, which causes the issue here due to a lack of permissions. As the script itself explains, p11-kit makes the directory-hash directory unwritable, and because we are running as non-root, we do not have the benefit of CAP_DAC_OVERRIDE.
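The permission problem described above can be reproduced outside of a container. The following is a minimal sketch, assuming a scratch directory stands in for /etc/pki/ca-trust/extracted/pem. The file name bundle.pem and all variable names are illustrative and not taken from the real script.

```shell
# Sketch only: works in a scratch directory, never touches /etc/pki.
scratch=$(mktemp -d)
hash_dir="$scratch/directory-hash"
mkdir "$hash_dir"
: > "$scratch/bundle.pem"

# Mimic the read-only state that the script attributes to p11-kit.
chmod a-w "$hash_dir"

# As a non-root user this ln is expected to fail with "Permission denied";
# as root, CAP_DAC_OVERRIDE lets it succeed despite the missing write bit.
if ln -s "$scratch/bundle.pem" "$hash_dir/ca-certificates.crt" 2>/dev/null; then
    first_attempt=allowed
else
    first_attempt=denied
fi

# Restoring the owner's write bit makes the symlink creation work for non-root too.
chmod u+w "$hash_dir"
ln -sf "$scratch/bundle.pem" "$hash_dir/ca-certificates.crt"

echo "first attempt: $first_attempt (uid $(id -u))"
# Remove the scratch directory when done: rm -rf "$scratch"
```

Run as an unprivileged user, the first attempt should be reported as denied, which matches the error seen in the init container.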
I was able to verify that the current EE works if the init-bundle-ca-trust init container of the awx-task and awx-web deployments calls update-ca-trust extract --output /etc/pki/ca-trust/extracted. This internally sets USER_DEST in the script, which triggers the extra code branch that runs /usr/bin/chmod u+w and fixes up the permissions of the directory-hash directory.
Unfortunately I currently lack the time to submit this as a PR to awx-operator, as I'm unsure about the potential impact when considering other EEs with different script versions, but it might be an easy fix. As a workaround, which has been good enough for me, I'm now copying the old script into my AWX EE:
additional_build_files:
  - src: files/update-ca-trust
    dest: files
additional_build_steps:
  append_base:
    # Copy legacy update-ca-trust script for compatibility with AWX Operator
    - COPY --chmod=755 _build/files/update-ca-trust /usr/bin/update-ca-trust
This might also be of interest to @JoelKle who introduced this init container as part of PR #1846 in the awx-operator project. The initial idea was to run this as root, but due to OpenShift compatibility a non-privileged approach was taken, which worked fine - until the update-ca-trust script changed and broke this previously working solution.
If ansible-builder is not an option for you, you can also copy update-ca-trust from 24.6.1 into a custom init container image built from latest, using its Dockerfile (or Containerfile for you Podman folks).
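A rough sketch of that approach, written as a shell snippet that generates a two-stage Containerfile. It assumes the script lives at /usr/bin/update-ca-trust in both images, which matches the destination used in the ansible-builder workaround but has not been verified for every tag. The stage name legacy and the registry name in the comment are placeholders.

```shell
# Sketch: write a two-stage Containerfile into a scratch build context.
build_dir=$(mktemp -d)

cat > "$build_dir/Containerfile" <<'EOF'
# Stage that still ships the legacy script (tag taken from this thread).
FROM quay.io/ansible/awx-ee:24.6.1 AS legacy

# Current image whose update-ca-trust gets replaced by the legacy one.
FROM quay.io/ansible/awx-ee:latest
COPY --from=legacy /usr/bin/update-ca-trust /usr/bin/update-ca-trust
EOF

echo "Containerfile written to $build_dir"
# Build and push with your own tooling, for example (placeholder registry):
#   podman build -t registry.example.com/awx-ee-init:custom "$build_dir"
```

The resulting image can then be referenced through init_container_image and init_container_image_version in the AWX spec.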
Thank you @ppmathis for your great analysis on that problem. I've opened a PR with the solution you proposed: https://github.com/ansible/awx-operator/pull/1985
The awx-web and awx-task Kubernetes pods stop working with Init:CrashLoopBackOff. The cause was the init container's image, quay.io/ansible/awx-ee:latest, which failed with:
ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied
I manually edited the deployments to use quay.io/ansible/awx-ee:24.6.1 instead, and the pods came up again. Unfortunately, the awx-operator wants to change it back to the broken latest tag.