ansible / awx-ee

An Ansible execution environment for AWX project
https://quay.io/ansible/awx-ee

image awx-ee:latest broken for use with awx-operator #258

Open sgreinerCNS opened 2 months ago

sgreinerCNS commented 2 months ago

The awx-web and awx-task Kubernetes pods stop working with Init:CrashLoopBackOff.

The cause is the init container's image, quay.io/ansible/awx-ee:latest, which fails with:

ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied

I manually edited the deployments to use quay.io/ansible/awx-ee:24.6.1 instead, and the pods came up again. Unfortunately, the awx-operator keeps trying to change it back to the broken latest tag.

Jed-Giblin commented 1 month ago

Encountering the same issue. It can be reproduced by draining the node the pods are running on: the failure occurs on first boot on the new node. Recreating the pod on the new node restores functionality.

k8s info:

Client Version: v1.28.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.10

AWX Resource Details

Labels:       app.kubernetes.io/component=awx
              app.kubernetes.io/managed-by=awx-operator
              app.kubernetes.io/operator-version=2.12.2
              app.kubernetes.io/part-of=awx-prod
Annotations:  <none>
API Version:  awx.ansible.com/v1beta1
Kind:         AWX
Metadata:
  Creation Timestamp:  2024-04-08T18:06:54Z
  Generation:          2
  Resource Version:    79405586

Some extra configuration that might be relevant:

  web_extra_env:
    - name: LDAPTLS_CACERT
      value: /etc/pki/ca-trust/source/anchors/bundle-ca.crt

The file above, inside the container, is the CA for a local LDAP domain.

Status:
  Admin Password Secret:       <redact>
  Admin User:                  <redact>
  Broadcast Websocket Secret:  <redact>
  Conditions:
    Last Transition Time:         2024-10-14T12:51:29Z
    Reason:
    Status:                       False
    Type:                         Failure
    Last Transition Time:         2024-10-14T12:50:18Z
    Reason:                       Successful
    Status:                       True
    Type:                         Running
    Last Transition Time:         2024-10-14T13:16:05Z
    Reason:                       Successful
    Status:                       True
    Type:                         Successful
  Image:                          quay.io/ansible/awx:23.9.0
  Postgres Configuration Secret:  <redact>
  Secret Key Secret:              <redact>
  Version:                        23.9.0

sgreinerCNS commented 1 month ago

Our situation was similar, also involving an LDAPS CA and a CA bundle (required because of TLS deep inspection by security appliances).

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: cns-awx
  namespace: awx
spec:
  image_pull_policy: Always
  control_plane_ee_image: quay.io/ansible/awx-ee:23.3.0
  init_container_image: quay.io/ansible/awx-ee
  init_container_image_version: 24.6.1
  ingress_type: Ingress
  hostname: <redact>
  ingress_annotations: ""
  ingress_tls_secret: <redact>
  admin_user: <redact>
  admin_email: <redact>
  admin_password_secret: <redact>
  web_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  task_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  ldap_cacert_secret: <redact>
  bundle_cacert_secret: <redact>
  secret_key_secret: <redact>
  projects_persistence: true
  projects_existing_claim: cns-awx-storage-projects-claim
  postgres_storage_requirements:
    requests:
      storage: 4Gi
  postgres_storage_class: postgres

ldap_cacert_secret gets the "file" ldap-ca.crt and bundle_cacert_secret gets the "file" bundle-ca.crt via a secret.

By setting init_container_image and pinning init_container_image_version to 24.6.1, I was able to avoid the buggy awx-ee:latest, which cannot create ca-certificates.crt for some reason.

ppmathis commented 1 month ago

After digging around for a while, as I've been facing the same problem in a custom EE built from Rocky Linux 9, I found out that the issue is related to changes in the ca-certificates system package. The version 24.6.1 that still works has 2023.2.60_v7.0.306 installed, whereas the current latest is running 2024.2.69_v8.0.303.

After going through the RPM changelog, I noticed that not only have CA certificates been updated, but the update-ca-trust script itself has been greatly changed, as can be seen in the commit history: https://gitlab.com/redhat/centos-stream/rpms/ca-certificates/-/commits/c9s/update-ca-trust

The old script, which is also part of 24.6.1, is very trivial and can still be found here.

The new script, on the other hand, which was introduced here and whose latest version can be found here, is much more complex and does more than the old script.

One key change is that, in addition to calling /usr/bin/trust extract a couple of times, it now also executes /usr/bin/ln to create symlinks, specifically those in the directory-hash directory. That is what fails here due to a lack of permissions. As the script itself explains, p11-kit makes the directory-hash directory unwritable, and because we run as non-root, we do not have the benefit of CAP_DAC_OVERRIDE.
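The failure mode described above can be reproduced outside of AWX. The following is a minimal sketch, not the actual update-ca-trust logic: it uses a scratch directory in place of /etc/pki/ca-trust/extracted/pem, and the symlink target (../tls-ca-bundle.pem) is an assumption made for illustration. Run as an unprivileged user, the first ln should be denied; run as root, CAP_DAC_OVERRIDE lets it through.

```shell
#!/bin/sh
# Scratch directory standing in for /etc/pki/ca-trust/extracted/pem (illustrative only).
workdir=$(mktemp -d)
hashdir="$workdir/directory-hash"
mkdir -p "$hashdir"
: > "$workdir/tls-ca-bundle.pem"   # empty placeholder bundle; the target name is an assumption

# Make the directory unwritable, mimicking the state the script comments attribute to p11-kit.
chmod a-w "$hashdir"

# First attempt: fails for a non-root user, succeeds for root (CAP_DAC_OVERRIDE).
if ln -s ../tls-ca-bundle.pem "$hashdir/ca-certificates.crt" 2>/dev/null; then
    first_attempt=ok
else
    first_attempt=denied
fi

# Restore owner write permission (the directory owner may always chmod), then retry if needed.
chmod u+w "$hashdir"
if [ "$first_attempt" = denied ]; then
    ln -s ../tls-ca-bundle.pem "$hashdir/ca-certificates.crt"
fi

echo "first attempt: $first_attempt"
```

The point of the sketch is that the owner of the directory can restore write permission without any extra capability, which is why a chmod u+w before the ln is enough for a non-root init container.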

I was able to verify that the current EE works if the awx-task and awx-web deployments call update-ca-trust extract --output /etc/pki/ca-trust/extracted in the init-bundle-ca-trust init container. This sets USER_DEST inside the script, which triggers the extra code branch that runs /usr/bin/chmod u+w and fixes the permissions of the directory-hash directory.
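One way an init container could stay compatible with EEs that ship different script versions is to pick the invocation based on what the installed script supports. The sketch below is hypothetical and not taken from awx-operator: the helper name choose_trust_cmd is made up, and detecting support by searching the script text for the string --output is a heuristic that assumes the newer script mentions that flag literally. It only prints the command it would run.

```shell
#!/bin/sh
# choose_trust_cmd SCRIPT_PATH
# Hypothetical helper: print the update-ca-trust invocation to use for the given script file.
choose_trust_cmd() {
    script=$1
    # Heuristic (assumption): newer scripts mention the --output flag in their text.
    if grep -q -e '--output' "$script" 2>/dev/null; then
        echo "update-ca-trust extract --output /etc/pki/ca-trust/extracted"
    else
        # Legacy script, or script not found: fall back to the plain invocation.
        echo "update-ca-trust"
    fi
}

# Inspect whichever script is installed; fall back to the usual path if it is not on PATH.
installed=$(command -v update-ca-trust || echo /usr/bin/update-ca-trust)
echo "would run: $(choose_trust_cmd "$installed")"
```

Whether such a version check is needed at all depends on how older scripts treat the extra arguments, which this sketch does not verify.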

Unfortunately I currently lack the time to submit this as a PR to awx-operator, as I'm unsure about the potential impact when considering other EEs with different script versions, but it might be an easy fix. As a workaround, which has been good enough for me, I'm now copying the old script into my AWX EE:

additional_build_files:
  - src: files/update-ca-trust
    dest: files

additional_build_steps:
  append_base:
    # Copy legacy update-ca-trust script for compatibility with AWX Operator
    - COPY --chmod=755 _build/files/update-ca-trust /usr/bin/update-ca-trust

This might also be of interest to @JoelKle who introduced this init container as part of PR #1846 in the awx-operator project. The initial idea was to run this as root, but due to OpenShift compatibility a non-privileged approach was taken, which worked fine - until the update-ca-trust script changed and broke this previously working solution.

zendritic commented 3 weeks ago

If ansible-builder is not an option for you, you can also copy update-ca-trust from 24.6.1 into a custom init container image built from latest, using its Dockerfile (or Containerfile for you podman folks).
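A rough sketch of that approach, assuming a two-stage build is acceptable: the script below writes a Containerfile that takes /usr/bin/update-ca-trust from the 24.6.1 image and copies it over the one in latest. The stage name legacy and the registry name in the commented build command are placeholders, and the build itself is left commented out so the script has no side effects beyond a scratch directory.

```shell
#!/bin/sh
# Write a two-stage Containerfile into a scratch build context.
context=$(mktemp -d)
cat > "$context/Containerfile" <<'EOF'
# Stage 1: the release that still ships the legacy script.
FROM quay.io/ansible/awx-ee:24.6.1 AS legacy

# Stage 2: start from latest and overwrite the script with the legacy copy.
FROM quay.io/ansible/awx-ee:latest
COPY --from=legacy /usr/bin/update-ca-trust /usr/bin/update-ca-trust
EOF

echo "Containerfile written to $context/Containerfile"
# Build with your own tooling, for example (image name is a placeholder):
#   podman build -t registry.example.com/awx-init:legacy-trust "$context"
```

The resulting image could then be referenced through init_container_image and init_container_image_version in the AWX spec, in the same way the 24.6.1 pin is configured earlier in this thread.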

JoelKle commented 3 weeks ago

Thank you @ppmathis for your great analysis on that problem. I've opened a PR with the solution you proposed: https://github.com/ansible/awx-operator/pull/1985