ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖
https://www.github.com/ansible/awx
Apache License 2.0

AWX backup fails on K8s/K3s #1518

Open D1StrX opened 1 year ago

D1StrX commented 1 year ago


Bug Summary

On both k8s and k3s, with either the embedded or an external PostgreSQL database, the AWX backup fails with the exact same error:

TASK [backup : include_tasks] **************************************************
task path: /opt/ansible/roles/backup/tasks/creation.yml:37
included: /opt/ansible/roles/backup/tasks/postgres.yml for localhost

TASK [backup : Get PostgreSQL configuration] ***********************************
task path: /opt/ansible/roles/backup/tasks/postgres.yml:3
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: list object has no element 0\n\nThe error appears to be in '/opt/ansible/roles/backup/tasks/postgres.yml': line 3, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Get PostgreSQL configuration\n  ^ here\n"}

PLAY RECAP *********************************************************************
localhost                  : ok=16   changed=1    unreachable=0    failed=1    skipped=8    rescued=0    ignored=0

AWX Operator version

2.4.0 - 2.6.0

AWX version

AWX 22.5.0 - 23.2.0

Kubernetes platform

kubernetes

Kubernetes/Platform version

1.27.4 k8s/k3s

Modifications

no

Steps to reproduce

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-10-8-2023
  namespace: awx
spec:
  deployment_name: <name>
  backup_storage_class: '<sc>'
  backup_storage_requirements: '1Gi'

Expected results

Successful backup

Actual results

The error was: list object has no element 0.

When performing only the two k8sclusterinfo tasks locally in Ansible (one setting the this__awx fact, the other setting pg_config), it works fine.

When running this playbook inside the AWX operator container, it says ansible_operator_meta is undefined:

---
- name: Backup AWX
  hosts: localhost
  gather_facts: false
  roles:
    - backup
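
For anyone reproducing this outside the operator: ansible_operator_meta is normally injected by the ansible-operator runtime from the watched custom resource, so a standalone run of the role has to supply it by hand. A minimal sketch, assuming the AWX deployment and namespace are both called awx and that the role reads deployment_name (all values here are placeholders, not the reporter's exact setup):

---
- name: Backup AWX (manual run)
  hosts: localhost
  gather_facts: false
  vars:
    # Normally injected by the operator from the AWXBackup CR; supplied manually here
    ansible_operator_meta:
      name: awxbackup-manual
      namespace: awx
    deployment_name: awx
  roles:
    - backup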

Additional information

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-10-8-2023
  namespace: awx
spec:
  deployment_name: <name>
  backup_storage_class: '<sc>'
  backup_storage_requirements: '1Gi'

Operator Logs

TASK [backup : include_tasks] **************************************************
task path: /opt/ansible/roles/backup/tasks/creation.yml:37
included: /opt/ansible/roles/backup/tasks/postgres.yml for localhost

TASK [backup : Get PostgreSQL configuration] ***********************************
task path: /opt/ansible/roles/backup/tasks/postgres.yml:3
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: list object has no element 0\n\nThe error appears to be in '/opt/ansible/roles/backup/tasks/postgres.yml': line 3, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Get PostgreSQL configuration\n  ^ here\n"}

PLAY RECAP *********************************************************************
localhost                  : ok=16   changed=1    unreachable=0    failed=1    skipped=8    rescued=0    ignored=0

dbx-3 commented 12 months ago

I'm encountering the same error on an EKS cluster.

AWX Operator version: 2.5.2, AWX version: 23.0.0, Kubernetes version: 1.26

{
  "msg": "The task includes an option with an undefined variable. The error was: 'ansible_operator_meta' is undefined. 'ansible_operator_meta' is undefined\n\nThe error appears to be in '/runner/project/playbooks/roles/backup/tasks/creation.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Patching labels to {{ kind }} kind\n  ^ here\nWe could be wrong, but this one looks like it might be an issue with\nmissing quotes. Always quote template expression brackets when they\nstart a value. For instance:\n\n    with_items:\n      - {{ foo }}\n\nShould be written as:\n\n    with_items:\n      - \"{{ foo }}\"\n",
  "_ansible_no_log": false
}
D1StrX commented 10 months ago

@AlanCoding This is a bigger issue. The latest change for the partition table... broke the partition table (creation, I believe).

AlanCoding commented 10 months ago

Yeah, it could be the bug that https://github.com/ansible/awx/pull/14572 is trying to fix.

The commit that introduced the bug, https://github.com/ansible/awx/commit/f5922f76fa852fde2336fcd69c6db630ff8e72b7, made it into the last release.

arnaudmut commented 9 months ago

I have the same issue.

AWX Operator version: 2.8.0, AWX version: 23.5.0, Kubernetes v1.27.7

Fatal: [localhost] FAILED!
    Message: The task includes an option with an undefined variable. The error was: list object has no element 0. list object has no element 0.

    The error appears to be in '/opt/ansible/roles/backup/tasks/postgres.yml':
        Line: 3, Column: 3, but may be elsewhere in the file depending on the exact syntax problem.

    The offending line appears to be:

    - name: Get PostgreSQL configuration
      ^ here
godeater commented 6 months ago

I've just run into this issue myself, and I'm not clear why the issues referenced in @AlanCoding's post are relevant to it. The playbook/role is complaining about an undefined variable, which seems to me to be independent of the underlying problems with Postgres partitions. As far as I can see, the ansible_operator_meta variable should be supplied by the operator_sdk.util collection, but this doesn't appear to be the case (at least on my EKS cluster).

Error I'm seeing:

{
  "msg": "The task includes an option with an undefined variable. The error was: 'ansible_operator_meta' is undefined. 'ansible_operator_meta' is undefined\n\nThe error appears to be in '/runner/requirements_roles/srg_awx_backup/tasks/creation.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Patching labels to {{ kind }} kind\n  ^ here\nWe could be wrong, but this one looks like it might be an issue with\nmissing quotes. Always quote template expression brackets when they\nstart a value. For instance:\n\n    with_items:\n      - {{ foo }}\n\nShould be written as:\n\n    with_items:\n      - \"{{ foo }}\"\n",
  "_ansible_no_log": false
}
D1StrX commented 4 months ago

I am happy to share that the backup with AWXBackup now works for me. Not sure when a patch was released or what else changed, but the backup part works. I haven't tried a restore yet.

D1StrX commented 4 months ago

I've just run into this issue myself, and I'm not clear why the issues referenced in @AlanCoding's post are relevant to it. The playbook/role is complaining about an undefined variable, which seems to me to be independent of the underlying problems with Postgres partitions. As far as I can see, the ansible_operator_meta variable should be supplied by the operator_sdk.util collection, but this doesn't appear to be the case (at least on my EKS cluster).

Error I'm seeing:

{
  "msg": "The task includes an option with an undefined variable. The error was: 'ansible_operator_meta' is undefined. 'ansible_operator_meta' is undefined\n\nThe error appears to be in '/runner/requirements_roles/srg_awx_backup/tasks/creation.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Patching labels to {{ kind }} kind\n  ^ here\nWe could be wrong, but this one looks like it might be an issue with\nmissing quotes. Always quote template expression brackets when they\nstart a value. For instance:\n\n    with_items:\n      - {{ foo }}\n\nShould be written as:\n\n    with_items:\n      - \"{{ foo }}\"\n",
  "_ansible_no_log": false
}

Perhaps a good time to try again? @godeater

D1StrX commented 2 months ago

And I have to mention that backups don't work anymore... just like https://github.com/ansible/awx-operator/issues/879#issuecomment-2166487493, I see the same behavior @vivekshete9 describes: endless re-spinning up. I use the most basic example as described in the docs.

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: <name>
  namespace: <namespace>
spec:
  deployment_name: <deploymentname>
  backup_storage_class: "<storageclass>"
  backup_storage_requirements: "1Gi"
  backup_pvc_namespace: "<namespace>"
  image_pull_policy: "IfNotPresent"
  clean_backup_on_delete: false
  no_log: true

With a container attached to the backup PVC, after a backup is "created" I see a lot of tower-openshift-backup-xxxxx directories with a tower.db inside. No clue whether this is valid data.

Even with no_log: false I see zero output, not even in the awxbackups object (kubectl describe awxbackups.awx.ansible.com):

...
Status:
  Conditions:
    Last Transition Time:  2024-06-19T20:22:43Z
    Reason:                Failed
    Status:                True
    Type:                  Failure
    Last Transition Time:  2024-06-19T20:22:43Z
    Reason:
    Status:                False
    Type:                  Successful
    Last Transition Time:  2024-06-19T20:23:35Z
    Reason:                Running
    Status:                True
    Type:                  Running
Events:                    <none>
bryan-srg commented 2 months ago

And I have to mention that backups don't work anymore... just like #879 (comment), I see the same behavior @vivekshete9 describes: endless re-spinning up. I use the most basic example as described in the docs.

Have you had a look at https://github.com/ansible/awx-operator/issues/1908 ?

It seems the backup object itself doesn't log anything, but I found that as soon as you apply the backup manifest to your cluster, the awx-operator logs do show what's going on; in my case at least (as above), it's because pg_dump is segfaulting.
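
For reference, the operator logs mentioned here can usually be followed while the backup reconciles with something like the command below; the deployment and container names match a default kustomize install of awx-operator and may differ in other setups:

# Tail the awx-operator controller logs (adjust namespace and names to your install)
kubectl -n awx logs -f deployment/awx-operator-controller-manager -c awx-manager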

D1StrX commented 2 months ago

I see; I'm also seeing the same errors in my operator as described in #1908.

bryan-srg commented 2 months ago

I see; I'm also seeing the same errors in my operator as described in #1908.

I fixed the pg_dump segfaults by switching the Postgres image used for the backup to the official one rather than the sclorg one (see #1908 again). Does that help your case?
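
For reference, the image override described here uses the same AWXBackup spec fields that appear in the CronJob later in this thread (_postgres_image and _postgres_image_version); a minimal sketch with placeholder names:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-official-pg # placeholder name
  namespace: awx
spec:
  deployment_name: awx
  backup_storage_class: "<sc>"
  backup_storage_requirements: "1Gi"
  # Workaround from #1908: use the official Postgres image instead of the sclorg one
  _postgres_image: docker.io/postgres
  _postgres_image_version: 15-alpine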

D1StrX commented 2 months ago

Using the custom image indeed results in a successful backup, but I see this as a workaround rather than a fix. Also, the documentation doesn't cover AWXBackup and its possible options: https://ansible.readthedocs.io/projects/awx/en/latest/search.html?q=backup&check_keywords=yes&area=default

bryan-srg commented 2 months ago

I agree it's a workaround, and that the docs could be better (I had to dig into the source to find that you could override the image with those options).

Unfortunately it doesn't seem like anyone from the project is paying attention to this issue (and fair enough, it's open source, not paid work; they can choose to do what they like), so we are where we are. ¯\_(ツ)_/¯

parkerfath commented 1 month ago

Ran into this today with Operator 2.16.1. Same error: "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'backupClaim'. 'dict object' has no attribute 'backupClaim'"

So it seems like backup and restore are just fully broken for AWX? Has anyone found a good fallback plan for resilience? Make a backup of your DB and hope you can piece it all together in the event of your data being lost?

EDIT: I see "workaround" mentioned in the comment above this one, and also in #1902, but this one also refers to #1908, and that one was closed as a dupe of #1895. Could someone who understands the issue a little better post a concise summary of the workaround steps in one place?

D1StrX commented 1 month ago

I have come to the conclusion that the AWXBackup CRD and its backend code are not written to leverage Kubernetes-native capabilities; that's from an admin perspective. IMHO, AWXBackup should create a Kubernetes CronJob that can be scheduled to your preference. Instead, you currently have to create the resource object each time, and creating this object from AWX/Tower requires Kubernetes credentials, which is neither the easiest thing to set up with AWX/Tower nor the best approach.

It also lacks the option to write the backup directly to another storage backend, such as S3 (e.g. AWS or MinIO). For now I have created the resources below to automate the backup process, all from within a K8s cluster. If you want to use this, adjust and verify the values according to your environment:

All resources:

- RBAC (Role, RoleBinding, ServiceAccount)
- Secret (S3 credentials)
- Cronjob create awxbackup
- Cronjob S3 upload (NOTE: this uploads all directories in the backup path and keeps only the most recent one until the next run, at which point the new most recent directory is kept)
- Cronjob cleanup awxbackup
- Check backups (busybox pod)

RBAC

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: awx-backup-role
  namespace: awx # Replace with your namespace
rules:
  - apiGroups:
      - awx.ansible.com
    resources:
      - awxbackups
    verbs:
      - get
      - create
      - list
      - watch
      - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: awx-backup-rolebinding
  namespace: awx # Replace with your namespace
subjects:
  - kind: ServiceAccount
    name: awx-backup-sa
    namespace: awx # Replace with your namespace
roleRef:
  kind: Role
  name: awx-backup-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: awx-backup-sa
  namespace: awx # Replace with your namespace

Cronjobs

Cronjob create awxbackup

apiVersion: batch/v1
kind: CronJob
metadata:
  name: create-awxbackup
  namespace: awx # Replace with your namespace
spec:
  schedule: "45 2 * * *" # Runs daily at 2:45 AM (UTC)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: awx-backup-sa
          containers:
            - name: create-awx-backup
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  cat <<EOF | kubectl apply -f -
                  apiVersion: awx.ansible.com/v1beta1
                  kind: AWXBackup
                  metadata:
                    name: awxbackup-$(date +'%Y-%m-%d-%H-%M-%S')
                    namespace: awx
                  spec:
                    deployment_name: awx
                    backup_storage_class: "<storageclass>"
                    _postgres_image: docker.io/postgres
                    _postgres_image_version: 15-alpine
                    backup_storage_requirements: "1Gi"
                    backup_pvc_namespace: "awx"
                    image_pull_policy: "IfNotPresent"
                    clean_backup_on_delete: false # Leave false; if true, the backup PVC is deleted when the AWXBackup resource is deleted
                    no_log: true
                  EOF
          restartPolicy: OnFailure
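
To test this without waiting for the schedule, a one-off Job can be created from the CronJob (name as defined above):

kubectl -n awx create job create-awxbackup-manual --from=cronjob/create-awxbackup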

Cronjob S3 upload

apiVersion: batch/v1
kind: CronJob
metadata:
  name: s3-upload-awxbackup
  namespace: awx # Replace with your namespace
spec:
  schedule: "0 3 * * *" # Runs daily at 3:00 AM (UTC)
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup-container
              image: amazon/aws-cli
              envFrom:
                - secretRef:
                    name: s3-credentials-awx-backup
              command:
                - /bin/bash
                - -c
                - |
                  aws s3 cp /backupdata s3://<bucket>/ --recursive --endpoint-url https://subdomain.domain.tld:api_port
                  # Define the directory
                  DIR="/backupdata"

                  # Find the latest directory and store its name
                  LATEST_DIR=$(ls -td ${DIR}/*/ | head -n 1)

                  # Remove trailing slash from directory name
                  LATEST_DIR=${LATEST_DIR%/}
                  echo "Latest dir:" $LATEST_DIR

                  # Delete all directories except the latest one
                  find ${DIR} -maxdepth 1 -type d ! -path "${LATEST_DIR}" ! -path "${DIR}" -exec rm -rf {} +
              volumeMounts:
                - name: data-volume
                  mountPath: /backupdata
          restartPolicy: OnFailure
          volumes:
            - name: data-volume
              persistentVolumeClaim:
                claimName: awx-backup-claim

Cronjob cleanup awxbackup

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-awxbackup
  namespace: awx # Replace with your namespace
spec:
  schedule: "30 3 * * *" # Runs daily at 3:30 AM (UTC)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: awx-backup-sa
          containers:
            - name: cleanup-backups
              image: bitnami/kubectl:latest # A lightweight image with kubectl installed
              command:
                - /bin/bash
                - -c
                - |
                  namespace="awx" # Replace with your namespace
                  date_limit=$(date -d '7 days ago' --utc +'%Y-%m-%dT%H:%M:%SZ')
                  echo "Date limit: $date_limit"

                  # List all AWXBackup resources and their creation timestamps
                  backups=$(kubectl get awxbackups -n "$namespace" -o jsonpath='{.items[*].metadata.name}')
                  echo "All backups:" $backups

                  # Loop through backups and delete those older than the date limit
                  for backup in $backups; do
                    # Get the creation timestamp of the backup
                    creation_time=$(kubectl get awxbackup "$backup" -n "$namespace" -o jsonpath='{.metadata.creationTimestamp}')
                    # Compare the creation timestamp with the date limit
                    if [[ "$creation_time" < "$date_limit" ]]; then
                      kubectl delete awxbackup "$backup" -n "$namespace"
                      echo "Deleted backup: $backup"
                    fi
                  done
          restartPolicy: OnFailure
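
To preview what the cleanup job will operate on, the existing backups can be listed oldest-first:

kubectl get awxbackups -n awx --sort-by=.metadata.creationTimestamp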

Secret

apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials-awx-backup
  namespace: awx # Replace with your namespace
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <base64 access_key>
  AWS_SECRET_ACCESS_KEY: <base64 secret_access_key>
  AWS_DEFAULT_REGION: <base64 region>
  AWS_ENDPOINT_URL: <base64 https://subdomain.domain.tld:api_port>
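
Rather than base64-encoding the values by hand, the same Secret can be created imperatively (values are placeholders):

kubectl -n awx create secret generic s3-credentials-awx-backup \
  --from-literal=AWS_ACCESS_KEY_ID='<access_key>' \
  --from-literal=AWS_SECRET_ACCESS_KEY='<secret_access_key>' \
  --from-literal=AWS_DEFAULT_REGION='<region>' \
  --from-literal=AWS_ENDPOINT_URL='https://subdomain.domain.tld:api_port'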

Check backups

apiVersion: v1
kind: Pod
metadata:
  name: busybox-pod
  namespace: awx # Replace with your namespace
spec:
  volumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: awx-backup-claim
  containers:
    - name: busybox-container
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data-volume
          mountPath: /backupdata
      resources:
        requests:
          cpu: 100m
          memory: 100Mi
        limits:
          cpu: 100m
          memory: 100Mi

kubectl exec -it busybox-pod -n awx -- sh
cd /backupdata
ls

parkerfath commented 1 month ago

I have come to the conclusion that the AWXBackup CRD and backend code is not written to leverage Kubernetes native capabilities.

Wow, thanks for putting all that together. I was coming to a similar conclusion myself; I've ended up writing a bunch of bash/curl commands to export most of what I need via the API, and documenting some of the other details in an internal company wiki. I agree with your key point that a proper AWX backup needs to be easy to move to a different location, outside the cluster. That, plus struggles to get the backup and restore roles working fully, was one of my main reasons for putting in the work to script the export/import myself.
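
As an illustration of that kind of API-driven export (not parkerfath's actual scripts), a minimal sketch against the standard AWX v2 REST API, with host and token as placeholders, could look like:

# Export a few common object types via the AWX v2 API
# (host/token are placeholders; pagination beyond one page is left out for brevity)
AWX_HOST="https://awx.example.com"
AWX_TOKEN="<token>"
for resource in organizations projects inventories credentials job_templates workflow_job_templates; do
  curl -sf -H "Authorization: Bearer ${AWX_TOKEN}" \
    "${AWX_HOST}/api/v2/${resource}/?page_size=200" > "export_${resource}.json"
done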

D1StrX commented 1 month ago

Another option is to define every AWX component in Ansible playbooks. I've done that too, so everything is documented and stateful: a native Ansible approach compared to bash scripts. https://docs.ansible.com/ansible/latest/collections/awx/awx/index.html
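
For anyone unfamiliar with that collection, a minimal sketch of configuration-as-code with awx.awx modules (host, token, and object names are placeholders, not a complete inventory of options):

---
- name: Define AWX configuration as code
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Ensure organization exists
      awx.awx.organization:
        name: Example Org
        state: present
        controller_host: "https://awx.example.com"
        controller_oauthtoken: "{{ awx_token }}"

    - name: Ensure project exists
      awx.awx.project:
        name: Example Project
        organization: Example Org
        scm_type: git
        scm_url: https://github.com/example/playbooks.git
        state: present
        controller_host: "https://awx.example.com"
        controller_oauthtoken: "{{ awx_token }}"

Running a playbook like this against a restored (or fresh) instance recreates the objects declaratively, which complements a database-level backup.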