ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖
https://www.github.com/ansible/awx
Apache License 2.0

pg_dump seg faulting when trying to run awx backup role #1908

Closed bryan-srg closed 4 months ago

bryan-srg commented 4 months ago

Bug Summary

AWX backup role is failing to create a fully formed backup on the Persistent Volume.

AWX Operator version

2.17.0

AWX version

24.4.0

Kubernetes platform

other (please specify in additional information)

Kubernetes/Platform version

AWS EKS 1.28

Modifications

yes

Steps to reproduce

Apply the following manifest to our AWX deployment:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awx-backup-2024-06-26
  namespace: awx
spec:
  deployment_name: awx
  no_log: false

Then stream the logs of the awx-operator-controller-manager, and keep an eye out for the failure.

bash: line 18:    28 Segmentation fault      (core dumped) PGPASSWORD='REDACTED' pg_dump --clean --create -h awx.cluster-cupjhqaf3dto.ap-southeast-2.rds.amazonaws.com -U awx -d awx -p 3306 -F custom > /backups/tower-openshift-backup-2024-06-26-030439/tower.db
Terminated

Checking the content of the PV (an AWS EBS volume) I find this at the root of the filesystem:

/ # cd /awx-backup/
/awx-backup # ls -l
total 36
drwx------    2 26       root         16384 Jun 26 02:30 lost+found
drwxr-xr-x    2 26       tape          4096 Jun 26 02:50 tower-openshift-backup-2024-06-26-025022
drwxr-xr-x    2 26       tape          4096 Jun 26 02:51 tower-openshift-backup-2024-06-26-025140
drwxr-xr-x    2 26       tape          4096 Jun 26 02:52 tower-openshift-backup-2024-06-26-025246
drwxr-xr-x    2 26       tape          4096 Jun 26 03:04 tower-openshift-backup-2024-06-26-030439
drwxr-xr-x    2 26       tape          4096 Jun 26 03:05 tower-openshift-backup-2024-06-26-030552

and each one of the tower-openshift-backup-* directories contains a 0 byte tower.db file:

/awx-backup # cd tower-openshift-backup-2024-06-26-030552/
/awx-backup/tower-openshift-backup-2024-06-26-030552 # ls -l
total 0
-rw-r--r--    1 26       tape             0 Jun 26 03:06 tower.db
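
A quick way to spot this kind of silent failure is to scan the backup volume for dumps that are missing or empty. The following is a minimal sketch, assuming the directory layout shown in the listing above (`tower-openshift-backup-*/tower.db`); the function name `find_empty_dumps` is hypothetical and not part of the operator.

```shell
# Sketch: print every backup directory whose tower.db is missing or zero bytes.
# Assumes the layout seen above: <backup_root>/tower-openshift-backup-*/tower.db
find_empty_dumps() {
  backup_root="$1"
  for dir in "$backup_root"/tower-openshift-backup-*; do
    # Skip the unexpanded glob pattern and any non-directory entries
    [ -d "$dir" ] || continue
    # -s is true only when the file exists and is larger than zero bytes
    if [ ! -s "$dir/tower.db" ]; then
      printf '%s\n' "$dir"
    fi
  done
}
```

Running `find_empty_dumps /awx-backup` inside a pod that mounts the PV would list each directory with an unusable dump. An empty result only means the files are non-empty, not that the dumps are restorable.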

Expected results

The backup should complete and write a non-empty tower.db database dump to the Persistent Volume.

Actual results

pg_dump segfaults when trying to write the database dump to the Persistent Volume.
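
The operator log in this report shows the failing task returning `"rc": 139`. By shell convention, an exit status above 128 means the process was killed by signal number `status - 128`, so 139 corresponds to signal 11. A minimal sketch for decoding such a status follows; the function name `signal_from_rc` is hypothetical.

```shell
# Sketch: translate a shell exit status into the signal that ended the process.
# Statuses above 128 encode "128 + signal number".
signal_from_rc() {
  rc="$1"
  if [ "$rc" -gt 128 ]; then
    # kill -l <number> prints the signal name for that signal number
    kill -l "$((rc - 128))"
  else
    printf 'exit status %s is not a signal exit\n' "$rc"
  fi
}
```

On Linux, `signal_from_rc 139` prints `SEGV`, which matches the "Segmentation fault" message from bash.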

Additional information

Postgres is external, hosted on AWS RDS Aurora Postgres (v15 compatible).

Operator Logs

TASK [backup : Write pg_dump to backup on PVC] *********************************
task path: /opt/ansible/roles/backup/tasks/postgres.yml:127
fatal: [localhost]: FAILED! => {"changed": true, "failed_when_result": true, "rc": 139, "return_code": 139, "stderr": "bash: line 18:    28 Segmentation fault      (core dumped) PGPASSWORD='REDACTED' pg_dump --clean --create -h awx.cluster-cupjhqaf3dto.ap-southeast-2.rds.amazonaws.com -U awx -d awx -p 3306 -F custom > /backups/tower-openshift-backup-2024-06-26-030439/tower.db
Terminated
", "stderr_lines": ["bash: line 18:    28 Segmentation fault      (core dumped) PGPASSWORD='REDACTED' pg_dump --clean --create -h awx.cluster-cupjhqaf3dto.ap-southeast-2.rds.amazonaws.com -U awx -d awx -p 3306 -F custom > /backups/tower-openshift-backup-2024-06-26-030439/tower.db", "Terminated"], "stdout": "keepalive_pid: 27
Dumping data from database...
", "stdout_lines": ["keepalive_pid: 27", "Dumping data from database..."]}
bryan-srg commented 4 months ago

I tried running the same container image that the backup role uses from a local Docker instance, connecting directly to the database. Here's what I found:

#  docker container run -e PGPASSWORD 2189 pg_dump --clean --create -h 10.110.51.152 -U awx -d awx -p 3306 -F custom > tower_1.db
pg_dump: error: connection to server at "10.110.51.152", port 3306 failed: FATAL:  no PostgreSQL user name specified in startup packet
connection to server at "10.110.51.152", port 3306 failed: FATAL:  no PostgreSQL user name specified in startup packet
double free or corruption (out)

So I think there might be something wrong with the sclorg/postgresql image.

Trying the same thing using the official postgres container is successful - pg_dump runs fine and starts dumping the database content out.

bryan-srg commented 4 months ago

I've updated my manifest to pull the official postgres image instead of the sclorg one:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awx-backup-2024-06-26
  namespace: awx
spec:
  deployment_name: awx
  no_log: false
  _postgres_image: docker.io/postgres
  _postgres_image_version: 15-alpine

This has progressed the AWX backup a bit. The 'tower.db' file and 'awx_object' files are written to the PV. Unfortunately, I'm now hitting yet another "The task includes an option with an undefined variable" error - this time in dump_secret.yml.

Relevant bit of log follows:

TASK [backup : Dump secret names from awx spec and data into file] *************
task path: /opt/ansible/roles/backup/tasks/secrets.yml:11
included: /opt/ansible/roles/backup/tasks/dump_secret.yml for localhost => (item=route_tls_secret)
included: /opt/ansible/roles/backup/tasks/dump_secret.yml for localhost => (item=ingress_tls_secret)
included: /opt/ansible/roles/backup/tasks/dump_secret.yml for localhost => (item=ldap_cacert_secret)
included: /opt/ansible/roles/backup/tasks/dump_secret.yml for localhost => (item=bundle_cacert_secret)
included: /opt/ansible/roles/backup/tasks/dump_secret.yml for localhost => (item=ee_pull_credentials_secret)

TASK [backup : Get Secret Name] ************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:3
ok: [localhost] => {"ansible_facts": {"_name": ""}, "changed": false}

TASK [backup : Get secret] *****************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:9
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Set secret key] *************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:18
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Create and Add secret names and data to dictionary] *************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:24
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Get Secret Name] ************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:3
ok: [localhost] => {"ansible_facts": {"_name": ""}, "changed": false}

TASK [backup : Get secret] *****************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:9
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Set secret key] *************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:18
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Create and Add secret names and data to dictionary] *************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:24
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Get Secret Name] ************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:3
ok: [localhost] => {"ansible_facts": {"_name": ""}, "changed": false}

TASK [backup : Get secret] *****************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:9
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Set secret key] *************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:18
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Create and Add secret names and data to dictionary] *************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:24
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Get Secret Name] ************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:3
ok: [localhost] => {"ansible_facts": {"_name": ""}, "changed": false}

TASK [backup : Get secret] *****************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:9
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Set secret key] *************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:18
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Create and Add secret names and data to dictionary] *************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:24
skipping: [localhost] => {"changed": false, "false_condition": "_name != ''", "skip_reason": "Conditional result was False"}

TASK [backup : Get Secret Name] ************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:3
ok: [localhost] => {"ansible_facts": {"_name": "ee-pull-credentials"}, "changed": false}

TASK [backup : Get secret] *****************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:9
ok: [localhost] => {"api_found": true, "changed": false, "resources": []}

TASK [backup : Set secret key] *************************************************
task path: /opt/ansible/roles/backup/tasks/dump_secret.yml:18
fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: list object has no element 0. list object has no element 0

The error appears to be in '/opt/ansible/roles/backup/tasks/dump_secret.yml': line 18, column 9, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:


      - name: Set secret key
        ^ here
"}

PLAY RECAP *********************************************************************
localhost                  : ok=68   changed=7    unreachable=0    failed=1    skipped=39   rescued=0    ignored=0

bryan-srg commented 4 months ago

Okay, I finally figured it out. The spec file for the AWX operator that I originally deployed contained this line:

ee_pull_credentials_secret: ee-pull-credentials

But it seems that secret was never actually created, so the playbook could not find its content to dump during the backup. I've "fixed" this by creating a dummy credential and putting it in that secret in the AWX deployment's namespace, which allowed the backup to complete successfully. Most of AWX's functionality does not seem to be affected by the missing secret, so in my opinion the backup role could be made more defensive: if it can't find a secret, it should skip backing it up and continue with the rest of the backup.

Is that a valid opinion?
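
Until the role handles this case itself, a pre-flight check before applying the AWXBackup manifest can catch secrets that the spec references but that do not exist. This is a minimal sketch, assuming `kubectl` is configured for the cluster; the function name `missing_secrets` is hypothetical, and the namespace and secret name come from this report.

```shell
# Sketch: print the names of referenced secrets that are absent from a namespace.
# Usage: missing_secrets <namespace> <secret-name>...
missing_secrets() {
  namespace="$1"
  shift
  for name in "$@"; do
    # Ignore unset entries, like the empty "_name" values the role skips in the log
    [ -n "$name" ] || continue
    # kubectl exits non-zero when the secret cannot be fetched
    if ! kubectl get secret "$name" --namespace "$namespace" >/dev/null 2>&1; then
      printf '%s\n' "$name"
    fi
  done
}

# Example: missing_secrets awx ee-pull-credentials
```

Note that `kubectl get` also fails on permission or connectivity errors, so a name printed here means "could not be fetched", which is not always the same as "does not exist".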

bryan-srg commented 4 months ago

This seems to be a duplicate of #1895, so I'll close it.