ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖
https://www.github.com/ansible/awx
Apache License 2.0

Backup and Restore modifies pg secret which puts original deployment in a bad state #1544

Open rooftopcellist opened 1 year ago

rooftopcellist commented 1 year ago

Bug Summary

If a Restore is done in the same namespace as the original deployment, both deployments use the same postgres-configuration secret. The problem is that the restore modifies this secret to point at the new host. For example, for deployments awx-1 and awx-2, the resolvable hosts, based on the Service resources created, are awx-1-postgres-13 and awx-2-postgres-13 respectively.

Currently, when a restore is done, the secret is modified so that the original host: awx-1-postgres-13 value is replaced with host: awx-2-postgres-13. As a result, the awx-operator's reconciliation loop fails repeatedly when reconciling the initial awx-1 deployment, specifically at the point where the host value from the postgres-configuration secret is used.

The following error can be found in the operator logs:

raise ex.with_traceback(None)", "django.db.utils.OperationalError: connection is bad: Name or service not known"]
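
To make the failure mode concrete, here is a toy model of two deployments reading one shared secret. It is only a sketch of the behaviour described above, not the operator's actual code: the secret name, the dictionary layout, and the `restore` helper are all illustrative assumptions.

```python
# Toy model: two deployments that both read one shared secret.
# The secret name and data layout are assumptions made for illustration.
secrets = {
    "postgres-configuration": {"host": "awx-1-postgres-13"},
}

# Both deployments reference the same secret (assumed wiring for this sketch).
deployments = {
    "awx-1": "postgres-configuration",
    "awx-2": "postgres-configuration",
}


def db_host(deployment: str) -> str:
    """Return the database host a deployment would read from its secret."""
    return secrets[deployments[deployment]]["host"]


def restore(new_deployment: str) -> None:
    """Mimic the restore rewriting the host in place, as the issue describes."""
    secrets[deployments[new_deployment]]["host"] = f"{new_deployment}-postgres-13"


before = db_host("awx-1")
restore("awx-2")
after = db_host("awx-1")
print(before, "->", after)  # awx-1-postgres-13 -> awx-2-postgres-13
```

After the restore, awx-1 reads the host meant for awx-2, which is the state in which its reconciliation starts failing.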

The current workaround is to delete the original deployment and rely on the backup. This is sufficient for most use cases, but it is still a bug that should be fixed imo.

AWX Operator version

2.5.1

AWX version

22.7.0

Kubernetes platform

openshift

Kubernetes/Platform version

v4.11

Modifications

no

Steps to reproduce

Backup and Restore testing

I created a backup, which ran successfully:

$ cat hacking/backup-awx.yml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-2023-08-29
spec:
  deployment_name: awx
  no_log: false

Then after the reconciliation loop stopped for the backup, I created a restore:

$ cat hacking/restore-awx.yml 
---
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore1
  # namespace: ca-awx
spec:
  deployment_name: awx-2
  backup_name: awxbackup-2023-08-29
  no_log: false

The reconciliation loop then shows errors when reconciling the initial deployment.

Expected results

The awx-1 and awx-2 deployments should be able to live together in the same namespace, and the restore role should not interfere with the original deployment.

Actual results

Errors in the reconciliation loop.

Additional information

To fix this, we may be able to generate a new postgres secret name by appending a unique hash when a secret with the same name already exists in the namespace.
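
A minimal sketch of that naming idea, under stated assumptions: `unique_secret_name` is a hypothetical helper, the set of existing names stands in for whatever lookup the operator would do against the namespace, and the 8-character SHA-256 suffix is an arbitrary choice. The operator itself is written in Ansible, so this only illustrates the logic.

```python
import hashlib


def unique_secret_name(base: str, deployment: str, existing: set[str]) -> str:
    """Return `base` if it is free, else `base` plus a short hash suffix.

    Hypothetical helper: `existing` stands in for the secret names already
    present in the namespace.
    """
    if base not in existing:
        return base
    attempt = 0
    while True:
        # Derive the suffix from the deployment name so the result is
        # reproducible; bump `attempt` only if that name is taken as well.
        digest = hashlib.sha256(f"{deployment}:{attempt}".encode()).hexdigest()[:8]
        candidate = f"{base}-{digest}"
        if candidate not in existing:
            return candidate
        attempt += 1


existing_names = {"postgres-configuration"}
free_name = unique_secret_name("other-configuration", "awx-2", existing_names)
new_name = unique_secret_name("postgres-configuration", "awx-2", existing_names)
print(free_name)
print(new_name)
```

Deriving the suffix from the deployment name rather than from random data keeps the generated name stable across reconciliation runs, which matters if the lookup is repeated on every loop.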

Operator Logs

No response

rooftopcellist commented 12 months ago

The workaround for this is to delete the awx-1 deployment after taking a backup and its secrets, then restore from the backup. However, it would be nice to not have to do this.

Another option would be to migrate the backup PVC to another namespace and restore there, but that is a hassle.