ansible / awx-operator

An Ansible AWX operator for Kubernetes built with Operator SDK and Ansible. 🤖

Postgres 15 pod: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied #1770

Closed Tfinn92 closed 7 months ago

Tfinn92 commented 7 months ago

Please confirm the following

Bug Summary

Updating to 2.13.1 through Helm results in the postgres 15 pod failing with the following error: "mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied"

AWX Operator version

2.13.1

AWX version

24.0.0

Kubernetes platform

kubernetes

Kubernetes/Platform version

Rancher RKE2 v1.26.8+rke2r1 and another on v1.27.10+rke2r1

Modifications

no

Steps to reproduce

Have a cluster with 2.12.2 installed and run helm upgrade awx-operator awx-operator/awx-operator
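
As a rough sketch, the upgrade that triggers this looks like the following (assuming the awx-operator Helm repo is already added and the release lives in the awx namespace):

# refresh the chart index so the new operator version is visible
helm repo update
# upgrade the existing 2.12.2 release in place
helm upgrade awx-operator awx-operator/awx-operator -n awx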

Expected results

Pods come up with no problem.

Actual results

The postgres 15 pod goes into CrashLoopBackOff. Logs show "mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied"

Additional information

No response

Operator Logs

No response

kurokobo commented 7 months ago

@Tfinn92 The new PSQL runs as UID 26, so the PV has to be writable by this UID.

kurokobo commented 7 months ago

@TheRealHaoLiu @fosterseth The same consideration applies to backups: UID 26 has to have write permission on the PV to be able to create the backup directory.
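
As a rough illustration only (not an official procedure): from any root-capable pod that mounts the PV in question, ownership can be checked and corrected along these lines (the /backups mount path is a placeholder):

ls -ln /backups            # the owning uid should be 26 for the new PSQL image
chown -R 26:26 /backups    # give the non-root postgres user write access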

mooky31 commented 7 months ago

Same problem here

TheRealHaoLiu commented 7 months ago

@kurokobo where should that modification be made? through a root user init container or is there something that could be done when setting up the PV

mooky31 commented 7 months ago

@kurokobo I think your advice is only valid if you use PV on local storage. Since I use rook-ceph, I can't set rights on the filesystem. Note : it was working with 2.13.0

JSN-1 commented 7 months ago

I had to create the volume, scale down the deployment/statefulset, mount the volume into another pod, and run

mkdir userdata
chown 26:26 userdata

After that, the pod started and the upgrade continued.

TheRealHaoLiu commented 7 months ago

An init container that runs as root and does the chown?
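
A rough sketch of what such an init container could look like on the postgres StatefulSet; the name and image are illustrative, not what the operator actually templates (volume name and subPath match the pod spec shown later in this thread):

initContainers:
  - name: init-chown-data
    image: busybox
    # run as root so the chown is allowed; the main container keeps running as UID 26
    securityContext:
      runAsUser: 0
    command:
      - sh
      - -c
      - chown -R 26:26 /var/lib/pgsql/data
    volumeMounts:
      - name: postgres-15
        mountPath: /var/lib/pgsql/data
        subPath: data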

Tfinn92 commented 7 months ago

@Tfinn92 The new PSQL runs as UID 26, so the PV has to be writable by this UID.

While this is true, the old postgres 13 container previously deployed by the operator was running as root, so it seems the devs got used to that freedom and tried applying the same logic to the 15 container, which, as we are seeing, fails.

(screenshot)

Tfinn92 commented 7 months ago

Heck, even looking in the PG13 container, the permissions expected now wouldn't be possible without the root user:

(screenshot)

Obviously the pathing is a little different as well, but I imagine the same principles could be applied to the PG15 container

jyanesnotariado commented 7 months ago

This issue is not limited to upgrades. I'm trying to start a new AWX instance from scratch and ran into the same problem.

mooky31 commented 7 months ago

I confirm, this was also a new install for me.

fosterseth commented 7 months ago

https://github.com/ansible/awx-operator/compare/devel...fosterseth:add_postgres_init_container?expand=1

@mooky31 @jyanesancert @Tfinn92 maybe something like that could help?

you can deploy image quay.io/fosterseth/awx-operator:postgres_init which has that change

To use it, add whatever commands you want to your AWX spec, e.g.

  init_postgres_extra_commands: |
    sudo touch /var/lib/pgsql/data/foo
    sudo touch /var/lib/pgsql/data/bar
    chown 26:26 /var/lib/pgsql/data/foo
    chown root:root /var/lib/pgsql/data/bar

So in your case, maybe mkdir /var/lib/pgsql/data/userdata and chmod/chown it for user 26.

If that works for you, let me know and we can get this change into devel.
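
Applied to this particular failure, the spec snippet would presumably look something like this (untested sketch against the test image above; it just wraps the mkdir/chown/chmod suggested in the comment):

spec:
  init_postgres_extra_commands: |
    mkdir -p /var/lib/pgsql/data/userdata
    chown 26:26 /var/lib/pgsql/data/userdata
    chmod 700 /var/lib/pgsql/data/userdata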

deepblue868 commented 7 months ago

For a new install, adding this to the spec fixed it for me; it was supposed to be the default in the previous version: postgres_data_path: /var/lib/postgresql/data/pgdata. The install went through, but postgres still uses /var/lib/pgsql/data/userdata, which is not on the PV.

kurokobo commented 7 months ago

@TheRealHaoLiu Sorry for my delayed response.

where should that modification be made? through a root user init container or is there something that could be done when setting up the PV

Since the images under sclorg are mostly maintained by Red Hat, I think Red Hat should have best practices on this matter, rather than me 😞 Anyway, as @fosterseth is trying, using an init container that runs as root is a possible solution.

Another well-known non-root PSQL implementation is Bitnami by VMware, which has almost the same restriction:

NOTE: As this is a non-root container, the mounted files and directories must have the proper permissions for the UID 1001. https://hub.docker.com/r/bitnami/postgresql/

In their chart for this PSQL, there are parameters to control an initContainer that invokes chown / mkdir / chmod. If this is enabled, the PSQL pod gets an initContainer with runAsUser: 0 by default.

$ helm install bitnami/postgresql --generate-name --set volumePermissions.enabled=true
...

$ kubectl get statefulset postgresql-1710598237 -o yaml
...
      initContainers:
      - command:
        - /bin/sh
        - -ec
        - |
          chown 1001:1001 /bitnami/postgresql
          mkdir -p /bitnami/postgresql/data
          chmod 700 /bitnami/postgresql/data
          find /bitnami/postgresql -mindepth 1 -maxdepth 1 -not -name "conf" -not -name ".snapshot" -not -name "lost+found" | \
            xargs -r chown -R 1001:1001
          chmod -R 777 /dev/shm
        image: docker.io/bitnami/os-shell:12-debian-12-r16
        imagePullPolicy: IfNotPresent
        name: init-chmod-data
        resources: {}
        securityContext:
          runAsGroup: 0
          runAsNonRoot: false
          runAsUser: 0
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: empty-dir
          subPath: tmp-dir
        - mountPath: /bitnami/postgresql
          name: data
        - mountPath: /dev/shm
          name: dshm
...

There are also related docs by Bitnami.

craph commented 7 months ago

How can we solve this kind of issue when the default storage class is Longhorn? 🤔

kurokobo commented 7 months ago

@craph Try the workaround https://github.com/ansible/awx-operator/issues/1770#issuecomment-2000166172 by @fosterseth, or deploy a temporary working pod that mounts the same PVC used by the PSQL for AWX and fix the permissions there.
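
A minimal sketch of such a temporary pod (claim name taken from the awx-demo deployment discussed below; adjust namespace and claim to your own). Because the postgres pod mounts the volume with subPath: data, the directory to fix lives under /pvc/data:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-fixer
  namespace: awx
spec:
  containers:
  - name: fixer
    image: busybox
    command: ["tail", "-f", "/dev/null"]
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount
  volumes:
  - name: pvc-mount
    persistentVolumeClaim:
      claimName: postgres-15-awx-demo-postgres-15-0

Then: kubectl -n awx exec -it pvc-fixer -- chown -R 26:26 /pvc/data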

craph commented 7 months ago

Hi @kurokobo ,

Thank you very much for the update

@craph Try the workaround #1770 (comment) by @fosterseth, or deploy a temporary working pod that mounts the same PVC used by the PSQL for AWX and fix the permissions there.

I have just created a new temporary pod and changed the permissions on the data as requested.

But it looks like the previous data hasn't been migrated. I can't log in to AWX anymore.

I can see a job awx-demo-migration and a pod awx-demo-migration-24.0.0 in the Completed state, but I still can't log in.

And the old StatefulSet for postgres 13 doesn't exist anymore, but I still have the old PVC for postgres 13.

craph commented 7 months ago

@kurokobo here is the log of the migration. But now I can't log in to my AWX instance:

Operations to perform:
  Apply all migrations: auth, conf, contenttypes, dab_resource_registry, main, oauth2_provider, sessions, sites, social_django, sso
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0001_initial... OK
  Applying main.0001_initial... OK
  Applying main.0002_squashed_v300_release... OK
  Applying main.0003_squashed_v300_v303_updates... OK
  Applying main.0004_squashed_v310_release... OK
  Applying conf.0001_initial... OK
  Applying conf.0002_v310_copy_tower_settings... OK
  Applying main.0005_squashed_v310_v313_updates... OK
  Applying main.0006_v320_release... OK
  Applying main.0007_v320_data_migrations... OK
  Applying main.0008_v320_drop_v1_credential_fields... OK
  Applying main.0009_v322_add_setting_field_for_activity_stream... OK
  Applying main.0010_v322_add_ovirt4_tower_inventory... OK
  Applying main.0011_v322_encrypt_survey_passwords... OK
  Applying main.0012_v322_update_cred_types... OK
  Applying main.0013_v330_multi_credential... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying auth.0009_alter_user_last_name_max_length... OK
  Applying auth.0010_alter_group_name_max_length... OK
  Applying auth.0011_update_proxy_permissions... OK
  Applying auth.0012_alter_user_first_name_max_length... OK
  Applying conf.0003_v310_JSONField_changes... OK
  Applying conf.0004_v320_reencrypt... OK
  Applying conf.0005_v330_rename_two_session_settings... OK
  Applying conf.0006_v331_ldap_group_type... OK
  Applying conf.0007_v380_rename_more_settings... OK
  Applying conf.0008_subscriptions... OK
  Applying conf.0009_rename_proot_settings... OK
  Applying conf.0010_change_to_JSONField... OK
  Applying dab_resource_registry.0001_initial... OK
  Applying dab_resource_registry.0002_remove_resource_id... OK
  Applying dab_resource_registry.0003_alter_resource_object_id... OK
  Applying sessions.0001_initial... OK
  Applying main.0014_v330_saved_launchtime_configs... OK
  Applying main.0015_v330_blank_start_args... OK
  Applying main.0016_v330_non_blank_workflow... OK
  Applying main.0017_v330_move_deprecated_stdout... OK
  Applying main.0018_v330_add_additional_stdout_events... OK
  Applying main.0019_v330_custom_virtualenv... OK
  Applying main.0020_v330_instancegroup_policies... OK
  Applying main.0021_v330_declare_new_rbac_roles... OK
  Applying main.0022_v330_create_new_rbac_roles... OK
  Applying main.0023_v330_inventory_multicred... OK
  Applying main.0024_v330_create_user_session_membership... OK
  Applying main.0025_v330_add_oauth_activity_stream_registrar... OK
  Applying oauth2_provider.0001_initial... OK
  Applying oauth2_provider.0002_auto_20190406_1805... OK
  Applying oauth2_provider.0003_auto_20201211_1314... OK
  Applying oauth2_provider.0004_auto_20200902_2022... OK
  Applying oauth2_provider.0005_auto_20211222_2352... OK
  Applying main.0026_v330_delete_authtoken... OK
  Applying main.0027_v330_emitted_events... OK
  Applying main.0028_v330_add_tower_verify... OK
  Applying main.0030_v330_modify_application... OK
  Applying main.0031_v330_encrypt_oauth2_secret... OK
  Applying main.0032_v330_polymorphic_delete... OK
  Applying main.0033_v330_oauth_help_text... OK
2024-03-18 12:37:32,320 INFO     [-] rbac_migrations Computing role roots..
2024-03-18 12:37:32,321 INFO     [-] rbac_migrations Found 0 roots in 0.000113 seconds, rebuilding ancestry map
2024-03-18 12:37:32,321 INFO     [-] rbac_migrations Rebuild ancestors completed in 0.000004 seconds
2024-03-18 12:37:32,321 INFO     [-] rbac_migrations Done.
  Applying main.0034_v330_delete_user_role... OK
  Applying main.0035_v330_more_oauth2_help_text... OK
  Applying main.0036_v330_credtype_remove_become_methods... OK
  Applying main.0037_v330_remove_legacy_fact_cleanup... OK
  Applying main.0038_v330_add_deleted_activitystream_actor... OK
  Applying main.0039_v330_custom_venv_help_text... OK
  Applying main.0040_v330_unifiedjob_controller_node... OK
  Applying main.0041_v330_update_oauth_refreshtoken... OK
2024-03-18 12:37:33,605 INFO     [-] rbac_migrations Computing role roots..
2024-03-18 12:37:33,606 INFO     [-] rbac_migrations Found 0 roots in 0.000108 seconds, rebuilding ancestry map
2024-03-18 12:37:33,606 INFO     [-] rbac_migrations Rebuild ancestors completed in 0.000004 seconds
2024-03-18 12:37:33,606 INFO     [-] rbac_migrations Done.
  Applying main.0042_v330_org_member_role_deparent... OK
  Applying main.0043_v330_oauth2accesstoken_modified... OK
  Applying main.0044_v330_add_inventory_update_inventory... OK
  Applying main.0045_v330_instance_managed_by_policy... OK
  Applying main.0046_v330_remove_client_credentials_grant... OK
  Applying main.0047_v330_activitystream_instance... OK
  Applying main.0048_v330_django_created_modified_by_model_name... OK
  Applying main.0049_v330_validate_instance_capacity_adjustment... OK
  Applying main.0050_v340_drop_celery_tables... OK
  Applying main.0051_v340_job_slicing... OK
  Applying main.0052_v340_remove_project_scm_delete_on_next_update... OK
  Applying main.0053_v340_workflow_inventory... OK
  Applying main.0054_v340_workflow_convergence... OK
  Applying main.0055_v340_add_grafana_notification... OK
  Applying main.0056_v350_custom_venv_history... OK
  Applying main.0057_v350_remove_become_method_type... OK
  Applying main.0058_v350_remove_limit_limit... OK
  Applying main.0059_v350_remove_adhoc_limit... OK
  Applying main.0060_v350_update_schedule_uniqueness_constraint... OK
  Applying main.0061_v350_track_native_credentialtype_source... OK
  Applying main.0062_v350_new_playbook_stats... OK
  Applying main.0063_v350_org_host_limits... OK
  Applying main.0064_v350_analytics_state... OK
  Applying main.0065_v350_index_job_status... OK
  Applying main.0066_v350_inventorysource_custom_virtualenv... OK
  Applying main.0067_v350_credential_plugins... OK
  Applying main.0068_v350_index_event_created... OK
  Applying main.0069_v350_generate_unique_install_uuid... OK
  Applying main.0070_v350_gce_instance_id... OK
  Applying main.0071_v350_remove_system_tracking... OK
  Applying main.0072_v350_deprecate_fields... OK
  Applying main.0073_v360_create_instance_group_m2m... OK
  Applying main.0074_v360_migrate_instance_group_relations... OK
  Applying main.0075_v360_remove_old_instance_group_relations... OK
  Applying main.0076_v360_add_new_instance_group_relations... OK
  Applying main.0077_v360_add_default_orderings... OK
  Applying main.0078_v360_clear_sessions_tokens_jt... OK
  Applying main.0079_v360_rm_implicit_oauth2_apps... OK
  Applying main.0080_v360_replace_job_origin... OK
  Applying main.0081_v360_notify_on_start... OK
  Applying main.0082_v360_webhook_http_method... OK
  Applying main.0083_v360_job_branch_override... OK
  Applying main.0084_v360_token_description... OK
  Applying main.0085_v360_add_notificationtemplate_messages... OK
  Applying main.0086_v360_workflow_approval... OK
  Applying main.0087_v360_update_credential_injector_help_text... OK
  Applying main.0088_v360_dashboard_optimizations... OK
  Applying main.0089_v360_new_job_event_types... OK
  Applying main.0090_v360_WFJT_prompts... OK
  Applying main.0091_v360_approval_node_notifications... OK
  Applying main.0092_v360_webhook_mixin... OK
  Applying main.0093_v360_personal_access_tokens... OK
  Applying main.0094_v360_webhook_mixin2... OK
  Applying main.0095_v360_increase_instance_version_length... OK
  Applying main.0096_v360_container_groups... OK
  Applying main.0097_v360_workflowapproval_approved_or_denied_by... OK
  Applying main.0098_v360_rename_cyberark_aim_credential_type... OK
  Applying main.0099_v361_license_cleanup... OK
  Applying main.0100_v370_projectupdate_job_tags... OK
  Applying main.0101_v370_generate_new_uuids_for_iso_nodes... OK
  Applying main.0102_v370_unifiedjob_canceled... OK
  Applying main.0103_v370_remove_computed_fields... OK
  Applying main.0104_v370_cleanup_old_scan_jts... OK
  Applying main.0105_v370_remove_jobevent_parent_and_hosts... OK
  Applying main.0106_v370_remove_inventory_groups_with_active_failures... OK
  Applying main.0107_v370_workflow_convergence_api_toggle... OK
  Applying main.0108_v370_unifiedjob_dependencies_processed... OK
2024-03-18 12:37:54,433 INFO     [-] rbac_migrations Unified organization migration completed in 0.0183 seconds
2024-03-18 12:37:54,452 INFO     [-] rbac_migrations Unified organization migration completed in 0.0184 seconds
2024-03-18 12:37:55,391 INFO     [-] rbac_migrations Rebuild parentage completed in 0.003237 seconds
  Applying main.0109_v370_job_template_organization_field... OK
  Applying main.0110_v370_instance_ip_address... OK
  Applying main.0111_v370_delete_channelgroup... OK
  Applying main.0112_v370_workflow_node_identifier... OK
  Applying main.0113_v370_event_bigint... OK
  Applying main.0114_v370_remove_deprecated_manual_inventory_sources... OK
  Applying main.0115_v370_schedule_set_null... OK
  Applying main.0116_v400_remove_hipchat_notifications... OK
  Applying main.0117_v400_remove_cloudforms_inventory... OK
  Applying main.0118_add_remote_archive_scm_type... OK
  Applying main.0119_inventory_plugins... OK
  Applying main.0120_galaxy_credentials... OK
  Applying main.0121_delete_toweranalyticsstate... OK
  Applying main.0122_really_remove_cloudforms_inventory... OK
  Applying main.0123_drop_hg_support... OK
  Applying main.0124_execution_environments... OK
  Applying main.0125_more_ee_modeling_changes... OK
  Applying main.0126_executionenvironment_container_options... OK
  Applying main.0127_reset_pod_spec_override... OK
  Applying main.0128_organiaztion_read_roles_ee_admin... OK
  Applying main.0129_unifiedjob_installed_collections... OK
  Applying main.0130_ee_polymorphic_set_null... OK
  Applying main.0131_undo_org_polymorphic_ee... OK
  Applying main.0132_instancegroup_is_container_group... OK
  Applying main.0133_centrify_vault_credtype... OK
  Applying main.0134_unifiedjob_ansible_version... OK
  Applying main.0135_schedule_sort_fallback_to_id... OK
  Applying main.0136_scm_track_submodules... OK
  Applying main.0137_custom_inventory_scripts_removal_data... OK
  Applying main.0138_custom_inventory_scripts_removal... OK
  Applying main.0139_isolated_removal... OK
  Applying main.0140_rename... OK
  Applying main.0141_remove_isolated_instances... OK
  Applying main.0142_update_ee_image_field_description... OK
  Applying main.0143_hostmetric... OK
  Applying main.0144_event_partitions... OK
  Applying main.0145_deregister_managed_ee_objs... OK
  Applying main.0146_add_insights_inventory... OK
  Applying main.0147_validate_ee_image_field... OK
  Applying main.0148_unifiedjob_receptor_unit_id... OK
  Applying main.0149_remove_inventory_insights_credential... OK
  Applying main.0150_rename_inv_sources_inv_updates... OK
  Applying main.0151_rename_managed_by_tower... OK
  Applying main.0152_instance_node_type... OK
  Applying main.0153_instance_last_seen... OK
  Applying main.0154_set_default_uuid... OK
  Applying main.0155_improved_health_check... OK
  Applying main.0156_capture_mesh_topology... OK
  Applying main.0157_inventory_labels... OK
  Applying main.0158_make_instance_cpu_decimal... OK
  Applying main.0159_deprecate_inventory_source_UoPU_field... OK
  Applying main.0160_alter_schedule_rrule... OK
  Applying main.0161_unifiedjob_host_status_counts... OK
  Applying main.0162_alter_unifiedjob_dependent_jobs... OK
  Applying main.0163_convert_job_tags_to_textfield... OK
  Applying main.0164_remove_inventorysource_update_on_project_update... OK
  Applying main.0165_task_manager_refactor... OK
  Applying main.0166_alter_jobevent_host... OK
  Applying main.0167_project_signature_validation_credential... OK
  Applying main.0168_inventoryupdate_scm_revision... OK
  Applying main.0169_jt_prompt_everything_on_launch... OK
  Applying main.0170_node_and_link_state... OK
  Applying main.0171_add_health_check_started... OK
  Applying main.0172_prevent_instance_fallback... OK
  Applying main.0173_instancegroup_max_limits... OK
  Applying main.0174_ensure_org_ee_admin_roles... OK
  Applying main.0175_workflowjob_is_bulk_job... OK
  Applying main.0176_inventorysource_scm_branch... OK
  Applying main.0177_instance_group_role_addition... OK
2024-03-18 12:38:18,686 INFO     [-] awx.main.migrations Initiated migration from Org admin to use role
  Applying main.0178_instance_group_admin_migration... OK
  Applying main.0179_change_cyberark_plugin_names... OK
  Applying main.0180_add_hostmetric_fields... OK
  Applying main.0181_hostmetricsummarymonthly... OK
  Applying main.0182_constructed_inventory... OK
  Applying main.0183_pre_django_upgrade... OK
  Applying main.0184_django_indexes... OK
  Applying main.0185_move_JSONBlob_to_JSONField... OK
  Applying main.0186_drop_django_taggit... OK
  Applying main.0187_hop_nodes... OK
  Applying main.0188_add_bitbucket_dc_webhook... OK
  Applying main.0189_inbound_hop_nodes... OK
  Applying main.0190_alter_inventorysource_source_and_more... OK
  Applying sites.0001_initial... OK
  Applying sites.0002_alter_domain_unique... OK
  Applying social_django.0001_initial... OK
  Applying social_django.0002_add_related_name... OK
  Applying social_django.0003_alter_email_max_length... OK
  Applying social_django.0004_auto_20160423_0400... OK
  Applying social_django.0005_auto_20160727_2333... OK
  Applying social_django.0006_partial... OK
  Applying social_django.0007_code_timestamp... OK
  Applying social_django.0008_partial_timestamp... OK
  Applying social_django.0009_auto_20191118_0520... OK
  Applying social_django.0010_uid_db_index... OK
  Applying social_django.0011_alter_id_fields... OK
  Applying social_django.0012_usersocialauth_extra_data_new... OK
  Applying social_django.0013_migrate_extra_data... OK
  Applying social_django.0014_remove_usersocialauth_extra_data... OK
  Applying social_django.0015_rename_extra_data_new_usersocialauth_extra_data... OK
  Applying sso.0001_initial... OK
  Applying sso.0002_expand_provider_options... OK
  Applying sso.0003_convert_saml_string_to_list... OK
craph commented 7 months ago

No more data 😢 and the password has been reinitialized... Why hasn't the data been migrated even though the log says OK?

kurokobo commented 7 months ago

@craph Could you provide:

kubectl -n <namespace> get pod
kubectl -n <namespace> get pod <psql pod> -o yaml
craph commented 7 months ago

@kurokobo,

> kubectl -n awx get pod
NAME                                               READY   STATUS      RESTARTS   AGE
awx-demo-migration-24.0.0-rc45z                    0/1     Completed   0          41m
awx-demo-postgres-15-0                             1/1     Running     0          42m
awx-demo-task-676cbb9bb5-wm6db                     4/4     Running     0          42m
awx-demo-web-7cfb6d6d8-9f4gs                       3/3     Running     0          42m
awx-operator-controller-manager-865d646cd8-k7ldz   2/2     Running     0          3d5h

and

> kubectl -n awx get pod awx-demo-postgres-15-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cattle.io/timestamp: "2024-03-18T12:35:22Z"
    cni.projectcalico.org/containerID: c8c74a832a87163b6835a4b62528e56c94e573a778c4494770c94ddc5b825cca
    cni.projectcalico.org/podIP: 10.42.42.169/32
    cni.projectcalico.org/podIPs: 10.42.42.169/32
  creationTimestamp: "2024-03-18T12:35:45Z"
  generateName: awx-demo-postgres-15-
  labels:
    app.kubernetes.io/component: database
    app.kubernetes.io/instance: postgres-15-awx-demo
    app.kubernetes.io/managed-by: awx-operator
    app.kubernetes.io/name: postgres-15
    app.kubernetes.io/part-of: awx-demo
    controller-revision-hash: awx-demo-postgres-15-7fb855c556
    statefulset.kubernetes.io/pod-name: awx-demo-postgres-15-0
  name: awx-demo-postgres-15-0
  namespace: awx
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: awx-demo-postgres-15
    uid: 335a3c7f-db87-470c-ba3f-04a3e43c1368
  resourceVersion: "218222804"
  uid: 78ddf50b-d558-4105-ab0a-3ae409a0c610
spec:
  containers:
  - env:
    - name: POSTGRESQL_DATABASE
      valueFrom:
        secretKeyRef:
          key: database
          name: awx-demo-postgres-configuration
    - name: POSTGRESQL_USER
      valueFrom:
        secretKeyRef:
          key: username
          name: awx-demo-postgres-configuration
    - name: POSTGRESQL_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: awx-demo-postgres-configuration
    - name: POSTGRES_DB
      valueFrom:
        secretKeyRef:
          key: database
          name: awx-demo-postgres-configuration
    - name: POSTGRES_USER
      valueFrom:
        secretKeyRef:
          key: username
          name: awx-demo-postgres-configuration
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: awx-demo-postgres-configuration
    - name: PGDATA
      value: /var/lib/pgsql/data/pgdata
    - name: POSTGRES_INITDB_ARGS
      value: --auth-host=scram-sha-256
    - name: POSTGRES_HOST_AUTH_METHOD
      value: scram-sha-256
    image: quay.io/sclorg/postgresql-15-c9s:latest
    imagePullPolicy: IfNotPresent
    name: postgres
    ports:
    - containerPort: 5432
      name: postgres-15
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 64Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/pgsql/data
      name: postgres-15
      subPath: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-gr59c
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: awx-demo-postgres-15-0
  nodeName: myk8sw1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: awx-demo
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: postgres-15
    persistentVolumeClaim:
      claimName: postgres-15-awx-demo-postgres-15-0
  - name: kube-api-access-gr59c
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:45Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:57Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:57Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:45Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://bdccd6a0896147020f969a6f3830a03c10e78772edb75457d7b8b2cb2b0b34a3
    image: quay.io/sclorg/postgresql-15-c9s:latest
    imageID: quay.io/sclorg/postgresql-15-c9s@sha256:0a88d11f9d15cf10014c25a8dab4be33a1f9b956f4ab1fbb51522ab10c70bdca
    lastState: {}
    name: postgres
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-03-18T12:35:57Z"
  hostIP: 10.80.98.196
  phase: Running
  podIP: 10.42.42.169
  podIPs:
  - ip: 10.42.42.169
  qosClass: Burstable
  startTime: "2024-03-18T12:35:45Z"
craph commented 7 months ago

I still have the old postgres 13 PVC. Is it possible to redeploy awx-operator version 2.12.2 so it uses the old PVC?

> kubectl get pvc -n awx
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgres-13-awx-demo-postgres-13-0   Bound    pvc-eac8a8d5-6d74-4d37-819b-ee89154cd60a   8Gi        RWO            longhorn       213d
postgres-15-awx-demo-postgres-15-0   Bound    pvc-1ebb7d11-d2f5-4f50-9bc6-ca2e045f6031   8Gi        RWO            longhorn       3d6h
kurokobo commented 7 months ago

@craph It seems the mounted path is correct, so indeed the DB was not migrated and was initialized as a fresh install. The Operator repeats reconciliation loops until the playbook completes without failed tasks, but if the playbook runs without the old StatefulSet present, the data migration will not occur. I have not been able to check the implementation in detail, but it may be a situation where migration is deemed unnecessary in the next loop, depending on which task failed (not sure).

Anyway, I recommend that you first get a backup of the postgres 13 PVC in some way: pg_dump, or just deploy a working pod, make a tar.gz, and copy it to your machine with kubectl cp.

I assume that just deploying AWX with 2.12.2 will reuse the old PVC, but if not, you should be able to get the data back by temporarily setting kubectl scale ... --replicas=0 for the Operator, Task, and Web deployments after a fresh deployment, then restoring PSQL and setting the replicas back to 1.
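
A rough sketch of the tar.gz route, assuming a working pod (called pvc-backup here) that mounts the old postgres-13 PVC at /pvc and has tar available:

# archive the old data directory inside the working pod
kubectl -n awx exec pvc-backup -- tar czf /tmp/postgres-13-data.tar.gz -C /pvc .
# copy the archive out of the cluster
kubectl cp awx/pvc-backup:/tmp/postgres-13-data.tar.gz ./postgres-13-data.tar.gz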

craph commented 7 months ago

@kurokobo how can I do a pg_dump of the old postgres 13 PVC? Any advice?

kurokobo commented 7 months ago

@craph I think that can be done by deploying a PSQL pod that mounts the existing PVC 🤔 Since the situation is complicated, a safe method is better to avoid data loss due to unforeseen circumstances. Sorry, I can't suggest a wonderful magical way to do this.

Tfinn92 commented 7 months ago

@kurokobo how can I do a pg_dump of the old postgres 13 PVC? Any advice?

Here's my internal doc on how to do it (it might be overkill, but I like to be extra sure). I do believe you'll need the postgres-13 container running in some fashion for this to work, however. Regardless, here are my notes:

Dump (backup) the database:
Make sure you have a backup of the awx secrets stored somewhere safe.

The three secrets you will want a backup are as follows:

awx-postgres-configuration

awx-admin-password

awx-app-credentials

To get started, exec into your pod. (You may need to specify the namespace if your context isn’t set)

 kubectl exec -it awx-postgres-13-0 -- sh
Take a sql dump of the DB

pg_dump -U awx awx > /root/awx.sql
Exit the pod

ctrl-d until you’re back to the OS

Copy the file out of the pod (You may need to specify the namespace if your context isn’t set)

kubectl cp awx-postgres-13-0:/root/awx.sql /root/awx/awx.sql
Congrats, you have successfully taken a backup of the AWX Database

And here's my note on restoring that dump into a new pod (again, it might be overkill). We also use Rancher to make our lives easier, so if that doesn't apply to you, you might need to modify these instructions a little bit.

Make sure helm is added to the cluster

[Helm | Installing Helm](https://helm.sh/docs/intro/install/) 

helm repo add awx-operator https://ansible.github.io/awx-operator/ 

helm repo update

Make sure you have the awx_helm_values.yml file stored in GitLab

helm install awx-operator awx-operator/awx-operator -f awx_helm_values.yml -n awx 

Wait for everything to come up (This might take 3 minutes)

Scale it all down

kubectl scale deployment awx-operator-controller-manager --replicas=0
kubectl scale deployment awx-task --replicas=0
kubectl scale deployment awx-web --replicas=0
Wait for everything to go back down except the Postgres pod

Copy the sql file into the pod. (You may need to specify the namespace if your context isn’t set)

 kubectl cp awx.sql awx-postgres-13-0:/root/awx.sql

Exec into the pod (You may need to specify the namespace if your context isn’t set)

 kubectl exec -it awx-postgres-13-0 -- sh 

WHILE IN POD

Login to the postgres database and drop the existing awx database:

psql -U awx -d postgres

select pg_terminate_backend(pid) from pg_stat_activity where datname='awx' ;

drop database awx ;

create database awx ;

alter database awx owner to awx ;

\q

Restore the database:

psql -U awx -d awx -f awx.sql

Once the restore is complete, enter the awx db and set the password for the awx user

psql -U awx -d awx

\password

ENTER THE PASSWORD FROM THE SECRET

Exit pod with a bunch of ctrl-d until you hit the OS again

Replace your secrets in Rancher

awx-postgres-configuration

awx-admin-password

awx-app-credentials

For good measure, kill the Postgres pod

kubectl delete pods awx-postgres-13-0 -n awx 

Finally, scale your operator, web, and tasks pods back up.

kubectl scale deployment awx-operator-controller-manager --replicas=1
kubectl scale deployment awx-task --replicas=3
kubectl scale deployment awx-web --replicas=3

Bonus Step

If you try to login with the admin password and it refuses, you might need to exec into a web pod and run the following:

awx-manage changepassword admin
craph commented 7 months ago

Make sure you have a backup of the awx secrets stored somewhere safe.

Thank you very much for the update @Tfinn92, but I no longer have the awx-postgres-13-0 pod running: I fixed the permission issue for postgres 15, and the pod for postgres 13 doesn't exist anymore :/

Tfinn92 commented 7 months ago

Thank you very much for the update @Tfinn92, but I no longer have the awx-postgres-13-0 pod running: I fixed the permission issue for postgres 15, and the pod for postgres 13 doesn't exist anymore :/

As long as the PV still exists, you should be able to uninstall awx-operator (do not delete the namespace) and install the old one over it. It should reattach the PV to the postgres 13 pod. That being said, I think you're still going to run into issues with the web/task pods coming up correctly. I've noticed that with the change to pg15, the secrets change to reflect it.

For instance, the secret awx-app-credentials changes the line host: awx-postgres-13 to 15. The same goes for the secret awx-postgres-configuration, so you'll need to change both of those back.
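
For awx-postgres-configuration, which stores the host as a plain key, something along these lines could work (hedged sketch; verify the keys in your own secret first, and note that awx-app-credentials embeds the host inside larger settings blobs, so it is easier to adjust with kubectl edit):

# check the current value
kubectl -n awx get secret awx-postgres-configuration -o jsonpath='{.data.host}' | base64 --decode; echo
# point it back at the postgres 13 service
kubectl -n awx patch secret awx-postgres-configuration --type merge -p '{"stringData":{"host":"awx-postgres-13"}}'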

craph commented 7 months ago

Thank you very much @Tfinn92 for your documentation. Did you try to use https://github.com/ansible/awx-operator/tree/devel/roles/backup ?

Tfinn92 commented 7 months ago

I did not, no. I need my backups for DR situations, and having the backup living in the same namespace on the same cluster as the instance doesn't work in that scenario. Sorry :(

nan0viol3t commented 7 months ago

Is there any ongoing patch regarding this issue? I have an AWX deployment with kustomize, and of course the situation is the same: "mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied". It's also the same for a new installation. I can't simply move back to previous versions, as I have a remote Postgres DB instance where I already did a manual upgrade to 15.x before upgrading the operator. It seems to be a bug, as the permissions cannot simply be corrected and a "pure" AWX operator deployment will fail. Should I look for a manual workaround, or will this permissions issue be addressed eventually? With respect to the whole AWX dev team.

Rory-Z commented 7 months ago

I think one kind of solution is to add a securityContext to the pod spec, like this:

kind: Pod
spec:
  securityContext:
    runAsUser: 26
    runAsGroup: 26
    fsGroup: 26
    fsGroupChangePolicy: Always
    supplementalGroups:
    - 26

but I don't know how to add this to the awx postgres pod through the AWX CR

rooftopcellist commented 7 months ago

Could you give this PR a try and see if it solves your issue?

RaceFPV commented 7 months ago

You can recover and roll back to version 2.12.2 if your postgresql 13 StatefulSet is still online: after changing the version back in Helm, edit the secret awx-postgres-configuration, changing host: awx-postgres-15 back to host: awx-postgres-13. You may need to restart your pods after doing so.

kzinas-adv commented 7 months ago

A fresh install of awx-operator 2.14.0 still has this issue.

rooftopcellist commented 7 months ago

Was anyone able to test the PR I linked?

I am unable to reproduce this issue on Openshift and minikube. Could someone who is seeing this issue please share their k8s cluster type, cluster version, awx-operator version, storage class, and cloud provider used if applicable?

wonkyooh commented 7 months ago

k8s cluster type: on-prem

cluster version:

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:48:26Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:42:11Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}

awx-operator version: quay.io/ansible/awx-operator:2.13.1
storage class: rook-cephfs
cloud provider: N/A

I solved this issue by adding

postgres_security_context_settings:
  fsGroup: 26

option to AWX CR (cc. @Rory-Z)

If you have already deployed it, try editing the postgres StatefulSet and adding fsGroup: 26 to its securityContext.
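
For reference, a minimal sketch of where that setting sits in the AWX custom resource (instance name illustrative; whether fsGroup is actually honored may depend on the operator version and volume type, see the follow-up comments):

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  postgres_security_context_settings:
    fsGroup: 26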

kurokobo commented 7 months ago

The default permissions and owners of PVs and their subPaths depend on the storage provisioner implementation for the storage class. Also, securityContext.fsGroup may not be effective in all environments, as it is ignored for some types of volumes, such as hostPath and nfs.

@rooftopcellist The default storage provisioner for minikube creates directories with mode 777 for PVCs, so this issue can't be reproduced there. It should be possible to reproduce it if you explicitly configure hostPath on minikube.

Following my guide while skipping the chown and chmod for /data/postgres-15/data can also reproduce this.

I've made minimal tests on #1799 and I can confirm that once my comments on #1799 are resolved, it appears to work as expected.

hhk7734 commented 7 months ago

@wonkyooh

security_context_settings is for the web and task PodSecurityContext (pod.spec.securityContext), but postgres_security_context_settings is for the SecurityContext of the postgresql container (pod.spec.containers[].securityContext). This confuses users.

When I added postgres_security_context_settings: {"fsGroup": 26} to the AWX CR, it was ignored.

craph commented 7 months ago

Was anyone able to test the PR I linked?

I am unable to reproduce this issue on Openshift and minikube. Could someone who is seeing this issue please share their k8s cluster type, cluster version, awx-operator version, storage class, and cloud provider used if applicable?

@rooftopcellist you have all the details here too if needed : https://github.com/ansible/awx-operator/issues/1775#issuecomment-1999830535

AWX Operator version 2.13.1

AWX version 24.0.0 Kubernetes platform kubernetes selfhosted with Rancher

Kubernetes/Platform version v1.25.16+rke2r1

Storage Class Longhorn

Upgrade from 2.12.2 to 2.13.1

kennethacurtis commented 7 months ago

I'm also getting this issue when going from 2.10.0 to 2.14.0. I'm using AKS.

@rooftopcellist here are my details

storage class (default in this case means Azure Disk):

$ kubectl get pvc postgres-13-awx-postgres-13-0 -o jsonpath='{.spec.storageClassName}' -n awx
default

When doing an upgrade, the postgres 15 pod crashes:

kubectl get pods -n awx
NAME                                              READY   STATUS             RESTARTS      AGE
awx-operator-controller-manager-cb46cc5dd-qv5db   2/2     Running            0             13m
awx-postgres-13-0                                 1/1     Running            0             3d23h
awx-postgres-15-0                                 0/1     CrashLoopBackOff   7 (45s ago)   12m

Logs in the postgres 15 pod:

kubectl logs awx-postgres-15-0 -n awx
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

Here are my deployment details. Kustomization file (when trying to upgrade from 2.10.0 to 2.14.0):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.14.0
  - awx.yml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.14.0

# Specify a custom namespace in which to install AWX
namespace: awx

And here's my awx.yml file. I'm using the AGIC:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  labels:
    app: awx
spec:
  service_type: clusterip
  ingress_type: ingress
  ingress_path: /
  ingress_path_type: Exact
  ingress_tls_secret: tlssecret
  hostname: awx.example.org
  projects_storage_size: 500Gi
  ingress_annotations: |
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: tlssecret
    appgw.ingress.kubernetes.io/health-probe-path: /api/v2/ping
    appgw.ingress.kubernetes.io/backend-protocol: http
    appgw.ingress.kubernetes.io/backend-hostname: awx.example.org

---
apiVersion: v1
kind: Service
metadata:
  name: awx-service
spec:
  selector:
    app: awx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8052

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  labels:
    app: awx
  name: awx-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: tlssecret
    appgw.ingress.kubernetes.io/health-probe-path: /api/v2/ping
    appgw.ingress.kubernetes.io/backend-protocol: http
    appgw.ingress.kubernetes.io/backend-hostname: awx.example.org
spec:
  rules:
    - host: awx.example.org
      http:
        paths:
          - path: /
            backend:
              service:
                name: awx
                port:
                  number: 80
            pathType: Exact

One thing I did notice is that when the pvc is created for postgres 15, it doesn't allocate the correct amount of storage specified for projects_storage_size, not sure if that is related or not.

kubectl get pvc -n awx
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgres-13-awx-postgres-13-0   Bound    pvc-a69fc0f3-929b-4ba8-8c72-8ca1ad15b8af   500Gi      RWO            default        166d
postgres-15-awx-postgres-15-0   Bound    pvc-114e9ae9-9376-496c-b59d-edbe8b5ce4d5   8Gi        RWO            default        20m

I was able to recover by deleting AWX , the postgres 15 pod and pvc, and redeploying with operator 2.10

rooftopcellist commented 7 months ago

Please weigh in on which PR approach you like better:

spec:
  postgres_data_volume_init: true
  init_postgres_extra_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data 
kurokobo commented 7 months ago

+1 for postgres_data_volume_init

craph commented 7 months ago

Please weigh in on which PR approach you like better:

spec:
  postgres_data_volume_init: true
  init_postgres_extra_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data 

👍 PR #1805 will provide a better user experience I think

rooftopcellist commented 7 months ago

Thanks for weighing in, everyone, and for the review of the PR. There is one more potential issue to resolve because of the removal of the postgres_init_container_resource_requirements parameter. More details are on the PR.

rooftopcellist commented 7 months ago

This was resolved by https://github.com/ansible/awx-operator/pull/1805, which just merged.

daneov commented 7 months ago

Awesome work. I'm hitting this as well. I'm using Kustomize, but referring to the commit sha doesn't seem to change anything.

Any tips on how to include this fix without manual fiddling in the cluster?

fubz commented 7 months ago

How does one fix their environment if they already went to version 2.12? I waited for 2.15 in the hope that the Operator would fix the issue; however, the environment is currently down due to this issue and I am unsure how to correct it. What steps need to be done to correct the broken environment? I see some mentions of init_postgres_extra_commands but am unsure of where the values for this parameter need to be placed.

miki-akamai commented 7 months ago

I had the same issue; you need to spawn the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
  namespace: awx-prod
spec:
  containers:
  - image: busybox
    name: pvc-inspector
    command: ["tail"]
    args: ["-f", "/dev/null"]
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount
  volumes:
  - name: pvc-mount
    persistentVolumeClaim:
      claimName: postgres-15-awx-postgres-15-0

shell into it and run chown -R 26:26 /pvc/data/

Later on you will also need to update CRDs by kubectl apply -n 'awx-prod' --server-side -k "github.com/ansible/awx-operator/config/crd?ref=2.15.0" --force-conflicts
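
For completeness, the chown can also be run non-interactively against that same inspector pod:

kubectl -n awx-prod exec pvc-inspector -- chown -R 26:26 /pvc/data/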

nan0viol3t commented 6 months ago

Having the same issue with the Postgres 15 pod. While troubleshooting, I accidentally removed the whole namespace (by executing "kustomize delete -k ."). I noticed later, while troubleshooting postgres DB connectivity problems, that kustomize also deletes the namespace itself.

My task pods won't start and the web pod says: "awx.main.utils.encryption Failed to decrypt... ...check that your 'SECRET_KEY' value is correct".

I'm sure that the "awx-app-secret-key" secret was rewritten by the kustomize run, and I don't have a backup of the old secret. I can connect to the postgres DB instance and to the AWX database, but I have no valid awx secret key.

Is there a way to retrieve it from the DB itself, or is it not stored there at all? In other words, is this instance lost by losing the "awx-secret-key"?