@Tfinn92 New PSQL is running as UID:26, so PV has to be writable by this UID.
@TheRealHaoLiu @fosterseth The same consideration applies to backups: UID 26 has to have write permission on the PV to create the backup directory.
Same problem here
@kurokobo Where should that modification be made? Through a root-user init container, or is there something that could be done when setting up the PV?
@kurokobo I think your advice is only valid if you use a PV on local storage. Since I use rook-ceph, I can't set permissions on the filesystem. Note: it was working with 2.13.0.
I had to create the volume, scale down the deployment/statefulset, mount the volume into another pod, and run:
mkdir userdata
chown 26:26 userdata
After that, the pod started and the upgrade continued.
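For anyone who needs to do the same thing, a throwaway pod created with kubectl run can serve as that "other pod" once the postgres statefulset is scaled down. This is only a sketch; the namespace and claim name are assumptions based on the defaults seen later in this thread, so check kubectl get pvc first:

# one-off interactive pod that mounts the Postgres 15 PVC at /pvc
kubectl -n awx run pvc-fix --rm -it --restart=Never --image=busybox \
  --overrides='{"spec":{"containers":[{"name":"pvc-fix","image":"busybox","stdin":true,"tty":true,"command":["sh"],"volumeMounts":[{"name":"data","mountPath":"/pvc"}]}],"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"postgres-15-awx-postgres-15-0"}}]}'
# then, inside the pod, give UID 26 ownership of the data directory:
mkdir -p /pvc/data && chown -R 26:26 /pvc/data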
An init container that runs as root and does the chown?
> @Tfinn92 New PSQL is running as UID:26, so PV has to be writable by this UID.
While this is true, the old postgres 13 container that was deployed by the operator before was using root as its user, so it seems like the devs got used to that freedom and tried applying the same logic in the 15 container, which, as we are seeing, fails.
Heck, even looking in the PG13 container, the permissions expected now wouldn't be possible without the root user:
Obviously the pathing is a little different as well, but I imagine the same principles could be applied to the PG15 container
This issue is not just for updates. I'm trying to start a new AWX instance from scratch and ran into the same problem.
I confirm, this was also a new install for me.
@mooky31 @jyanesancert @Tfinn92 maybe something like that could help?
You can deploy the image quay.io/fosterseth/awx-operator:postgres_init, which has that change.
To use it, add whatever commands you want to your AWX spec, e.g.
init_postgres_extra_commands: |
  sudo touch /var/lib/pgsql/data/foo
  sudo touch /var/lib/pgsql/data/bar
  chown 26:26 /var/lib/pgsql/data/foo
  chown root:root /var/lib/pgsql/data/bar
So in your case, maybe mkdir /var/lib/pgsql/data/userdata and chmod/chown it for user 26.
If that works for you, let me know and we can get this change into devel.
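For reference, a minimal sketch of an AWX spec using that parameter for the userdata case. It assumes the init_postgres_extra_commands parameter from the image above and is untested; adjust the path and ownership to your setup:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
spec:
  # commands run by the postgres init container (per the image above)
  init_postgres_extra_commands: |
    mkdir -p /var/lib/pgsql/data/userdata
    chown 26:26 /var/lib/pgsql/data/userdata
    chmod 700 /var/lib/pgsql/data/userdata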
For a new install, adding this to the spec fixed it for me (it's supposed to be the default from the previous version): postgres_data_path: /var/lib/postgresql/data/pgdata. The install went through, but postgres still uses /var/lib/pgsql/data/userdata, which is not on the PV.
@TheRealHaoLiu Sorry for my delayed response.
> Where should that modification be made? Through a root-user init container, or is there something that could be done when setting up the PV?
Since the images under sclorg are mostly maintained by Red Hat, I think Red Hat should have best practices on this matter as well, rather than me. Anyway, as @fosterseth is trying, using an init container with root is a possible solution.
Another well-known non-root PSQL implementation is Bitnami by VMware, which has almost the same restriction:
> NOTE: As this is a non-root container, the mounted files and directories must have the proper permissions for the UID 1001. https://hub.docker.com/r/bitnami/postgresql/
In their charts for this PSQL, there are params to control an initContainer that invokes chown / mkdir / chmod. If we enable this, PSQL has an initContainer with runAsUser: 0 by default.
$ helm install bitnami/postgresql --generate-name --set volumePermissions.enabled=true
...
$ kubectl get statefulset postgresql-1710598237 -o yaml
...
initContainers:
- command:
- /bin/sh
- -ec
- |
chown 1001:1001 /bitnami/postgresql
mkdir -p /bitnami/postgresql/data
chmod 700 /bitnami/postgresql/data
find /bitnami/postgresql -mindepth 1 -maxdepth 1 -not -name "conf" -not -name ".snapshot" -not -name "lost+found" | \
xargs -r chown -R 1001:1001
chmod -R 777 /dev/shm
image: docker.io/bitnami/os-shell:12-debian-12-r16
imagePullPolicy: IfNotPresent
name: init-chmod-data
resources: {}
securityContext:
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
seccompProfile:
type: RuntimeDefault
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /tmp
name: empty-dir
subPath: tmp-dir
- mountPath: /bitnami/postgresql
name: data
- mountPath: /dev/shm
name: dshm
...
There are related docs by Bitnami:
How can we solve this kind of issue when the default storage class is Longhorn?
@craph Try the workaround https://github.com/ansible/awx-operator/issues/1770#issuecomment-2000166172 by @fosterseth, or deploy a temporary working pod that mounts the same PVC used by PSQL for AWX and fix the permissions.
Hi @kurokobo ,
Thank you very much for the update
I have just created a new temporary pod and changed the permissions on data as requested.
But it looks like the previous data hasn't been migrated. I can't log in to AWX anymore.
I can see a job awx-demo-migration and a pod awx-demo-migration-24.0.0 with state Completed, but I can't log in anymore.
And the old StatefulSet for postgres 13 doesn't exist anymore, but I still have the old PVC for postgres 13.
@kurokobo here is the log of the migration. But now I can't log in to my AWX instance.
Operations to perform:
Apply all migrations: auth, conf, contenttypes, dab_resource_registry, main, oauth2_provider, sessions, sites, social_django, sso
Running migrations:
Applying contenttypes.0001_initial... OK
Applying contenttypes.0002_remove_content_type_name... OK
Applying auth.0001_initial... OK
Applying main.0001_initial... OK
Applying main.0002_squashed_v300_release... OK
Applying main.0003_squashed_v300_v303_updates... OK
Applying main.0004_squashed_v310_release... OK
Applying conf.0001_initial... OK
Applying conf.0002_v310_copy_tower_settings... OK
Applying main.0005_squashed_v310_v313_updates... OK
Applying main.0006_v320_release... OK
Applying main.0007_v320_data_migrations... OK
Applying main.0008_v320_drop_v1_credential_fields... OK
Applying main.0009_v322_add_setting_field_for_activity_stream... OK
Applying main.0010_v322_add_ovirt4_tower_inventory... OK
Applying main.0011_v322_encrypt_survey_passwords... OK
Applying main.0012_v322_update_cred_types... OK
Applying main.0013_v330_multi_credential... OK
Applying auth.0002_alter_permission_name_max_length... OK
Applying auth.0003_alter_user_email_max_length... OK
Applying auth.0004_alter_user_username_opts... OK
Applying auth.0005_alter_user_last_login_null... OK
Applying auth.0006_require_contenttypes_0002... OK
Applying auth.0007_alter_validators_add_error_messages... OK
Applying auth.0008_alter_user_username_max_length... OK
Applying auth.0009_alter_user_last_name_max_length... OK
Applying auth.0010_alter_group_name_max_length... OK
Applying auth.0011_update_proxy_permissions... OK
Applying auth.0012_alter_user_first_name_max_length... OK
Applying conf.0003_v310_JSONField_changes... OK
Applying conf.0004_v320_reencrypt... OK
Applying conf.0005_v330_rename_two_session_settings... OK
Applying conf.0006_v331_ldap_group_type... OK
Applying conf.0007_v380_rename_more_settings... OK
Applying conf.0008_subscriptions... OK
Applying conf.0009_rename_proot_settings... OK
Applying conf.0010_change_to_JSONField... OK
Applying dab_resource_registry.0001_initial... OK
Applying dab_resource_registry.0002_remove_resource_id... OK
Applying dab_resource_registry.0003_alter_resource_object_id... OK
Applying sessions.0001_initial... OK
Applying main.0014_v330_saved_launchtime_configs... OK
Applying main.0015_v330_blank_start_args... OK
Applying main.0016_v330_non_blank_workflow... OK
Applying main.0017_v330_move_deprecated_stdout... OK
Applying main.0018_v330_add_additional_stdout_events... OK
Applying main.0019_v330_custom_virtualenv... OK
Applying main.0020_v330_instancegroup_policies... OK
Applying main.0021_v330_declare_new_rbac_roles... OK
Applying main.0022_v330_create_new_rbac_roles... OK
Applying main.0023_v330_inventory_multicred... OK
Applying main.0024_v330_create_user_session_membership... OK
Applying main.0025_v330_add_oauth_activity_stream_registrar... OK
Applying oauth2_provider.0001_initial... OK
Applying oauth2_provider.0002_auto_20190406_1805... OK
Applying oauth2_provider.0003_auto_20201211_1314... OK
Applying oauth2_provider.0004_auto_20200902_2022... OK
Applying oauth2_provider.0005_auto_20211222_2352... OK
Applying main.0026_v330_delete_authtoken... OK
Applying main.0027_v330_emitted_events... OK
Applying main.0028_v330_add_tower_verify... OK
Applying main.0030_v330_modify_application... OK
Applying main.0031_v330_encrypt_oauth2_secret... OK
Applying main.0032_v330_polymorphic_delete... OK
Applying main.0033_v330_oauth_help_text... OK
2024-03-18 12:37:32,320 INFO [-] rbac_migrations Computing role roots..
2024-03-18 12:37:32,321 INFO [-] rbac_migrations Found 0 roots in 0.000113 seconds, rebuilding ancestry map
2024-03-18 12:37:32,321 INFO [-] rbac_migrations Rebuild ancestors completed in 0.000004 seconds
2024-03-18 12:37:32,321 INFO [-] rbac_migrations Done.
Applying main.0034_v330_delete_user_role... OK
Applying main.0035_v330_more_oauth2_help_text... OK
Applying main.0036_v330_credtype_remove_become_methods... OK
Applying main.0037_v330_remove_legacy_fact_cleanup... OK
Applying main.0038_v330_add_deleted_activitystream_actor... OK
Applying main.0039_v330_custom_venv_help_text... OK
Applying main.0040_v330_unifiedjob_controller_node... OK
Applying main.0041_v330_update_oauth_refreshtoken... OK
2024-03-18 12:37:33,605 INFO [-] rbac_migrations Computing role roots..
2024-03-18 12:37:33,606 INFO [-] rbac_migrations Found 0 roots in 0.000108 seconds, rebuilding ancestry map
2024-03-18 12:37:33,606 INFO [-] rbac_migrations Rebuild ancestors completed in 0.000004 seconds
2024-03-18 12:37:33,606 INFO [-] rbac_migrations Done.
Applying main.0042_v330_org_member_role_deparent... OK
Applying main.0043_v330_oauth2accesstoken_modified... OK
Applying main.0044_v330_add_inventory_update_inventory... OK
Applying main.0045_v330_instance_managed_by_policy... OK
Applying main.0046_v330_remove_client_credentials_grant... OK
Applying main.0047_v330_activitystream_instance... OK
Applying main.0048_v330_django_created_modified_by_model_name... OK
Applying main.0049_v330_validate_instance_capacity_adjustment... OK
Applying main.0050_v340_drop_celery_tables... OK
Applying main.0051_v340_job_slicing... OK
Applying main.0052_v340_remove_project_scm_delete_on_next_update... OK
Applying main.0053_v340_workflow_inventory... OK
Applying main.0054_v340_workflow_convergence... OK
Applying main.0055_v340_add_grafana_notification... OK
Applying main.0056_v350_custom_venv_history... OK
Applying main.0057_v350_remove_become_method_type... OK
Applying main.0058_v350_remove_limit_limit... OK
Applying main.0059_v350_remove_adhoc_limit... OK
Applying main.0060_v350_update_schedule_uniqueness_constraint... OK
Applying main.0061_v350_track_native_credentialtype_source... OK
Applying main.0062_v350_new_playbook_stats... OK
Applying main.0063_v350_org_host_limits... OK
Applying main.0064_v350_analytics_state... OK
Applying main.0065_v350_index_job_status... OK
Applying main.0066_v350_inventorysource_custom_virtualenv... OK
Applying main.0067_v350_credential_plugins... OK
Applying main.0068_v350_index_event_created... OK
Applying main.0069_v350_generate_unique_install_uuid... OK
Applying main.0070_v350_gce_instance_id... OK
Applying main.0071_v350_remove_system_tracking... OK
Applying main.0072_v350_deprecate_fields... OK
Applying main.0073_v360_create_instance_group_m2m... OK
Applying main.0074_v360_migrate_instance_group_relations... OK
Applying main.0075_v360_remove_old_instance_group_relations... OK
Applying main.0076_v360_add_new_instance_group_relations... OK
Applying main.0077_v360_add_default_orderings... OK
Applying main.0078_v360_clear_sessions_tokens_jt... OK
Applying main.0079_v360_rm_implicit_oauth2_apps... OK
Applying main.0080_v360_replace_job_origin... OK
Applying main.0081_v360_notify_on_start... OK
Applying main.0082_v360_webhook_http_method... OK
Applying main.0083_v360_job_branch_override... OK
Applying main.0084_v360_token_description... OK
Applying main.0085_v360_add_notificationtemplate_messages... OK
Applying main.0086_v360_workflow_approval... OK
Applying main.0087_v360_update_credential_injector_help_text... OK
Applying main.0088_v360_dashboard_optimizations... OK
Applying main.0089_v360_new_job_event_types... OK
Applying main.0090_v360_WFJT_prompts... OK
Applying main.0091_v360_approval_node_notifications... OK
Applying main.0092_v360_webhook_mixin... OK
Applying main.0093_v360_personal_access_tokens... OK
Applying main.0094_v360_webhook_mixin2... OK
Applying main.0095_v360_increase_instance_version_length... OK
Applying main.0096_v360_container_groups... OK
Applying main.0097_v360_workflowapproval_approved_or_denied_by... OK
Applying main.0098_v360_rename_cyberark_aim_credential_type... OK
Applying main.0099_v361_license_cleanup... OK
Applying main.0100_v370_projectupdate_job_tags... OK
Applying main.0101_v370_generate_new_uuids_for_iso_nodes... OK
Applying main.0102_v370_unifiedjob_canceled... OK
Applying main.0103_v370_remove_computed_fields... OK
Applying main.0104_v370_cleanup_old_scan_jts... OK
Applying main.0105_v370_remove_jobevent_parent_and_hosts... OK
Applying main.0106_v370_remove_inventory_groups_with_active_failures... OK
Applying main.0107_v370_workflow_convergence_api_toggle... OK
Applying main.0108_v370_unifiedjob_dependencies_processed... OK
2024-03-18 12:37:54,433 INFO [-] rbac_migrations Unified organization migration completed in 0.0183 seconds
2024-03-18 12:37:54,452 INFO [-] rbac_migrations Unified organization migration completed in 0.0184 seconds
2024-03-18 12:37:55,391 INFO [-] rbac_migrations Rebuild parentage completed in 0.003237 seconds
Applying main.0109_v370_job_template_organization_field... OK
Applying main.0110_v370_instance_ip_address... OK
Applying main.0111_v370_delete_channelgroup... OK
Applying main.0112_v370_workflow_node_identifier... OK
Applying main.0113_v370_event_bigint... OK
Applying main.0114_v370_remove_deprecated_manual_inventory_sources... OK
Applying main.0115_v370_schedule_set_null... OK
Applying main.0116_v400_remove_hipchat_notifications... OK
Applying main.0117_v400_remove_cloudforms_inventory... OK
Applying main.0118_add_remote_archive_scm_type... OK
Applying main.0119_inventory_plugins... OK
Applying main.0120_galaxy_credentials... OK
Applying main.0121_delete_toweranalyticsstate... OK
Applying main.0122_really_remove_cloudforms_inventory... OK
Applying main.0123_drop_hg_support... OK
Applying main.0124_execution_environments... OK
Applying main.0125_more_ee_modeling_changes... OK
Applying main.0126_executionenvironment_container_options... OK
Applying main.0127_reset_pod_spec_override... OK
Applying main.0128_organiaztion_read_roles_ee_admin... OK
Applying main.0129_unifiedjob_installed_collections... OK
Applying main.0130_ee_polymorphic_set_null... OK
Applying main.0131_undo_org_polymorphic_ee... OK
Applying main.0132_instancegroup_is_container_group... OK
Applying main.0133_centrify_vault_credtype... OK
Applying main.0134_unifiedjob_ansible_version... OK
Applying main.0135_schedule_sort_fallback_to_id... OK
Applying main.0136_scm_track_submodules... OK
Applying main.0137_custom_inventory_scripts_removal_data... OK
Applying main.0138_custom_inventory_scripts_removal... OK
Applying main.0139_isolated_removal... OK
Applying main.0140_rename... OK
Applying main.0141_remove_isolated_instances... OK
Applying main.0142_update_ee_image_field_description... OK
Applying main.0143_hostmetric... OK
Applying main.0144_event_partitions... OK
Applying main.0145_deregister_managed_ee_objs... OK
Applying main.0146_add_insights_inventory... OK
Applying main.0147_validate_ee_image_field... OK
Applying main.0148_unifiedjob_receptor_unit_id... OK
Applying main.0149_remove_inventory_insights_credential... OK
Applying main.0150_rename_inv_sources_inv_updates... OK
Applying main.0151_rename_managed_by_tower... OK
Applying main.0152_instance_node_type... OK
Applying main.0153_instance_last_seen... OK
Applying main.0154_set_default_uuid... OK
Applying main.0155_improved_health_check... OK
Applying main.0156_capture_mesh_topology... OK
Applying main.0157_inventory_labels... OK
Applying main.0158_make_instance_cpu_decimal... OK
Applying main.0159_deprecate_inventory_source_UoPU_field... OK
Applying main.0160_alter_schedule_rrule... OK
Applying main.0161_unifiedjob_host_status_counts... OK
Applying main.0162_alter_unifiedjob_dependent_jobs... OK
Applying main.0163_convert_job_tags_to_textfield... OK
Applying main.0164_remove_inventorysource_update_on_project_update... OK
Applying main.0165_task_manager_refactor... OK
Applying main.0166_alter_jobevent_host... OK
Applying main.0167_project_signature_validation_credential... OK
Applying main.0168_inventoryupdate_scm_revision... OK
Applying main.0169_jt_prompt_everything_on_launch... OK
Applying main.0170_node_and_link_state... OK
Applying main.0171_add_health_check_started... OK
Applying main.0172_prevent_instance_fallback... OK
Applying main.0173_instancegroup_max_limits... OK
Applying main.0174_ensure_org_ee_admin_roles... OK
Applying main.0175_workflowjob_is_bulk_job... OK
Applying main.0176_inventorysource_scm_branch... OK
Applying main.0177_instance_group_role_addition... OK
2024-03-18 12:38:18,686 INFO [-] awx.main.migrations Initiated migration from Org admin to use role
Applying main.0178_instance_group_admin_migration... OK
Applying main.0179_change_cyberark_plugin_names... OK
Applying main.0180_add_hostmetric_fields... OK
Applying main.0181_hostmetricsummarymonthly... OK
Applying main.0182_constructed_inventory... OK
Applying main.0183_pre_django_upgrade... OK
Applying main.0184_django_indexes... OK
Applying main.0185_move_JSONBlob_to_JSONField... OK
Applying main.0186_drop_django_taggit... OK
Applying main.0187_hop_nodes... OK
Applying main.0188_add_bitbucket_dc_webhook... OK
Applying main.0189_inbound_hop_nodes... OK
Applying main.0190_alter_inventorysource_source_and_more... OK
Applying sites.0001_initial... OK
Applying sites.0002_alter_domain_unique... OK
Applying social_django.0001_initial... OK
Applying social_django.0002_add_related_name... OK
Applying social_django.0003_alter_email_max_length... OK
Applying social_django.0004_auto_20160423_0400... OK
Applying social_django.0005_auto_20160727_2333... OK
Applying social_django.0006_partial... OK
Applying social_django.0007_code_timestamp... OK
Applying social_django.0008_partial_timestamp... OK
Applying social_django.0009_auto_20191118_0520... OK
Applying social_django.0010_uid_db_index... OK
Applying social_django.0011_alter_id_fields... OK
Applying social_django.0012_usersocialauth_extra_data_new... OK
Applying social_django.0013_migrate_extra_data... OK
Applying social_django.0014_remove_usersocialauth_extra_data... OK
Applying social_django.0015_rename_extra_data_new_usersocialauth_extra_data... OK
Applying sso.0001_initial... OK
Applying sso.0002_expand_provider_options... OK
Applying sso.0003_convert_saml_string_to_list... OK
No more data, and the password has been reinitialized... Why wasn't the data migrated even though the log says OK?
@craph Could you provide:
kubectl -n <namespace> get pod
kubectl -n <namespace> get pod <psql pod> -o yaml
@kurokobo,
> kubectl -n awx get pod
NAME READY STATUS RESTARTS AGE
awx-demo-migration-24.0.0-rc45z 0/1 Completed 0 41m
awx-demo-postgres-15-0 1/1 Running 0 42m
awx-demo-task-676cbb9bb5-wm6db 4/4 Running 0 42m
awx-demo-web-7cfb6d6d8-9f4gs 3/3 Running 0 42m
awx-operator-controller-manager-865d646cd8-k7ldz 2/2 Running 0 3d5h
and
> kubectl -n awx get pod awx-demo-postgres-15-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
cattle.io/timestamp: "2024-03-18T12:35:22Z"
cni.projectcalico.org/containerID: c8c74a832a87163b6835a4b62528e56c94e573a778c4494770c94ddc5b825cca
cni.projectcalico.org/podIP: 10.42.42.169/32
cni.projectcalico.org/podIPs: 10.42.42.169/32
creationTimestamp: "2024-03-18T12:35:45Z"
generateName: awx-demo-postgres-15-
labels:
app.kubernetes.io/component: database
app.kubernetes.io/instance: postgres-15-awx-demo
app.kubernetes.io/managed-by: awx-operator
app.kubernetes.io/name: postgres-15
app.kubernetes.io/part-of: awx-demo
controller-revision-hash: awx-demo-postgres-15-7fb855c556
statefulset.kubernetes.io/pod-name: awx-demo-postgres-15-0
name: awx-demo-postgres-15-0
namespace: awx
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: awx-demo-postgres-15
uid: 335a3c7f-db87-470c-ba3f-04a3e43c1368
resourceVersion: "218222804"
uid: 78ddf50b-d558-4105-ab0a-3ae409a0c610
spec:
containers:
- env:
- name: POSTGRESQL_DATABASE
valueFrom:
secretKeyRef:
key: database
name: awx-demo-postgres-configuration
- name: POSTGRESQL_USER
valueFrom:
secretKeyRef:
key: username
name: awx-demo-postgres-configuration
- name: POSTGRESQL_PASSWORD
valueFrom:
secretKeyRef:
key: password
name: awx-demo-postgres-configuration
- name: POSTGRES_DB
valueFrom:
secretKeyRef:
key: database
name: awx-demo-postgres-configuration
- name: POSTGRES_USER
valueFrom:
secretKeyRef:
key: username
name: awx-demo-postgres-configuration
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
key: password
name: awx-demo-postgres-configuration
- name: PGDATA
value: /var/lib/pgsql/data/pgdata
- name: POSTGRES_INITDB_ARGS
value: --auth-host=scram-sha-256
- name: POSTGRES_HOST_AUTH_METHOD
value: scram-sha-256
image: quay.io/sclorg/postgresql-15-c9s:latest
imagePullPolicy: IfNotPresent
name: postgres
ports:
- containerPort: 5432
name: postgres-15
protocol: TCP
resources:
requests:
cpu: 10m
memory: 64Mi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/pgsql/data
name: postgres-15
subPath: data
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-gr59c
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: awx-demo-postgres-15-0
nodeName: myk8sw1
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
subdomain: awx-demo
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: postgres-15
persistentVolumeClaim:
claimName: postgres-15-awx-demo-postgres-15-0
- name: kube-api-access-gr59c
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2024-03-18T12:35:45Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2024-03-18T12:35:57Z"
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2024-03-18T12:35:57Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2024-03-18T12:35:45Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: containerd://bdccd6a0896147020f969a6f3830a03c10e78772edb75457d7b8b2cb2b0b34a3
image: quay.io/sclorg/postgresql-15-c9s:latest
imageID: quay.io/sclorg/postgresql-15-c9s@sha256:0a88d11f9d15cf10014c25a8dab4be33a1f9b956f4ab1fbb51522ab10c70bdca
lastState: {}
name: postgres
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2024-03-18T12:35:57Z"
hostIP: 10.80.98.196
phase: Running
podIP: 10.42.42.169
podIPs:
- ip: 10.42.42.169
qosClass: Burstable
startTime: "2024-03-18T12:35:45Z"
I still have the old postgres 13 PVC; is it possible to redeploy awx-operator version 2.12.2 to use the old PVC?
> kubectl get pvc -n awx
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
postgres-13-awx-demo-postgres-13-0 Bound pvc-eac8a8d5-6d74-4d37-819b-ee89154cd60a 8Gi RWO longhorn 213d
postgres-15-awx-demo-postgres-15-0 Bound pvc-1ebb7d11-d2f5-4f50-9bc6-ca2e045f6031 8Gi RWO longhorn 3d6h
@craph Seems the mounted path is correct, so indeed the DB was not migrated and was initialized as a fresh install. The Operator repeats reconciliation loops until the playbook completes without failed tasks, but if the playbook is run without the old statefulset present, data migration will not occur. I have not been able to check the implementation in detail, but it may be a situation where migration is deemed unnecessary in the next loop, depending on the failed task (not sure).
Anyway, I recommend that you first get a backup of the 13 PVC in some way: pg_dump, or just deploy a working pod, make a tar.gz, and copy it to hand with kubectl cp.
I assume that just deploying AWX with 2.12.2 will reuse the old PVC, but if not, you should be able to get the data back by temporarily setting kubectl scale ... --replicas=0 for the Operator, Task, and Web deployments after a fresh deployment, then restoring PSQL and setting the replicas back to 1.
@kurokobo how can I do a pg_dump on the old 13 PVC? Any advice?
@craph I think that can be done by deploying a PSQL pod that mounts the existing PVC. Since the situation is complicated, a safe method is better to avoid data loss due to unforeseen circumstances. Sorry I can't suggest a wonderful magical way to do this.
> @kurokobo how can I do a pg_dump on the old 13 PVC? Any advice?
Here's my internal doc on how to do it (it might be overkill, but I like to be extra sure). I do believe you'll need the postgres-13 container running in some fashion for this to work, however. Regardless, here's my note:
Dump (backup) the database:
Make sure you have a backup of the awx secrets stored somewhere safe.
The three secrets you will want a backup of are as follows:
awx-postgres-configuration
awx-admin-password
awx-app-credentials
To get started, exec into your pod. (You may need to specify the namespace if your context isn't set)
kubectl exec -it awx-postgres-13-0 -- sh
Take a sql dump of the DB
pg_dump -U awx awx > /root/awx.sql
Exit the pod
ctrl-d until you're back to the OS
Copy the file out of the pod (You may need to specify the namespace if your context isn't set)
kubectl cp awx-postgres-13-0:/root/awx.sql /root/awx/awx.sql
Congrats, you have successfully taken a backup of the AWX Database
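(Side note: the dump and the copy can also be combined into a single step from outside the pod. A sketch, assuming the same pod, user, and database names as above; adjust the namespace to yours:)

# stream the dump directly to a local file
kubectl -n awx exec awx-postgres-13-0 -- pg_dump -U awx awx > awx.sql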
And here's my note on the restore of that dump in a new pod. (again, might be overkill) We also use rancher to make our lives easier, so if that doesn't apply to you, you might need to modify these instructions a little bit.
Make sure helm is added to the cluster
[Helm | Installing Helm](https://helm.sh/docs/intro/install/)
helm repo add awx-operator https://ansible.github.io/awx-operator/
helm repo update
Make sure you have the awx_helm_values.yml file stored in gitlab
helm install awx-operator awx-operator/awx-operator -f awx_helm_values.yml -n awx
Wait for everything to come up (This might take 3 minutes)
Scale it all down
kubectl scale deployment awx-operator-controller-manager --replicas=0
kubectl scale deployment awx-task --replicas=0
kubectl scale deployment awx-web --replicas=0
Wait for everything to go back down except the Postgres pod
Copy the sql file into the pod. (You may need to specify the namespace if your context isn't set)
kubectl cp awx.sql awx-postgres-13-0:/root/awx.sql
Exec into the pod (You may need to specify the namespace if your context isn't set)
kubectl exec -it awx-postgres-13-0 -- sh
WHILE IN POD
Login to the postgres database and drop the existing awx database:
psql -U awx -d postgres
select pg_terminate_backend(pid) from pg_stat_activity where datname='awx' ;
drop database awx ;
create database awx ;
alter database awx owner to awx ;
\q
Restore the database:
psql -U awx -d awx -f awx.sql
Once the restore is complete, enter the awx db and set the password for the awx user
psql -U awx -d awx
\password
ENTER THE PASSWORD FROM THE SECRET
Exit pod with a bunch of ctrl-d until you hit the OS again
Replace your secrets in Rancher
awx-postgres-configuration
awx-admin-password
awx-app-credentials
For good measure, kill the Postgres pod
kubectl delete pods awx-postgres-13-0 -n awx
Finally, scale your operator, web, and tasks pods back up.
kubectl scale deployment awx-operator-controller-manager --replicas=1
kubectl scale deployment awx-task --replicas=3
kubectl scale deployment awx-web --replicas=3
Bonus Step
If you try to login with the admin password and it refuses, you might need to exec into a web pod and run the following:
awx-manage changepassword admin
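(That last command can also be run without an interactive shell. A sketch, assuming the web deployment and its container are both named awx-web, which may differ in your deployment:)

kubectl -n awx exec -it deployment/awx-web -c awx-web -- awx-manage changepassword admin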
> Make sure you have a backup of the awx secrets stored somewhere safe.
Thank you very much for the update @Tfinn92, but I no longer have the awx-postgres-13-0 pod running, because I fixed the permission issue for postgres 15 and the pod for postgres 13 doesn't exist anymore :/
As long as the PV still exists, you should be able to uninstall awx-operator (do not delete the namespace) and install the old one over it. It should reattach the PV to the postgres 13 pod. That being said, I think you're still going to run into issues with the web/task pods coming up correctly. I've noticed that with the change to pg15, the secrets change to reflect it.
For instance, the secret awx-app-credentials changes the line host: awx-postgres-13 to 15. Same with the secret awx-postgres-configuration, so you'll need to change both of those back.
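(For reference, a sketch of checking and flipping that host value on the awx-postgres-configuration secret. The jsonpath/patch approach here is my own assumption, so verify the key layout of your secret first:)

# see which postgres service the configuration currently points at
kubectl -n awx get secret awx-postgres-configuration -o jsonpath='{.data.host}' | base64 -d; echo

# point it back at the postgres 13 service
kubectl -n awx patch secret awx-postgres-configuration \
  -p "{\"data\":{\"host\":\"$(echo -n awx-postgres-13 | base64)\"}}"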
Thank you very much @Tfinn92 for your documentation. Did you try to use https://github.com/ansible/awx-operator/tree/devel/roles/backup ?
I did not, no. I need my backups for DR situations, and having the backup living in the same namespace on the same cluster as the instance doesn't work in that scenario. Sorry :(
Is there any ongoing patch regarding this issue? I have an AWX deployment with kustomize and of course the situation is the same, with "mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied". It's also the same for a new installation. I can't simply move back to previous versions, as I have a remote Postgres DB instance where I already did a manual upgrade to 15.x before upgrading the operator. It seems it's a bug, as the permissions cannot simply be corrected and a "pure" AWX operator deployment will fail. Should I seek a manual workaround, or will this permissions issue be addressed eventually? With respect to the whole AWX dev team.
I think one kind of solution is to add securityContext to the pod spec, like this:
kind: Pod
spec:
  securityContext:
    runAsUser: 26
    runAsGroup: 26
    fsGroup: 26
    fsGroupChangePolicy: Always
    supplementalGroups:
      - 26
but I don't know how to add this to the awx-postgres pod via the AWX CR.
Could you give this PR a try and see if it solves your issue?
You can recover and roll back to version 2.12.2 if your postgresql 13 statefulset is still online: edit the secret awx-postgres-configuration from host: awx-postgres-15 back to host: awx-postgres-13 after changing back the version in helm. You may need to restart your pods after doing so.
A fresh install of awx-operator 2.14.0 still hits this issue.
Was anyone able to test the PR I linked?
I am unable to reproduce this issue on Openshift and minikube. Could someone who is seeing this issue please share their k8s cluster type, cluster version, awx-operator version, storage class, and cloud provider used if applicable?
k8s cluster type: on-prem
cluster version:
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:48:26Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:42:11Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}
awx-operator version: quay.io/ansible/awx-operator:2.13.1
storage class: rook-cephfs
cloud provider: N/A
Solved this issue by adding the following option to the AWX CR (cc @Rory-Z):
postgres_security_context_settings:
  fsGroup: 26
If you have already deployed it, try editing the postgres statefulset and adding fsGroup: 26 to its securityContext.
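(A sketch of that edit as a one-line patch instead of kubectl edit; the statefulset name follows the naming used elsewhere in this thread, so adjust it to yours:)

kubectl -n awx patch statefulset awx-postgres-15 --type merge \
  -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroup":26}}}}}'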
The default permissions and owners of PVs and their subPaths depend on the storage provisioner implementation for the storage class.
Also, securityContext.fsGroup may not be effective in all environments, as it is ignored for some types of PVs, such as hostPath and nfs.
@rooftopcellist
The default storage provisioner for minikube creates directories with 777 for PVCs, so this issue can't be reproduced there.
It should be possible to reproduce this if explicitly configured to use hostPath on minikube:
Create /data/demo on the minikube instance (in the docker container or VM, depending on your driver), then create a PV:
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-postgres-15-volume
spec:
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  capacity:
    storage: 8Gi
  storageClassName: awx-postgres-volume
  hostPath:
    path: /data/demo
Then add postgres_storage_class: awx-postgres-volume to the AWX spec.
Alternatively, following my guide but ignoring the chown and chmod for /data/postgres-15/data can also reproduce this.
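(Conversely, pre-creating the hostPath directory with the expected owner should avoid the crash; a sketch for the PV above, using UID/GID 26 as discussed earlier in this thread:)

# on the minikube node / VM that backs the hostPath
sudo mkdir -p /data/demo
sudo chown 26:26 /data/demo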
I've made minimal tests on #1799 and I can confirm that once my comments in #1799 are resolved, it appears to work as expected.
@wonkyooh security_context_settings is for the web and task PodSecurityContext (pod.spec.securityContext), but postgres_security_context_settings is for the SecurityContext of the postgresql container (pod.spec.containers.securityContext). It confuses users.
When I added postgres_security_context_settings: {"fsGroup": 26} to the AWX CR, it was ignored.
> Was anyone able to test the PR I linked?
> I am unable to reproduce this issue on Openshift and minikube. Could someone who is seeing this issue please share their k8s cluster type, cluster version, awx-operator version, storage class, and cloud provider used if applicable?
@rooftopcellist you have all the details here too if needed : https://github.com/ansible/awx-operator/issues/1775#issuecomment-1999830535
AWX Operator version
2.13.1
AWX version
24.0.0
Kubernetes platform
kubernetes selfhosted with Rancher
Kubernetes/Platform version
v1.25.16+rke2r1
Storage Class
Longhorn
Upgrade from 2.12.2 to 2.13.1
I'm also getting this issue when going from 2.10.0 to 2.14.0. I'm using AKS.
@rooftopcellist here are my details
storage class (default in this case means Azure Disk):
$ kubectl get pvc postgres-13-awx-postgres-13-0 -o jsonpath='{.spec.storageClassName}' -n awx
default
When doing an upgrade, the postgres 15 pod crashes:
kubectl get pods -n awx
NAME READY STATUS RESTARTS AGE
awx-operator-controller-manager-cb46cc5dd-qv5db 2/2 Running 0 13m
awx-postgres-13-0 1/1 Running 0 3d23h
awx-postgres-15-0 0/1 CrashLoopBackOff 7 (45s ago) 12m
Logs in the postgres 15 pod:
kubectl logs awx-postgres-15-0 -n awx
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
Here are my deployment details. Kustomization file (when trying to upgrade to 2.14.0 from 2.10.0):
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.14.0
  - awx.yml
# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.14.0
# Specify a custom namespace in which to install AWX
namespace: awx
And here's my awx.yml file. I'm using the AGIC:
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  labels:
    app: awx
spec:
  service_type: clusterip
  ingress_type: ingress
  ingress_path: /
  ingress_path_type: Exact
  ingress_tls_secret: tlssecret
  hostname: awx.example.org
  projects_storage_size: 500Gi
  ingress_annotations: |
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: tlssecret
    appgw.ingress.kubernetes.io/health-probe-path: /api/v2/ping
    appgw.ingress.kubernetes.io/backend-protocol: http
    appgw.ingress.kubernetes.io/backend-hostname: awx.example.org
---
apiVersion: v1
kind: Service
metadata:
  name: awx-service
spec:
  selector:
    app: awx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8052
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  labels:
    app: awx
  name: awx-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: tlssecret
    appgw.ingress.kubernetes.io/health-probe-path: /api/v2/ping
    appgw.ingress.kubernetes.io/backend-protocol: http
    appgw.ingress.kubernetes.io/backend-hostname: awx.example.org
spec:
  rules:
    - host: awx.example.org
      http:
        paths:
          - path: /
            backend:
              service:
                name: awx
                port:
                  number: 80
            pathType: Exact
One thing I did notice is that when the PVC is created for postgres 15, it doesn't allocate the correct amount of storage specified for projects_storage_size; not sure if that is related or not.
kubectl get pvc -n awx
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
postgres-13-awx-postgres-13-0 Bound pvc-a69fc0f3-929b-4ba8-8c72-8ca1ad15b8af 500Gi RWO default 166d
postgres-15-awx-postgres-15-0 Bound pvc-114e9ae9-9376-496c-b59d-edbe8b5ce4d5 8Gi RWO default 20m
I was able to recover by deleting AWX, the postgres 15 pod and PVC, and redeploying with operator 2.10.
Please weigh in on which PR approach you like better:
spec:
  postgres_data_volume_init: true
  init_postgres_extra_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data
+1 for postgres_data_volume_init
PR #1805 will provide a better user experience, I think.
Thanks for weighing in, all, and for the review of the PR. There is one more potential issue to resolve because of the removal of the postgres_init_container_resource_requirements parameter. More details on the PR.
This was resolved by https://github.com/ansible/awx-operator/pull/1805, which just merged.
Awesome work. I'm hitting this as well. I'm using Kustomize, but referring to the commit sha doesn't seem to change anything.
Any tips on how to include this fix without manual fiddling in the cluster?
How does one fix their environment if they already went to version 2.12? I waited for 2.15 in hopes that the Operator would fix the issue; however, the environment is currently down due to this issue and I am unsure how to correct it. What steps need to be done to correct the broken environment? I see some mentions of init_postgres_extra_commands but am unsure of where values for this parameter need to be placed.
I had the same issue; you need to spawn the following pod:
apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
  namespace: awx-prod
spec:
  containers:
    - image: busybox
      name: pvc-inspector
      command: ["tail"]
      args: ["-f", "/dev/null"]
      volumeMounts:
        - mountPath: /pvc
          name: pvc-mount
  volumes:
    - name: pvc-mount
      persistentVolumeClaim:
        claimName: postgres-15-awx-postgres-15-0
Shell into it and run chown -R 26:26 /pvc/data/
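(Equivalently, without an interactive shell; a sketch using the pod above:)

kubectl -n awx-prod exec pvc-inspector -- chown -R 26:26 /pvc/data/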
Later on you will also need to update CRDs by kubectl apply -n 'awx-prod' --server-side -k "github.com/ansible/awx-operator/config/crd?ref=2.15.0" --force-conflicts
I'm having the same issue with the Postgres 15 pod. While troubleshooting, I accidentally removed the whole namespace (by executing "kustomize delete -k ."); I only noticed later, while troubleshooting Postgres DB connectivity problems, that kustomize also deletes the namespace itself.
My task pods won't start and the web pod is saying: "awx.main.utils.encryption Failed to decrypt.... ....check that your 'SECRET_KEY' value is correct".
I'm sure that "awx-app-secret-key" was rewritten by the kustomize run and I don't have a backup of the old secret. I can connect to the Postgres DB instance and to the AWX DB as well, but have no valid awx-secret-key.
Is there a way to retrieve it from the DB itself, or is it not stored there at all? In other words, is this instance lost by losing "awx-secret-key"?
Please confirm the following
Bug Summary
Updating to 2.13.1 through helm results in the postgres 15 pod having the following error:
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
AWX Operator version
2.13.1
AWX version
24.0.0
Kubernetes platform
kubernetes
Kubernetes/Platform version
Rancher RKE2 v1.26.8+rke2r1 and another on v1.27.10+rke2r1
Modifications
no
Steps to reproduce
Have a cluster with 2.12.2 installed and run:
helm upgrade awx-operator awx-operator/awx-operator
Expected results
pods come up no problem
Actual results
The postgres 15 pod goes into CrashLoopBackOff. Logs show:
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied
Additional information
No response
Operator Logs
No response