canonical / postgresql-k8s-operator

A Charmed Operator for running PostgreSQL on Kubernetes
https://charmhub.io/postgresql-k8s
Apache License 2.0

Permission denied when renaming pgdata to pgdata.failed #460

Closed gtato closed 1 month ago

gtato commented 4 months ago

Steps to reproduce

This happened in production, and I haven't reproduced it in a local environment.

At some point replicas go out of sync and try to restore pgdata from the primary, but fail with this error:

2024-04-29 09:16:53 UTC [15]: ERROR: Could not rename data directory /var/lib/postgresql/data/pgdata 
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1314, in remove_data_directory
    shutil.rmtree(self._data_dir)
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
    os.rmdir(path)
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1287, in move_data_directory
    os.rename(self._data_dir, new_name)
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata' -> '/var/lib/postgresql/data/pgdata.failed'
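
Both failing calls in the traceback above (os.rmdir and os.rename on the pgdata directory) change an entry in the parent directory, /var/lib/postgresql/data, so they typically need write and search permission on that parent for the user running Patroni. The following is a minimal diagnostic sketch, not part of the charm or of Patroni; the helper name parent_dir_report is hypothetical, and it assumes it is run inside the postgresql container as the same user that runs Patroni.

```python
import os
import stat


def parent_dir_report(data_dir: str) -> dict:
    """Summarise ownership and mode of the directory that contains data_dir.

    Hypothetical helper for diagnosis only; it changes nothing on disk.
    """
    parent = os.path.dirname(os.path.abspath(data_dir))
    st = os.stat(parent)
    return {
        "parent": parent,
        "owner_uid": st.st_uid,
        "owner_gid": st.st_gid,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # Renaming or removing data_dir modifies the parent directory, so the
        # current user needs write (W_OK) and search (X_OK) permission on it.
        "can_rename_or_remove": os.access(parent, os.W_OK | os.X_OK),
    }


# Example call, using the path from the traceback:
# print(parent_dir_report("/var/lib/postgresql/data/pgdata"))
```

If can_rename_or_remove is False, or the owner of the parent directory differs from the user running Patroni, that would be consistent with the PermissionError shown above.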

Expected behavior

Replicas correctly retrieve the WAL entries from the primary and restore their state.

Actual behavior

Replicas fail to restore pgdata from the primary. This in turn causes the WAL on the primary to keep growing until the primary can no longer function properly.

Versions

Operating system:

Juju CLI: 2.9.49

Juju agent: 3.1.8

Charm revision: 14/edge 198

microk8s: v1.26.15

Log output

Juju debug log:

Additional context

To resolve this issue I used these steps: https://matrix.to/#/!BukWfnyOTgQSKAxdtT:ubuntu.com/$C-iLZEXS39xBD8vFV40EVBWFefNjlvUQmxFDxNcS2p0?via=ubuntu.com&via=matrix.org

However, I am not sure how to prevent this issue in the future.

github-actions[bot] commented 4 months ago

https://warthogs.atlassian.net/browse/DPE-4227

marceloneppel commented 1 month ago

Steps to reproduce on GKE:

juju ssh --container postgresql postgresql-k8s/leader bash # leader
apt update && apt install nano curl -y
nano /var/lib/postgresql/data/patroni.yml

# Remove the other units from both pg_hba sections.

curl -X POST localhost:8008/reload # leader

# Wait 30 seconds.

juju ssh --container postgresql postgresql-k8s/0 bash # replica
apt update && apt install curl -y
curl -X POST localhost:8008/reinitialize

juju ssh --container postgresql postgresql-k8s/1 bash # replica
apt update && apt install curl -y
curl -X POST localhost:8008/reinitialize
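
The two curl calls above can also be made from Python, which avoids installing curl in the container (the traceback shows Python 3.10 is already there). This is a rough sketch under assumptions: the Patroni REST API is reachable at localhost:8008 without TLS or authentication, exactly as the curl commands assume, and the helper names build_patroni_post and patroni_post are hypothetical.

```python
import urllib.request

# Address used by the curl commands above.
PATRONI_API = "http://localhost:8008"


def build_patroni_post(endpoint: str, base_url: str = PATRONI_API) -> urllib.request.Request:
    """Build (but do not send) a body-less POST request for a Patroni REST endpoint."""
    url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}"
    return urllib.request.Request(url, method="POST")


def patroni_post(endpoint: str, base_url: str = PATRONI_API, timeout: float = 10.0) -> tuple[int, str]:
    """Send the POST and return (HTTP status, response body).

    urllib raises urllib.error.HTTPError for non-2xx responses.
    """
    request = build_patroni_post(endpoint, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode()


# On the leader, after editing patroni.yml:
# patroni_post("reload")
# On each replica:
# patroni_post("reinitialize")
```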

The issue is related to the permissions on the volume mounted in the units, as in https://warthogs.atlassian.net/browse/DPE-707. I'll create a PR to fix that.

marceloneppel commented 1 month ago

Hi, @gtato!

Revisions 332 and 333 from the 14/edge channel contain the fix for this issue.