CrunchyData / postgres-operator-examples

Examples for deploying applications with PGO, the Postgres Operator from Crunchy Data
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Postgres-instance stopped working "could not write lock file "postmaster.pid": No space left on device" #275

Open rg2609 opened 1 month ago

rg2609 commented 1 month ago

I installed the Crunchy Data Postgres Operator using the Helm chart, and I'm using a 200GB PVC backed by NFS storage. When I run kubectl get pods, the output shows that 1 of the 4 instances has stopped working.

kubectl get pods | grep -E "READY|postgres-instance"
NAME                                    READY   STATUS             RESTARTS   AGE
abc-postgres-instance1-c7ck-0           3/4     Running            0          32d

The error that I'm encountering is:

2024-07-12 21:20:21,317 INFO:  stderr=2024-07-12 21:20:21.317 UTC [2615762] FATAL:  could not write lock file "postmaster.pid": No space left on device

Upon further investigation, I found that the /pgdata/pg16_wal directory is using up all the space.

When I run kubectl exec -it dravoka-postgres-instance1-c7ck-0 -- /bin/bash and then df -h, I get the following output:

Filesystem                                                      Size  Used Avail Use% Mounted on
overlay                                                         124G   51G   74G  41% /
tmpfs                                                            64M     0   64M   0% /dev
172.16.215.54:/export/pvc-c862217c-39c4-4be0-a8a1-1717c450b2d1  196G  196G     0 100% /pgdata
/dev/root                                                       124G   51G   74G  41% /tmp
tmpfs                                                            57G   24K   57G   1% /pgconf/tls
tmpfs                                                            57G   24K   57G   1% /etc/database-containerinfo
tmpfs                                                            57G   16K   57G   1% /etc/patroni
tmpfs                                                            57G     0   57G   0% /dev/shm
tmpfs                                                            57G   24K   57G   1% /etc/pgbackrest/conf.d
tmpfs                                                            57G   12K   57G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                            32G     0   32G   0% /proc/acpi
tmpfs                                                            32G     0   32G   0% /proc/scsi
tmpfs                                                            32G     0   32G   0% /sys/firmware

Is there a way to configure the WAL size limit and archive the old WAL files while also deleting the old archived files?

dsessler7 commented 1 month ago

Hey @rg2609!

When we see the WAL fill up like this, it is almost always due to backups not being run frequently enough. When a backup runs, it captures the current state of the database, allowing postgres to clear out the WAL files as they are no longer needed.

I recommend that you set up a schedule for your backups, or if you already have a schedule set, try increasing the frequency with which you run your backups so that the WAL can be regularly flushed out. Here are the docs for all things backups:

https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/backups-disaster-recovery
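
As a rough illustration of what a schedule could look like, here is a minimal sketch of the backup section of a PostgresCluster manifest. The cluster name "abc", the repo name "repo1", the cron times, and the 14-day retention are assumptions chosen for the example, not values taken from this cluster, so adjust them to match your own spec.

```yaml
# Fragment of a PostgresCluster manifest; merge it into the existing spec.
# Assumptions: the cluster is named "abc" and its pgBackRest repo is "repo1".
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: abc  # assumed cluster name
spec:
  backups:
    pgbackrest:
      global:
        # Retain full backups for 14 days (illustrative value).
        repo1-retention-full: "14"
        repo1-retention-full-type: time
      repos:
      - name: repo1  # assumed repo name; keep the repo's existing storage settings
        schedules:
          # Standard cron syntax
          full: "0 1 * * 0"            # weekly full backup, Sunday 01:00
          differential: "0 1 * * 1-6"  # differential backup, Monday to Saturday 01:00
```

The retention settings are what cover the "deleting the old archived files" part of the question: once a full backup expires, pgBackRest can also expire the archived WAL that only that backup needed.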

If you need further assistance, I recommend that you join our Discord group and ask questions there as it is a more active forum for the postgres-operator community.