bilemedimkq opened this issue 3 years ago
You may be experiencing a replica that has fallen out of sync; the WAL might grow because it is retaining transactions that still need to be synchronised to that replica.
PGO does not use replication slots by default; a replica that falls behind pulls WAL from the backup archive, so WAL will not be retained on the primary for this reason unless slots are explicitly enabled.
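For anyone who does want slot-based retention, a minimal sketch of enabling a permanent physical slot through Patroni's dynamic configuration follows; the slot name is illustrative, and note that slot-retained WAL is exactly what can fill a disk when a consumer stalls:

```yaml
# Sketch: permanent physical replication slot via Patroni dynamic configuration.
# Caution: WAL is retained until the slot's consumer catches up, so a stalled
# consumer can fill the data volume. The slot name below is hypothetical.
patroni:
  dynamicConfiguration:
    slots:
      standby_slot:   # hypothetical slot name
        type: physical
```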
Edit: @haf ...though yes, on reflection of personal experience, that could also be the case: the replicas may not be keeping up.
This is the configuration I'm using, @jkatz:
```yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: sm-pgo-db
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.4-0
  postgresVersion: 13
  users:
    - name: postgres
    - name: sm-user
      options: "SUPERUSER"
  instances:
    - name: pgha1
      replicas: 3
      resources:
        limits:
          memory: 7Gi
        requests:
          memory: 4.5Gi
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 150Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    postgres-operator.crunchydata.com/cluster: sm-pgo-db
                    postgres-operator.crunchydata.com/instance-set: pgha1
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-2
      repoHost:
        dedicated: {}
      configuration:
        - secret:
            name: sm-pgo-db-creds
      global:
        repo2-path: /sm-pgo-db-repo2
        repo1-retention-full: "10"
        repo1-retention-full-type: time
        repo2-retention-full: "10"
        repo2-retention-full-type: time
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 100Gi
        - name: repo2
          schedules:
            incremental: "*/10 * * * *"
          s3:
            bucket: "sm-pgbackrest-incremental"
            endpoint: "s3.fr-par.scw.cloud"
            region: "fr-par"
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_connections: '2048'
```
2048 connections is a lot, which leads me to believe you're pushing a lot of data through.
I think that @haf may be correct that your replicas are not keeping up with your throughput. Based upon latency as well as your current pgBackRest configuration (which does not appear to have any of the performance settings tweaked), you may not be able to push WAL fast enough to your repository.
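For illustration, the performance settings being referred to would live under `spec.backups.pgbackrest.global`; a sketch, assuming async archiving suits the workload (the spool path and sizes below are example values, not settings from the original configuration):

```yaml
backups:
  pgbackrest:
    global:
      # Batch WAL segments and push them in the background instead of
      # invoking archive_command once per segment.
      archive-async: "y"
      # Local queue for async archiving; example path, adjust to the pod layout.
      spool-path: /pgdata/pgbackrest-spool
      # If the repository cannot keep up, drop queued WAL rather than fill the
      # disk (trades PITR continuity for availability). Example size.
      archive-push-queue-max: 4GiB
      # Parallel compression/transfer processes. Example value.
      process-max: "2"
```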
Additionally, it appears your backup retention policy applies only to full backups, but you're taking incremental backups every 10 minutes, so the number of backups you have will grow without bound and never be pruned (see the sketch below). I'm also unsure why you need to take incremental backups so frequently, but I do not know what your problem domain is.
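To make that concrete: `repoN-retention-full` only expires anything once there are full backups to count against, and expiring a full backup is what prunes its dependent incrementals and the WAL they need. A sketch that pairs the existing incrementals with a scheduled full backup (the cron expression for the full is an example, not from the original spec):

```yaml
repos:
  - name: repo2
    schedules:
      full: "0 1 * * *"           # example: one full backup per day
      incremental: "*/10 * * * *" # the existing 10-minute incrementals
```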
Anyway, this does not appear to be a bug but rather an operational support issue. There is guidance in here to get you started on resolving it.
Hi @jkatz !
No, there aren't really 2048 connections; I set that value just as an experiment. There are 20-30 connections on average, and the database is about 2-3 GB in size.
However, I found a few interesting things.
I am using Longhorn to manage storage. The replica volume in Longhorn shows ~14 GB, but when I go into it and check, it's only 2 GB. There is a problem with Longhorn; I will look into it.
When I checked the repo1 host (local pgBackRest), I saw that it was indeed 20 GB in size, and it's all WAL files. Can we apply retention to WAL files, as with backups?
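For reference, pgBackRest does expose a separate knob for archived WAL: `repoN-retention-archive` with `repoN-retention-archive-type`. A sketch of what that might look like for repo1 in the spec above; the count is an example, and note this limits how far back point-in-time recovery can reach:

```yaml
backups:
  pgbackrest:
    global:
      # Example: keep WAL only for the 2 most recent full backups; older WAL
      # is expired even if the backups themselves are still retained.
      repo1-retention-archive: "2"
      repo1-retention-archive-type: full
```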
Same issue here.
My customer wants to deploy standalone Postgres instances using Crunchy. So I set up the operator with a demo PostgresCluster.
It has 200Gi of storage, and we're now 14 days into my test. I had no time to stress it... and my volume is full of WAL. My customer still doesn't have access to that setup, so I'm pretty sure we're not writing anything; I barely initialized a schema in that DB.
Why is this case closed? What would be the solution?
Note: the statefulset has a single replica. I see those WAL files would have to do with some replica that's behind, but to my understanding I do not have any replica. Meanwhile, the backup jobs started failing this morning because the Postgres service is unavailable: its readiness checks are failing since my disk is full.
What can I do, besides starting from scratch or growing disks until I reach my cloud provider's limits? Is Crunchy stable to begin with?!
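For what it's worth, one stopgap when the data volume fills, assuming the storage class supports volume expansion, is to grow the PVC in place through the instance spec; the instance name and size below are purely illustrative:

```yaml
instances:
  - name: instance1    # hypothetical instance set name; use your own
    dataVolumeClaimSpec:
      accessModes:
        - "ReadWriteOnce"
      resources:
        requests:
          storage: 300Gi   # example: raised from the original request
```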
Sorry to hear you are having an issue.
This does not sound like a bug in the operator, but rather a configuration issue. Earlier in this thread a community member suggested some potential solutions.
My colleague tried to provide some more context on how the incremental backup configuration may be producing the observed WAL here:
https://github.com/CrunchyData/postgres-operator/issues/2758#issuecomment-939160817
For additional information on setting up backups, the tutorials section in the documentation may be a good place to start.
If you would like to join the postgres operator mailing list and submit the details of your configuration there, a community member or someone on our team may be able to help you troubleshoot the situation.
Of course if you are interested in commercial / production support, please reach out to info@crunchydata.com.
We will leave this issue open to the extent there are any more details of your configuration you can share, but generally prefer to use the mailing list for community assistance.
Hello. I use Postgres Operator v5. It looks like the WAL of the database is growing rapidly. Can we apply retention to WAL, as with backups? For example, keeping the last 10 days of WAL and deleting anything older?
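For reference, a sketch of what roughly 10-day retention could look like using pgBackRest's time-based retention, assuming a scheduled full backup exists for expiration to count against (WAL needed by retained backups is kept; older WAL is pruned as backups expire). The cron value is an example:

```yaml
backups:
  pgbackrest:
    global:
      repo1-retention-full: "10"       # keep full backups from the last 10 days
      repo1-retention-full-type: time
    repos:
      - name: repo1
        schedules:
          full: "0 1 * * *"            # example: daily full backup
```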