CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

wal retention #2757

Open · bilemedimkq opened this issue 3 years ago

bilemedimkq commented 3 years ago

Hello. I'm using postgres operator v5. The database's WAL logs seem to be growing rapidly. Can we apply retention to WAL logs, e.g. keep the last 10 days of WAL and delete the rest?

haf commented 3 years ago

You may be experiencing a replica that has desynced; the WAL might grow because it's keeping to-be-synchronised transactions.

jkatz commented 3 years ago

> You may be experiencing a replica that has desynced; the WAL might grow because it's keeping to-be-synchronised transactions.

PGO does not use replication slots by default -- a replica that is behind will pull WAL from the backup archive. WAL won't be retained for this reason unless one explicitly enables slots.

Edit: @haf ...though yes, reflecting on personal experience, it could also be the case that the replicas are not keeping up.

bilemedimkq commented 3 years ago

This is the configuration I'm using, @jkatz:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: sm-pgo-db
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-ha:centos8-13.4-0
  postgresVersion: 13
  users:
    - name: postgres
    - name: sm-user
      options: "SUPERUSER"
  instances:
    - name: pgha1
      replicas: 3
      resources:
        limits:
          memory: 7Gi
        requests:
          memory: 4.5Gi
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 150Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: sm-pgo-db
                  postgres-operator.crunchydata.com/instance-set: pgha1

  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.33-2
      repoHost:
        dedicated: {}
      configuration:
      - secret:
          name: sm-pgo-db-creds
      global:
        repo2-path: /sm-pgo-db-repo2
        repo1-retention-full: "10"
        repo1-retention-full-type: time
        repo2-retention-full: "10"
        repo2-retention-full-type: time
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 100Gi
      - name: repo2
        schedules:
          incremental: "*/10 * * * *"
        s3:
          bucket: "sm-pgbackrest-incremental"
          endpoint: "s3.fr-par.scw.cloud"
          region: "fr-par"

  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_connections: '2048'

jkatz commented 3 years ago

2048 connections is a lot, which leads me to believe you're pushing a lot of data through.

I think that @haf may be correct that your replicas are not keeping up with your throughput. Based upon latency as well as your current pgBackRest configuration (which does not appear to have any of the performance settings tweaked), you may not be able to push WAL fast enough to your repository.
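
For what it's worth, here is a minimal, hedged sketch of the kind of archive tuning that comment points at: the keys (archive-async, spool-path, process-max) come from the pgBackRest option reference, the values are placeholders, and none of it is taken from this thread. They would sit in the same spec.backups.pgbackrest.global map the cluster above already uses:

  backups:
    pgbackrest:
      global:
        # Placeholder values; check the pgBackRest documentation before relying on them.
        archive-async: "y"                    # queue WAL pushes asynchronously
        spool-path: /pgdata/pgbackrest-spool  # local spool directory for the async queue (needs writable storage)
        process-max: "4"                      # parallel processes for pushing/fetching WAL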

Additionally, it appears your backup retention policy only covers full backups, but you're taking incremental backups every 10 minutes, so the number of backups will grow without bound unless they are pruned. I'm also not sure why you need incremental backups that frequently, but I don't know your problem domain.

Anyway, this does not appear to be a bug but rather an operational support issue. There is guidance in here to get you started on resolving it.
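
As a hedged illustration of that guidance (the cron expressions and retention values here are placeholders, not recommendations from this thread), scheduling periodic full backups alongside less frequent incrementals gives the time-based retention something it can actually expire:

  backups:
    pgbackrest:
      global:
        repo2-retention-full: "10"        # keep roughly 10 days of full backups...
        repo2-retention-full-type: time   # ...expiring older fulls and their dependents
      repos:
      - name: repo2
        schedules:
          full: "0 1 * * 0"           # weekly full backup (placeholder cron)
          incremental: "0 */4 * * *"  # incrementals every 4 hours (placeholder cron)
        s3:
          bucket: "sm-pgbackrest-incremental"
          endpoint: "s3.fr-par.scw.cloud"
          region: "fr-par"

Incremental backups can only be expired once the full backup they depend on expires, which is why having a full-backup schedule matters here.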

bilemedimkq commented 3 years ago

Hi @jkatz !

No, the 2048 connections setting was just something I tried. There are 20-30 connections on average. The database is about 2-3 GB in size.

However, I found a few interesting things.

I am using Longhorn to manage storage. The replica's volume in Longhorn shows as ~14 GB, but when I go into it and check, it's only 2 GB. That looks like a Longhorn problem; I will look into it.

[screenshots: Longhorn volume size vs. actual on-disk usage]

When I checked the repo1 host (local pgBackRest), I saw that it was indeed 20 GB in size, and it's all WAL files. Can we apply retention to WAL files as we do for backups?
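
pgBackRest does have archive (WAL) retention options; as a hedged sketch only, with illustrative values and keys taken from the pgBackRest option reference rather than from this thread, they live in the same global map as the backup retention settings:

  backups:
    pgbackrest:
      global:
        repo1-retention-full: "10"
        repo1-retention-full-type: time
        repo1-retention-archive-type: full  # expire archived WAL relative to full backups
        repo1-retention-archive: "2"        # keep full PITR WAL only for the 2 most recent fulls

Trimming archived WAL more aggressively than the backup retention narrows the point-in-time recovery window, so treat this as a starting point, not a recipe.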

faust64 commented 1 year ago

Same issue here.

My customer wants to deploy a standalone postgres instance using Crunchy. As such, I set up the operator with a demo PostgresCluster.

200Gi of storage. We're now 14 days into my test. I had no time to stress it... and my volume is full of WAL. My customer still doesn't have access to that setup, so I'm pretty sure we're not writing anything; I barely initialized a schema in that DB.

Why is this case closed? What would be the solution?

Note: the statefulset has a single replica. I see the WAL growth is supposed to have to do with some replica that's behind, but to my understanding I don't have any replicas. Meanwhile, the backup jobs started failing this morning because the postgres service is unavailable: its readiness checks fail since my disk is full.

What can I do, besides starting from scratch or growing disks until I hit my cloud provider's limits? Is Crunchy stable to begin with?!

andrewlecuyer commented 1 year ago

Sorry to hear you are having an issue.

This does not sound like a bug in the operator, but rather a configuration issue. Earlier in this thread a community member suggested some potential solutions.

My colleague tried to provide some more context on how the incremental backup configuration may be producing the observed WAL here:

https://github.com/CrunchyData/postgres-operator/issues/2758#issuecomment-939160817

For additional information on setting up backups, the tutorials section in the documentation may be a good place to start:

If you would like to join the postgres operator mailing list and submit the details of your configuration there, a community member or a member of our team may be able to help you troubleshoot the situation.

Of course, if you are interested in commercial / production support, please reach out to info@crunchydata.com.

We will leave this issue open in case there are more details of your configuration you can share, but we generally prefer the mailing list for community assistance.