cloudnative-pg / charts

CloudNativePG Helm Charts
Apache License 2.0

Backups getting stuck in walArchivingFailing phase #262

Closed vatsal-kavida closed 6 months ago

vatsal-kavida commented 7 months ago

I have deployed the CNPG operator and a cluster successfully on my EKS cluster, but the scheduled backups are getting stuck in the walArchivingFailing phase. Below is the Helm values.yaml file for the cluster:

`# -- Override the name of the chart
nameOverride: ""
# -- Override the full name of the chart
fullnameOverride: ""

###
# -- Type of the CNPG database. Available types:
# * `postgresql`
# * `postgis`
type: postgresql

###
# -- Cluster mode of operation. Available modes:
# * `standalone` - default mode. Creates new or updates an existing CNPG cluster.
# * `replica` - Creates a replica cluster from an existing CNPG cluster. # TODO
# * `recovery` - Same as standalone but creates a cluster from a backup, object store or via pg_basebackup.
mode: standalone

recovery:
  ##
  # -- Available recovery methods:
  # * `backup` - Recovers a CNPG cluster from a CNPG backup (PITR supported) Needs to be on the same cluster in the same namespace.
  # * `object_store` - Recovers a CNPG cluster from a barman object store (PITR supported).
  # * `pg_basebackup` - Recovers a CNPG cluster via the streaming replication protocol. Useful if you want to
  #        migrate databases to CloudNativePG, even from outside Kubernetes. # TODO
  method: backup

  ## -- Point in time recovery target. Specify one of the following:
  pitrTarget:
    # -- Time in RFC3339 format
    time: ""

  ##
  # -- Backup Recovery Method
  backupName: ""  # Name of the backup to recover from. Required if method is `backup`.

  ##
  # -- The original cluster name when used in backups. Also known as serverName.
  clusterName: ""
  # -- Overrides the provider specific default endpoint. Defaults to:
  # S3: https://s3.<region>.amazonaws.com
  # Leave empty if using the default S3 endpoint
  endpointURL: ""
  # -- Specifies a CA bundle to validate a privately signed certificate.
  endpointCA:
    # -- Creates a secret with the given value if true, otherwise uses an existing secret.
    create: false
    name: ""
    key: ""
    value: ""
  # -- Overrides the provider specific default path. Defaults to:
  # S3: s3://<bucket><path>
  # Azure: https://<storageAccount>.<serviceName>.core.windows.net/<containerName><path>
  # Google: gs://<bucket><path>
  destinationPath: ""
  # -- One of `s3`, `azure` or `google`
  provider: s3
  s3:
    region: ""
    bucket: ""
    path: "/"
    accessKey: ""
    secretKey: ""
  azure:
    path: "/"
    connectionString: ""
    storageAccount: ""
    storageKey: ""
    storageSasToken: ""
    containerName: ""
    serviceName: blob
    inheritFromAzureAD: false
  google:
    path: "/"
    bucket: ""
    gkeEnvironment: false
    applicationCredentials: ""

cluster:
  # -- Number of instances
  instances: 2

  # -- Name of the container image, supporting both tags (<image>:<tag>) and digests for deterministic and repeatable deployments:
  # <image>:<tag>@sha256:<digestValue>
  imageName: ""  # Default value depends on type (postgresql/postgis/timescaledb)

  # -- Image pull policy. One of Always, Never or IfNotPresent. If not defined, it defaults to IfNotPresent. Cannot be updated.
  # More info: https://kubernetes.io/docs/concepts/containers/images#updating-images
  imagePullPolicy: Always

  # -- The list of pull secrets to be used to pull the images.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-LocalObjectReference
  imagePullSecrets: []

  storage:
    size: 20Gi
    storageClass: "gp3"

  # -- The UID of the postgres user inside the image, defaults to 26
  postgresUID: 26

  # -- The GID of the postgres user inside the image, defaults to 26
  postgresGID: 26

  # -- Resources requirements of every generated Pod.
  # Please refer to https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ for more information.
  # We strongly advise you use the same setting for limits and requests so that your cluster pods are given a Guaranteed QoS.
  # See: https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
  resources:
    limits:
      cpu: 2000m
      memory: 8Gi
    requests:
      cpu: 200m
      memory: 1Gi

  priorityClassName: ""

  # -- Method to follow to upgrade the primary server during a rolling update procedure, after all replicas have been
  # successfully updated. It can be switchover (default) or in-place (restart).
  primaryUpdateMethod: switchover

  # -- Strategy to follow to upgrade the primary server during a rolling update procedure, after all replicas have been
  # successfully updated: it can be automated (unsupervised - default) or manual (supervised)
  primaryUpdateStrategy: unsupervised

  # -- The instances' log level, one of the following values: error, warning, info (default), debug, trace
  logLevel: "debug"

  # -- Affinity/Anti-affinity rules for Pods.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-AffinityConfiguration
  affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: Deployment
                  operator: In
                  values:
                    - copilot
#    topologyKey: topology.kubernetes.io/zone

  # -- The configuration for the CA and related certificates.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-CertificatesConfiguration
  certificates:

  # -- When this option is enabled, the operator will use the SuperuserSecret to update the postgres user password.
  # If the secret is not present, the operator will automatically create one.
  # When this option is disabled, the operator will ignore the SuperuserSecret content, delete it when automatically created,
  # and then blank the password of the postgres user by setting it to NULL.
  enableSuperuserAccess: true
  superuserSecret: ""

  monitoring:
    # -- Whether to enable monitoring
    enabled: true
    podMonitor:
      # -- Whether to enable the PodMonitor
      enabled: true
    prometheusRule:
      # -- Whether to enable the PrometheusRule automated alerts
      enabled: true
      # -- Exclude specified rules
      excludeRules: []
        # - CNPGClusterZoneSpreadWarning
    # -- Custom Prometheus metrics
    customQueries:
       - name: "pg_cache_hit_ratio"
         query: "SELECT current_database() as datname, sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio FROM pg_statio_user_tables;"
         metrics:
           - datname:
               usage: "LABEL"
               description: "Name of the database"
           - ratio:
               usage: GAUGE
               description: "Cache hit ratio"

  # -- Configuration of the PostgreSQL server.
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration
  postgresql: {}
    # max_connections: 300

  # -- BootstrapInitDB is the configuration of the bootstrap process when initdb is used.
  # See: https://cloudnative-pg.io/documentation/current/bootstrap/
  # See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-bootstrapinitdb
  initdb: {}
    # database: app
    # owner: "" # Defaults to the database name
    # secret: "" # Name of the secret containing the initial credentials for the owner of the user database. If empty a new secret will be created from scratch
    # postInitSQL:
    #   - CREATE EXTENSION IF NOT EXISTS vector;

  additionalLabels: {}
  annotations:
    cnpg.io/skipWalArchiving: "enabled"

backups:
  # -- You need to configure backups manually, so backups are disabled by default.
  enabled: true

  # -- Overrides the provider specific default endpoint. Defaults to:
  # S3: https://s3.<region>.amazonaws.com
  endpointURL: ""  # Leave empty if using the default S3 endpoint
  # -- Specifies a CA bundle to validate a privately signed certificate.
  endpointCA:
    # -- Creates a secret with the given value if true, otherwise uses an existing secret.
    create: false
    name: ""
    key: ""
    value: ""

  # -- Overrides the provider specific default path. Defaults to:
  # S3: s3://<bucket><path>
  # Azure: https://<storageAccount>.<serviceName>.core.windows.net/<containerName><path>
  # Google: gs://<bucket><path>
  destinationPath: ""
  # -- One of `s3`, `azure` or `google`
  provider: s3
  s3:
    region: "us-east-1"
    bucket: "ds-mongodb-backup"
    path: "/postgresbackup/"
    accessKey: "AKIAZMOEICVTYDS2R5RXHB"
    secretKey: "eWKStVPqqI2nzq5uYCJKDWlOTxQT2RjlsWFsnx2CYR"
  azure:
    path: "/"
    connectionString: ""
    storageAccount: ""
    storageKey: ""
    storageSasToken: ""
    containerName: ""
    serviceName: blob
    inheritFromAzureAD: false
  google:
    path: "/"
    bucket: ""
    gkeEnvironment: false
    applicationCredentials: ""

#  wal:
#    # -- WAL compression method. One of `` (for no compression), `gzip`, `bzip2` or `snappy`.
#    compression: gzip
#    # -- Whether to instruct the storage provider to encrypt WAL files. One of `` (use the storage container default), `AES256` or `aws:kms`.
#    encryption: AES256
#    # -- Number of WAL files to be archived or restored in parallel.
#    maxParallel: 1
  data:
    # -- Data compression method. One of `` (for no compression), `gzip`, `bzip2` or `snappy`.
    compression: gzip
    # -- Whether to instruct the storage provider to encrypt data files. One of `` (use the storage container default), `AES256` or `aws:kms`.
    encryption: AES256
    # -- Number of data files to be archived or restored in parallel.
    jobs: 2

  scheduledBackups:
    - name: daily-backup
      schedule: "0 0 10 * * *"
      backupOwnerReference: self

  # -- Retention policy for backups
  retentionPolicy: "15d"

pooler:
  # -- Whether to enable PgBouncer
  enabled: true
  # -- PgBouncer pooling mode
  poolMode: transaction
  # -- Number of PgBouncer instances
  instances: 1
  # -- PgBouncer configuration parameters
  parameters:
    max_client_conn: "1000"
    default_pool_size: "25"

  monitoring:
    # -- Whether to enable monitoring
    enabled: true
    podMonitor:
      # -- Whether to enable the PodMonitor
      enabled: true

  # -- Custom PgBouncer deployment template.
  # Use to override image, specify resources, etc.
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: Deployment
                operator: In
                values:
                  - copilot

  template: {}`

Could anyone please help me understand why it is getting stuck and not working properly? Also, could anyone tell me where I can find the backup logs?

gpothier commented 7 months ago

The backup logs are in the same pod as the main postgres instance. You should be able to find the cause of the failure there. I had the same problem, and in my particular case it was because the S3 bucket already had some data in it from a previous test. The error looked like this:

ERROR: WAL archive check failed for server hippo-cluster: Expected empty archive
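
To pull lines like the one above out of the primary's logs, something along these lines may help. It is only a sketch: `NAMESPACE`, `CLUSTER`, `filter_wal_lines` and `show_wal_logs` are hypothetical names chosen for illustration, and it assumes `kubectl` is pointed at the right cluster and that the Cluster resource reports the primary pod name in `status.currentPrimary`.

```shell
# Hypothetical defaults: set these to your own namespace and cluster name.
NAMESPACE="${NAMESPACE:-default}"
CLUSTER="${CLUSTER:-my-cluster}"

# Keep only the log lines that mention WAL, archiving, or barman.
filter_wal_lines() {
  grep -Ei 'wal|archiv|barman'
}

# Look up the current primary pod from the Cluster status,
# then print the WAL/archive related lines from its logs.
show_wal_logs() {
  primary=$(kubectl get clusters.postgresql.cnpg.io "$CLUSTER" -n "$NAMESPACE" \
    -o jsonpath='{.status.currentPrimary}') || return 1
  kubectl logs "$primary" -n "$NAMESPACE" | filter_wal_lines
}

# Usage (assumed names): NAMESPACE=db CLUSTER=my-cluster show_wal_logs
```

The filter pattern is deliberately broad, so expect some unrelated lines. Narrow it once you know what the failure looks like in your logs.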

itay-grudev commented 6 months ago

I'm closing this for now. If you are still having the issue, could you please attach some logs from the primary instance?