CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

pgbackrest: Unable to get address #3791

Closed mausch closed 9 months ago

mausch commented 11 months ago

Overview

With this rather trivial PostgresCluster:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: postgres
  namespace: postgres
spec:
  backups:
    pgbackrest:
      global:
        archive-async: 'y'
        archive-push-queue-max: 1Gi
        repo1-retention-full: '31'
        repo1-retention-full-type: time
        spool-path: /pgdata/backups
      repos:
        - name: repo1
          schedules:
            full: 0 1 * * *
          volume:
            volumeClaimSpec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 20Gi
  image: >-
    registry.developers.crunchydata.com/crunchydata/crunchy-postgres-gis:ubi8-16.0-3.4-0
  instances:
    - dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1000Gi
      name: instance1
      replicas: 1
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_wal_size: 128MB
          wal_buffers: 2MB
          wal_init_zero: 'off'
          wal_recycle: 'off'
    leaderLeaseDurationSeconds: 30
    port: 8008
    syncPeriodSeconds: 10
  port: 5432
  postGISVersion: 3.4.0
  postgresVersion: 16
  users:
    - name: elevate
      options: SUPERUSER
    - name: postgres

The pgo logs show this error:

time="2023-11-29T11:54:23Z" level=error msg="unable to create stanza" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="command terminated with exit code 49: ERROR: [049]: unable to get address for 'postgres-repo-host-0.postgres-pods.postgres.svc.kubernetes': [-2] Name or service not known\n" file="internal/controller/postgrescluster/pgbackrest.go:2618" func="postgrescluster.(*Reconciler).reconcileStanzaCreate" name=postgres namespace=postgres postgresCluster=postgres/postgres reconcileID=70c5f483-1347-49c9-892f-b65d5be3d06e reconciler=pgBackRest version=5.4.3-0-amd64

The postgres-repo-host-0 pod is running without any visible issues. I have no idea where the hostname postgres-repo-host-0.postgres-pods.postgres.svc.kubernetes comes from. I would expect it to be postgres-repo-host-0.postgres.svc.cluster.local.

I am also not sure what the consequences are. I see thousands of failures in pg_stat_archiver; would this error cause them?

andrewlecuyer commented 11 months ago

@mausch sorry to hear you are having trouble.

The error you are seeing indicates a DNS issue in your environment. Per the Kubernetes "DNS for Services and Pods" docs:

Any Pods exposed by a Service have the following DNS resolution available:

pod-ip-address.service-name.my-namespace.svc.cluster-domain.example.

Therefore, postgres-repo-host-0.postgres-pods.postgres.svc.kubernetes should properly resolve.
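To make the structure of that name easier to see, here is a minimal Python sketch that splits a pod DNS name of the form hostname.service.namespace.svc.cluster-domain into its parts. The `PodDNSName` type and `split_pod_fqdn` helper are hypothetical names introduced for illustration only; they are not part of PGO or Kubernetes.

```python
from typing import NamedTuple


class PodDNSName(NamedTuple):
    hostname: str        # pod hostname (here, the StatefulSet pod name)
    service: str         # Service the pod is exposed through
    namespace: str
    cluster_domain: str  # everything after ".svc."


def split_pod_fqdn(fqdn: str) -> PodDNSName:
    """Split hostname.service.namespace.svc.cluster-domain into its parts."""
    head, sep, cluster_domain = fqdn.rstrip(".").partition(".svc.")
    parts = head.split(".")
    if not sep or len(parts) != 3 or not cluster_domain:
        raise ValueError(f"not a pod DNS name: {fqdn!r}")
    return PodDNSName(*parts, cluster_domain)


name = split_pod_fqdn("postgres-repo-host-0.postgres-pods.postgres.svc.kubernetes")
print(name)
```

For the name in the error message, this gives the pod postgres-repo-host-0, the Service postgres-pods, the namespace postgres, and a cluster domain of kubernetes. Whether kubernetes is actually the cluster domain configured in this environment is something to verify. Note also that postgres-repo-host-0.postgres.svc.cluster.local has no Service component, so it does not match this pattern and the helper rejects it.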

I recommend debugging DNS resolution in your environment to see if any issues exist, e.g. per the following docs:

https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/
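As a complement to the steps in those docs, here is a rough Python sketch of a resolution check that could be run from a pod inside the cluster. It assumes Python is available in that pod, which is not guaranteed for any particular image. The `check_resolution` helper is a hypothetical name for illustration; it only wraps the standard library's `socket.getaddrinfo`.

```python
import socket


def check_resolution(hostname: str) -> str:
    """Return the resolved addresses, or the resolver error, for hostname."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror as err:
        # On Linux with glibc, EAI_NONAME is -2 ("Name or service not known"),
        # the same code that appears in the pgBackRest error message.
        return f"FAILED [{err.errno}] {err.strerror}"
    # Each entry's sockaddr starts with the address; de-duplicate and sort.
    addresses = sorted({info[4][0] for info in infos})
    return "OK " + ", ".join(addresses)
```

Calling `check_resolution("postgres-repo-host-0.postgres-pods.postgres.svc.kubernetes")` from a pod in the postgres namespace, and comparing the result with a lookup of `kubernetes.default.svc`, should show whether only the pod name fails or whether cluster DNS is failing more broadly.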

Otherwise, this does not appear to be an issue with PGO itself, since stanza creation is working just fine in a variety of Kubernetes environments.

dsessler7 commented 9 months ago

We have not heard back, so I am closing this issue. If you need further assistance, feel free to respond here or ask a question in our Discord server.