CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Unable to get Postgres cluster replicas up in an IPv6 environment #3463

Closed scriptac closed 1 year ago

scriptac commented 1 year ago

Overview

I'm unable to get the secondary replica up in my Postgres deployment. I also did the same deployment in a different test lab with the same Kubernetes, Helm, Postgres and PGO versions. The only difference is that the test lab is IPv4, while the environment I'm deploying in now is IPv6.

Environment

Platform: robin.io
Platform Version: 5.3.11-217
Kubernetes Version: v1.21.5
Helm Version: v3.5.3
PGO Image Tag: ubi8-5.1.1-0
Postgres Version: 14

It is a restricted access environment, where I'm only the namespace admin and not the cluster admin.

Issue

The Helm chart creates 2 replicas, so 2 pods are created, and the leader is elected successfully. The leader pod is up and running, but the replica is not.

pg14-prod-inst1-8d4d-0      3/4     Running   2          71m
pg14-prod-inst1-r9mv-0      4/4     Running   0          71m
pg14-prod-repo-host-0       2/2     Running   0          71m
pgo-788784fc78-kvtqz        1/1     Running   1          102m
pgo-upgrade-b9b5d9b-csxj5   1/1     Running   1          102m

The logs of the failed container:

2022-11-17 11:13:57,669 INFO: Lock owner: pg14-prod-inst1-r9mv-0; I am pg14-prod-inst1-8d4d-0
2022-11-17 11:13:57,681 INFO: Local timeline=None lsn=None
2022-11-17 11:13:57,681 INFO: Lock owner: pg14-prod-inst1-r9mv-0; I am pg14-prod-inst1-8d4d-0
2022-11-17 11:13:57,682 INFO: starting as a secondary
2022-11-17 11:13:57 UTC [3779]: [1-1] user=,db=,app=,client= LOG:  pgaudit extension initialized
2022-11-17 11:13:57,879 INFO: postmaster pid=3779
/tmp/postgres:5432 - no response
2022-11-17 11:13:58,887 ERROR: postmaster is not running
2022-11-17 11:14:07,662 WARNING: Postgresql is not running.
2022-11-17 11:14:07,662 INFO: Lock owner: pg14-prod-inst1-r9mv-0; I am pg14-prod-inst1-8d4d-0
2022-11-17 11:14:07,666 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202107181
  Database system identifier: 7166932571582574678
  Database cluster state: in production
  pg_control last modified: Thu Nov 17 10:38:53 2022
  Latest checkpoint location: 0/202A628
  Latest checkpoint's REDO location: 0/2000028
  Latest checkpoint's REDO WAL file: 000000010000000000000002
  Latest checkpoint's TimeLineID: 1
  Latest checkpoint's PrevTimeLineID: 1
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:750
  Latest checkpoint's NextOID: 24576
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 726
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 750
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Nov 17 10:38:52 2022
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 2000
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: a2578edc86c7ceebde779ed34d96616daaacacdc136864515a44fe9a04d29994

2022-11-17 11:14:07,678 INFO: doing crash recovery in a single user mode
2022-11-17 11:14:07,679 ERROR: Error when reading postmaster.opts
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 407, in read_postmaster_opts
    with open(os.path.join(self._postgresql.data_dir, 'postmaster.opts')) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/pgdata/pg14/postmaster.opts'
2022-11-17 11:14:07,697 ERROR: Crash recovery finished with code=-7
2022-11-17 11:14:07,698 INFO:  stdout=
2022-11-17 11:14:07,698 INFO:  stderr=
2022-11-17 11:14:17,658 WARNING: Postgresql is not running.
2022-11-17 11:14:17,658 INFO: Lock owner: pg14-prod-inst1-r9mv-0; I am pg14-prod-inst1-8d4d-0
2022-11-17 11:14:17,662 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202107181
  Database system identifier: 7166932571582574678
  Database cluster state: in production
  pg_control last modified: Thu Nov 17 10:38:53 2022
  Latest checkpoint location: 0/202A628
  Latest checkpoint's REDO location: 0/2000028
  Latest checkpoint's REDO WAL file: 000000010000000000000002
  Latest checkpoint's TimeLineID: 1
  Latest checkpoint's PrevTimeLineID: 1
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:750
  Latest checkpoint's NextOID: 24576
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 726
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 750
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Nov 17 10:38:52 2022
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 2000
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: a2578edc86c7ceebde779ed34d96616daaacacdc136864515a44fe9a04d29994

My values.yaml for my PGO Chart:

---
# controllerImages are used to run the PostgresCluster and PGUpgrade controllers.
controllerImages:
  cluster: image_repo/postgres-operator:ubi8-5.1.1-0
  upgrade: image_repo/postgres-operator-upgrade:ubi8-5.1.1-0

# relatedImages are used when an image is omitted from PostgresCluster or PGUpgrade specs.
relatedImages:
  postgres_14:
    image: image_repo/crunchy-postgres:ubi8-14.3-0
  postgres_14_gis_3.1:
    image: image_repo/crunchy-postgres-gis:ubi8-14.3-3.1-0
  postgres_14_gis_3.2:
    image: image_repo/crunchy-postgres-gis:ubi8-14.3-3.2-0
  postgres_13:
    image: image_repo/crunchy-postgres:ubi8-13.7-0
  postgres_13_gis_3.0:
    image: image_repo/crunchy-postgres-gis:ubi8-13.7-3.0-0
  postgres_13_gis_3.1:
    image: image_repo/crunchy-postgres-gis:ubi8-13.7-3.1-0
  pgadmin:
    image: image_repo/crunchy-pgadmin4:ubi8-4.30-1
  pgbackrest:
    image: image_repo/crunchy-pgbackrest:ubi8-2.38-1
  pgbouncer:
    image: image_repo/crunchy-pgbouncer:ubi8-1.16-3
  pgexporter:
    image: image_repo/crunchy-postgres-exporter:ubi8-5.1.1-0
  pgupgrade:
    image: image_repo/crunchy-upgrade:ubi8-5.1.1-0
# singleNamespace controls where PGO watches for PostgresClusters. When false,
# PGO watches for and responds to PostgresClusters in all namespaces. When true,
# PGO watches only the namespace in which it is installed.
singleNamespace: false

# debug allows you to enable or disable the "debug" level of logging.
debug: true

disable_check_for_upgrades: true

# imagePullSecretNames is a list of secret names to use for pulling controller images.
# More info: https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod
imagePullSecretNames: []
resources:
  requests:
    memory: "64Mi"
    cpu: "2m"
  limits:
    memory: "256Mi"
    cpu: "100m"

My values.yaml file for the postgres cluster:

# Production Database Example
# Remember to go through all the values before deploying!

name: pg14-prod
postgresVersion: 14

#instanceCPU: 8
#instanceMemory: "32Gi"

pgBackRestConfig:
 global:
#   server: 
#    tls-server-address: "::"
#   server-ping:
#    tls-server-address: localhost
  repo1-retention-full: "1"
  repo1-retention-full-type: count
 manual:
  options:
   - --type=full
  repoName: repo1
 repos:
  - name: repo1
    schedules:
     full: 0 0 * * *
    volume:
     volumeClaimSpec:
      accessModes:
       - ReadWriteOnce
      storageClassName: "robin-nflw-gid-1987"
      resources:
       requests:
        storage: "600Gi"

 repoHost:
  resources:
   limits:
    cpu: "200m"
    memory: "128Mi"
    #hugepages-2Mi: "500Mi"
   #requests:
    #cpu: "200m"
    #memory: "128Mi"

 restore:
  enabled: false
  repoName: repo1
  resources:
   limits:
    cpu: "200m"
    memory: "128Mi"
    #hugepages-2Mi: "500Mi"
   #requests:
    #cpu: "200m"
    #memory: "128Mi"

 jobs:
  resources:
   limits:
    cpu: "200m"
    memory: "128Mi"
    #hugepages-2Mi: "500Mi"
   #requests:
    #cpu: "200m"
    #memory: "128Mi"

 sidecars:
  pgbackrest:
   resources:
    limits:
     cpu: "200m"
     memory: "128Mi"
     #hugepages-2Mi: "500Mi"
    #requests:
     #cpu: "200m"
     #memory: "128Mi"
  pgbackrestConfig:
   resources:
    limits:
     cpu: "200m"
     memory: "128Mi"
     #hugepages-2Mi: "500Mi"
    #requests:
     #cpu: "200m"
     #memory: "128Mi"

#pgBouncerConfig:
# - resources:
#    limits:
#     cpu: 200m
#     memory: 128Mi
#   sidecars:
#    pgbouncerConfig:
#     resources:
#      limits:
#       cpu: 200m
#       memory: 128Mi

#dataSource:
# - resources:
#    limits:
#     cpu: 200m
#     memory: 128Mi

instances:
 - dataVolumeClaimSpec:
    accessModes:
     - ReadWriteOnce
    storageClassName: "robin-nflw-gid-1987"
    resources:
     requests:
      storage: "200Gi"
   name: inst1
   replicas: 2
   resources:
    limits:
     cpu: 8
     memory: "32Gi"
     hugepages-2Mi: "0"
    requests:
     cpu: 8
     memory: "32Gi"

   sidecars:
    replicaCertCopy:
     resources:
      limits:
       cpu: "200m"
       memory: "128Mi"
       hugepages-2Mi: "0"
      #requests:
       #cpu: "200m"
       #memory: "128Mi"

   walVolumeClaimSpec:
    accessModes:
     - ReadWriteOnce
    storageClassName: "robin-nflw-gid-1987"
    resources:
      requests:
       storage: "12Gi"
   affinity:
    podAntiAffinity:
     preferredDuringSchedulingIgnoredDuringExecution:
     - weight: 1
       podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
         matchLabels:
          postgres-operator.crunchydata.com/cluster: pg14-prod
          postgres-operator.crunchydata.com/instance-set: pg14-prod-inst1

openshift: false
patroni:
 dynamicConfiguration:
  postgresql:
   parameters:
    TimeZone: Europe/Helsinki
    listen_addresses: '::'
    autovacuum_freeze_max_age: 1000000000
    autovacuum_max_workers: 10
    autovacuum_multixact_freeze_max_age: 1000000000
    autovacuum_naptime: 5s
    autovacuum_vacuum_cost_delay: 0
    effective_cache_size: 21GB
    idle_in_transaction_session_timeout: 30min
    log_autovacuum_min_duration: 1min
    log_line_prefix: '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
    log_lock_waits: true
    log_min_duration_statement: 6s
    log_parameter_max_length: 100
    log_temp_files: 100MB
    maintenance_work_mem: 128MB
    max_connections: 2000
    min_wal_size: 1GB
    max_wal_size: 10GB
    pg_stat_statements.track: all
    shared_buffers: 8GB
    shared_preload_libraries: pg_stat_statements
    tcp_keepalives_count: 9
    tcp_keepalives_idle: 7200
    tcp_keepalives_interval: 75
    track_activity_query_size: 2048
    vacuum_freeze_min_age: 1000000000
    vacuum_freeze_table_age: 1000000000
    vacuum_multixact_freeze_min_age: 5000000
    vacuum_multixact_freeze_table_age: 150000000
    wal_level: replica
    work_mem: 16MB

users:
 - name: postgres

 - databases:
    - archivedb
   name: archive

Solutions Tried So Far

Added the listen_addresses parameter under patroni.dynamicConfiguration.postgresql.parameters in the above YAML.

Created the following Config Map:

apiVersion: v1
data:
  pgbr-custom.conf: |
    [global:server]
    tls-server-address = ::

    [global:server-ping]
    tls-server-address = localhost
kind: ConfigMap
metadata:
  creationTimestamp: "2022-11-16T13:55:08Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:pgbr-custom.conf: {}
    manager: kubectl
    operation: Update
    time: "2022-11-16T13:55:08Z"
  name: my-pgbr-config
  namespace: ns-nk-4gc-b2c-nflw
  resourceVersion: "39083266"
  uid: 029859e7-83ff-49f7-aea1-6df237c2c5ac
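
For readers who want to recreate this ConfigMap, a minimal sketch of the same object follows. It keeps only the fields shown above and drops the server-populated metadata (creationTimestamp, managedFields, resourceVersion, uid), on the assumption that the API server fills those in again when the manifest is applied. The name and namespace are the ones from this issue and would need adjusting elsewhere.

```yaml
# Sketch only: same data as the ConfigMap above, without server-managed metadata.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-pgbr-config            # name used in this issue
  namespace: ns-nk-4gc-b2c-nflw   # namespace used in this issue
data:
  pgbr-custom.conf: |
    [global:server]
    tls-server-address = ::

    [global:server-ping]
    tls-server-address = localhost
```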

Added the annotation for IPv6 in my templates/postgres.yaml:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: {{ default .Release.Name .Values.name }}
  annotations:
    postgres-operator.crunchydata.com/pgbackrest-ip-version: IPv6
spec:
  postgresVersion: {{ required "You must set the version of Postgres to deploy." .Values.postgresVersion }}
  {{- if .Values.postGISVersion }}
  postGISVersion: {{ quote .Values.postGISVersion }}
  {{- end }}
  {{- if .Values.imagePostgres }}
  image: {{ .Values.imagePostgres | quote }}
  {{- end }}
  {{- if .Values.port }}
  port: {{ .Values.port }}
  {{- end }}
  {{- if .Values.instances }}
  instances:
{{ toYaml .Values.instances | indent 4 }}
  {{- else }}
  instances:
    - name: {{ default "instance1" .Values.instanceName | quote }}
      replicas: {{ default 1 .Values.instanceReplicas }}
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: {{ default "1Gi" .Values.instanceSize | quote }}
      {{- if or .Values.instanceMemory .Values.instanceCPU }}
      resources:
        limits:
          cpu: {{ default "" .Values.instanceCPU | quote }}
          memory: {{ default "" .Values.instanceMemory | quote }}
      {{- end }}
  {{- end }}
  backups:
    pgbackrest:
      {{- if .Values.imagePgBackRest }}
      image: {{ .Values.imagePgBackRest | quote }}
      {{- end }}
      {{- if .Values.pgBackRestConfig }}
{{ toYaml .Values.pgBackRestConfig | indent 6 }}
      {{- else if .Values.multiBackupRepos }}
      configuration:
      - secret:
          name: {{ default .Release.Name .Values.name }}-pgbackrest-secret
      global:
        {{- range $index, $repo := .Values.multiBackupRepos }}
        {{- if or $repo.s3 $repo.gcs $repo.azure }}
        repo{{ add $index 1 }}-path: /pgbackrest/{{ $.Release.Namespace }}/{{ default $.Release.Name $.Values.name }}/repo{{ add $index 1 }}
        {{- end }}
        {{- end }}
      repos:
      {{- range $index, $repo := .Values.multiBackupRepos }}
      - name: repo{{ add $index 1 }}
        {{- if $repo.volume }}
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: {{ default "1Gi" $repo.volume.backupsSize | quote }}
        {{- else if $repo.s3 }}
        s3:
          bucket: {{ $repo.s3.bucket | quote }}
          endpoint: {{ $repo.s3.endpoint | quote }}
          region: {{ $repo.s3.region | quote }}
        {{- else if $repo.gcs }}
        gcs:
          bucket: {{ $repo.gcs.bucket | quote }}
        {{- else if $repo.azure }}
        azure:
          container: {{ $repo.azure.container | quote }}
        {{- end }}
      {{- end }}
      {{- else if .Values.s3 }}
      configuration:
      - secret:
          name: {{ default .Release.Name .Values.name }}-pgbackrest-secret
      global:
        repo1-path: /pgbackrest/{{ .Release.Namespace }}/{{ default .Release.Name .Values.name }}/repo1
        {{- if .Values.s3.encryptionPassphrase }}
        repo1-cipher-type: aes-256-cbc
        {{- end }}
      repos:
      - name: repo1
        s3:
          bucket: {{ .Values.s3.bucket | quote }}
          endpoint: {{ .Values.s3.endpoint | quote }}
          region: {{ .Values.s3.region | quote }}
      {{- else if .Values.gcs }}
      configuration:
      - secret:
          name: {{ default .Release.Name .Values.name }}-pgbackrest-secret
      global:
        repo1-path: /pgbackrest/{{ .Release.Namespace }}/{{ default .Release.Name .Values.name }}/repo1
      repos:
      - name: repo1
        gcs:
          bucket: {{ .Values.gcs.bucket | quote }}
      {{- else if .Values.azure }}
      configuration:
      - secret:
          name: {{ default .Release.Name .Values.name }}-pgbackrest-secret
      global:
        repo1-path: /pgbackrest/{{ .Release.Namespace }}/{{ default .Release.Name .Values.name }}/repo1
      repos:
      - name: repo1
        azure:
          container: {{ .Values.azure.container | quote }}
      {{- else }}
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: {{ default "1Gi" .Values.backupsSize | quote }}
      {{- end }}
  {{- if or .Values.pgBouncerReplicas .Values.pgBouncerConfig }}
  proxy:
    pgBouncer:
      {{- if .Values.imagePgBouncer }}
      image: {{ .Values.imagePgBouncer | quote }}
      {{- end }}
      {{- if .Values.pgBouncerConfig }}
{{ toYaml .Values.pgBouncerConfig | indent 6 }}
      {{- else }}
      replicas: {{ .Values.pgBouncerReplicas }}
      {{- end }}
  {{- end }}
  {{- if .Values.patroni }}
  patroni:
{{ toYaml .Values.patroni | indent 4 }}
  {{- end }}
  {{- if .Values.users }}
  users:
{{ toYaml .Values.users | indent 4 }}
  {{- end }}
  {{- if .Values.service }}
  service:
{{ toYaml .Values.service | indent 4 }}
  {{- end }}
  {{- if .Values.dataSource }}
  dataSource:
{{ toYaml .Values.dataSource | indent 4 }}
  {{- end }}
  {{- if .Values.databaseInitSQL }}
  databaseInitSQL:
    name: {{ required "A ConfigMap name is required for running bootstrap SQL." .Values.databaseInitSQL.name | quote }}
    key: {{ required "A key in a ConfigMap containing any bootstrap SQL is required." .Values.databaseInitSQL.key | quote }}
  {{- end }}
  {{- if .Values.imagePullPolicy }}
  imagePullPolicy: {{ .Values.imagePullPolicy | quote }}
  {{- end }}
  {{- if .Values.imagePullSecrets }}
  imagePullSecrets:
{{ toYaml .Values.imagePullSecrets | indent 4 }}
  {{- end }}
  {{- if .Values.disableDefaultPodScheduling }}
  disableDefaultPodScheduling: true
  {{- end }}
  {{- if .Values.metadata }}
  metadata:
{{ toYaml .Values.metadata | indent 4 }}
  {{- end }}
  {{- if .Values.monitoring }}
  monitoring:
    pgmonitor:
      exporter:
        image: {{ default "" .Values.imageExporter | quote }}
        {{- if .Values.monitoringConfig }}
{{ toYaml .Values.monitoringConfig | indent 8 }}
        {{- end }}
  {{- end }}
  {{- if .Values.shutdown }}
  shutdown: true
  {{- end }}
  {{- if .Values.standby }}
  standby:
    enabled: {{ .Values.standby.enabled }}
    repoName: {{ required "repoName must be set when enabling standby mode." .Values.standby.repoName }}
  {{- end }}
  {{- if .Values.supplementalGroups }}
  supplementalGroups:
{{ toYaml .Values.supplementalGroups | indent 4 }}
  {{- end }}
  {{- if .Values.openshift }}
  openshift: true
  {{- else if eq .Values.openshift false }}
  openshift: false
  {{- end }}
  {{- if .Values.customTLSSecret }}
  customTLSSecret:
{{ toYaml .Values.customTLSSecret | indent 4 }}
  {{- end }}
  {{- if .Values.customReplicationTLSSecret }}
  customReplicationTLSSecret:
{{ toYaml .Values.customReplicationTLSSecret | indent 4 }}
  {{- end }}

I'm not 100% sure that this is an IPv6 issue, but since that's the only difference from my lab environment, it's the first thing I tried to troubleshoot.

cbandy commented 1 year ago

> Added the annotation for IPv6 in my templates/postgres.yaml

The annotation solution is merged but hasn't been released yet.

> Created the following Config Map:

This is the workaround described in https://github.com/CrunchyData/postgres-operator/issues/3286#issuecomment-1261646438.

It works with the PGO you have, but there's one more step to finish it out. Mention the new ConfigMap in your values.yaml:

pgBackRestConfig:
 configuration:                         # 👈 These two lines
  - configMap: { name: my-pgbr-config } # 👈
 global:
#   server: 
#    tls-server-address: "::"
#   server-ping:
#    tls-server-address: localhost
  repo1-retention-full: "1"
  repo1-retention-full-type: count
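
As a rough illustration of what this change does: the chart template quoted earlier in this thread renders pgBackRestConfig under spec.backups.pgbackrest via toYaml, so the relevant part of the resulting PostgresCluster would presumably look something like the sketch below. This is an assumption based on that template, not output captured from a real render, and the repos and other sections are omitted.

```yaml
# Sketch of the expected rendered fragment (assumed, not captured from `helm template`).
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pg14-prod
spec:
  backups:
    pgbackrest:
      configuration:
        - configMap:
            name: my-pgbr-config   # the ConfigMap holding pgbr-custom.conf
      global:
        repo1-retention-full: "1"
        repo1-retention-full-type: count
```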

Let me know how it goes!

benjaminjb commented 1 year ago

Hi @scriptac, did the above help to get your replicas working as expected in the IPv6 environment?