CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Bug: Updated image is not applied in cluster #3026

Closed Richard87 closed 2 years ago

Richard87 commented 2 years ago

Overview

I changed the image from registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0 to registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.1-0 in a test cluster, but nothing changed.

Since it was a single-instance database (no replicas), I changed `replicas` to 2, hoping it would trigger a rolling update, but nothing changed here either, and no secondary instance was brought up.

EDIT I think this line from the operator log is the most telling, but I don't know what to do about it:

unable to find instance name for pgBackRest restore Job

time="2022-02-09T12:12:21Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1355" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=richard reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.3-0

/EDIT
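One way to dig further into this error (a sketch; the resource and namespace names come from this report, but the exact restore Job name would need to be read from the `get jobs` output) is to inspect the cluster's status conditions and any leftover pgBackRest restore Job:

```shell
# Show the PostgresCluster's status conditions (e.g. PostgresDataInitialized)
kubectl -n richard describe postgrescluster eportaldb

# List Jobs in the namespace; a failed bootstrap restore Job may still be present
kubectl -n richard get jobs

# If a restore Job exists, its logs usually explain why reconciliation stalled
kubectl -n richard logs job/<restore-job-name>
```

`<restore-job-name>` is a placeholder for whatever Job name the previous command reports.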

Environment

EXPECTED

  1. Update the container image
  2. Create a new replica

ACTUAL

  1. No changes to image
  2. Replica not created

Logs

Postgres:

bash-4.4$ cat postgresql-Wed.log 
2022-02-09 11:38:46.967 UTC [2100437] LOG:  received fast shutdown request
2022-02-09 11:38:46.971 UTC [2100437] LOG:  aborting any active transactions
2022-02-09 11:38:46.976 UTC [2100499] FATAL:  terminating connection due to administrator command
2022-02-09 11:38:46.982 UTC [2100437] LOG:  background worker "logical replication launcher" (PID 2100781) exited with exit code 1
2022-02-09 11:38:46.983 UTC [2100475] LOG:  shutting down
2022-02-09 11:38:47.258 UTC [2451631] FATAL:  the database system is shutting down
2022-02-09 11:38:47.260 UTC [2451632] FATAL:  the database system is shutting down
2022-02-09 11:38:47.420 UTC [2451634] FATAL:  the database system is shutting down
2022-02-09 11:38:47.422 UTC [2451635] FATAL:  the database system is shutting down
2022-02-09 11:38:47.436 UTC [2451636] FATAL:  the database system is shutting down
2022-02-09 11:38:47.438 UTC [2451637] FATAL:  the database system is shutting down
2022-02-09 11:38:47.445 UTC [2451638] FATAL:  the database system is shutting down
2022-02-09 11:38:47.447 UTC [2451639] FATAL:  the database system is shutting down
2022-02-09 11:38:47.459 UTC [2451640] FATAL:  the database system is shutting down
2022-02-09 11:38:47.461 UTC [2451641] FATAL:  the database system is shutting down
2022-02-09 11:38:47.471 UTC [2451642] FATAL:  the database system is shutting down
2022-02-09 11:38:47.476 UTC [2451643] FATAL:  the database system is shutting down
2022-02-09 11:38:49.646 UTC [2451644] FATAL:  the database system is shutting down
2022-02-09 11:38:49.649 UTC [2451645] FATAL:  the database system is shutting down
2022-02-09 11:38:49.658 UTC [2451646] FATAL:  the database system is shutting down
2022-02-09 11:38:49.660 UTC [2451647] FATAL:  the database system is shutting down
2022-02-09 11:38:49.969 UTC [2100437] LOG:  database system is shut down
2022-02-09 11:39:18.080 UTC [86] LOG:  starting PostgreSQL 14.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2022-02-09 11:39:18.081 UTC [86] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-02-09 11:39:18.081 UTC [86] LOG:  listening on IPv6 address "::", port 5432
2022-02-09 11:39:18.088 UTC [86] LOG:  listening on Unix socket "/tmp/postgres/.s.PGSQL.5432"
2022-02-09 11:39:18.158 UTC [89] LOG:  database system was shut down at 2022-02-09 11:38:47 UTC
2022-02-09 11:39:18.175 UTC [91] FATAL:  the database system is starting up
2022-02-09 11:39:18.254 UTC [94] FATAL:  the database system is starting up
2022-02-09 11:39:18.537 UTC [89] LOG:  entering standby mode
2022-02-09 11:39:18.851 UTC [89] LOG:  restored log file "0000000E.history" from archive
2022-02-09 11:39:19.113 UTC [89] LOG:  consistent recovery state reached at D/2E0000A0
2022-02-09 11:39:19.113 UTC [89] LOG:  invalid record length at D/2E0000A0: wanted 24, got 0
2022-02-09 11:39:19.114 UTC [86] LOG:  database system is ready to accept read-only connections
2022-02-09 11:39:19.694 UTC [89] LOG:  received promote request
2022-02-09 11:39:19.694 UTC [89] LOG:  redo is not required
2022-02-09 11:39:22.625 UTC [89] LOG:  selected new timeline ID: 15
2022-02-09 11:39:22.757 UTC [89] LOG:  archive recovery complete
2022-02-09 11:39:24.880 UTC [89] LOG:  restored log file "0000000E.history" from archive
2022-02-09 11:39:24.904 UTC [86] LOG:  database system is ready to accept connections

Postgres 2:

2022-02-09 11:39:17,397 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-02-09 11:39:17,434 WARNING: Postgresql is not running.
2022-02-09 11:39:17,434 INFO: Lock owner: None; I am eportaldb-main-77fb-0
2022-02-09 11:39:17,443 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202107181
  Database system identifier: 7034087418294739021
  Database cluster state: shut down
  pg_control last modified: Wed Feb  9 11:38:47 2022
  Latest checkpoint location: D/2E000028
  Latest checkpoint's REDO location: D/2E000028
  Latest checkpoint's REDO WAL file: 0000000E0000000D0000002E
  Latest checkpoint's TimeLineID: 14
  Latest checkpoint's PrevTimeLineID: 14
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:342409
  Latest checkpoint's NextOID: 74262
  Latest checkpoint's NextMultiXactId: 1540
  Latest checkpoint's NextMultiOffset: 3242
  Latest checkpoint's oldestXID: 726
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Wed Feb  9 11:38:47 2022
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: 8722ec43373144b9e2c496f251e0d11dc91d891c8c53c30071e42a2d1352059a

2022-02-09 11:39:17,470 INFO: Lock owner: None; I am eportaldb-main-77fb-0
2022-02-09 11:39:17,527 INFO: starting as a secondary
2022-02-09 11:39:18.056 UTC [86] LOG:  pgaudit extension initialized
2022-02-09 11:39:18.080 UTC [86] LOG:  redirecting log output to logging collector process
2022-02-09 11:39:18.080 UTC [86] HINT:  Future log output will appear in directory "log".
2022-02-09 11:39:18,152 INFO: postmaster pid=86
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - accepting connections
2022-02-09 11:39:19,456 INFO: establishing a new patroni connection to the postgres cluster
2022-02-09 11:39:19,643 INFO: promoted self to leader by acquiring session lock
server promoting
2022-02-09 11:39:19,660 INFO: cleared rewind state after becoming the leader
2022-02-09 11:39:19,656 INFO: Lock owner: eportaldb-main-77fb-0; I am eportaldb-main-77fb-0
2022-02-09 11:39:19,711 INFO: updated leader lock during promote
2022-02-09 11:39:26,008 INFO: no action. I am (eportaldb-main-77fb-0) the leader with the lock
...
2022-02-09 11:48:36,522 INFO: no action. I am (eportaldb-main-77fb-0) the leader with the lock

PGO:

time="2022-02-09T11:51:24Z" level=debug msg="debug flag set to true" file="cmd/postgres-operator/main.go:62" func=main.main version=5.0.4-0
I0209 11:51:25.434537       1 request.go:655] Throttling request took 1.356870736s, request: GET:https://10.56.0.1:443/apis/postgres-operator.crunchydata.com/v1beta1?timeout=32s
time="2022-02-09T11:51:26Z" level=info msg="metrics server is starting to listen" addr=":8080" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:130" func="log.(*DelegatingLogger).Info" version=5.0.4-0
time="2022-02-09T11:51:28Z" level=info msg="starting controller runtime manager and will wait for signal to exit" file="cmd/postgres-operator/main.go:83" func=main.main version=5.0.4-0
time="2022-02-09T11:51:28Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:28Z" level=info msg="starting metrics server" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/manager/internal.go:385" func="manager.(*controllerManager).serveMetrics.func2" path=/metrics version=5.0.4-0
time="2022-02-09T11:51:28Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:28Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:28Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:30Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:32Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:33Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:33Z" level=info msg="Starting EventSource" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:165" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster source="kind source: /, Kind=" version=5.0.4-0
time="2022-02-09T11:51:33Z" level=info msg="Starting Controller" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:173" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:33Z" level=info msg="Starting workers" file="sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:211" func="controller.(*Controller).Start.func1" reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0 worker count=2
time="2022-02-09T11:51:33Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=maja reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:33Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=richard reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=debug msg="replaced configuration" file="internal/patroni/api.go:86" func=patroni.Executor.ReplaceConfiguration name=eportaldb namespace=eportal reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster stderr= stdout="Not changed\n" version=5.0.4-0
time="2022-02-09T11:51:34Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=eportaldb-main-qrfh name=eportaldb namespace=eportal reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=eportaldb-main-77fb name=eportaldb namespace=eportal reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=debug msg="reconciled instance set" file="internal/controller/postgrescluster/instance.go:988" func="postgrescluster.(*Reconciler).scaleUpInstances" instance-set=main name=eportaldb namespace=eportal reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=debug msg="reconciled cluster" file="internal/controller/postgrescluster/controller.go:299" func="postgrescluster.(*Reconciler).Reconcile" name=eportaldb namespace=eportal reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=debug msg="patched cluster status" file="internal/controller/postgrescluster/controller.go:171" func="postgrescluster.(*Reconciler).Reconcile.func2" name=eportaldb namespace=eportal reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=maja reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=richard reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:34Z" level=debug msg="replaced configuration" file="internal/patroni/api.go:86" func=patroni.Executor.ReplaceConfiguration name=eportaldb namespace=morten reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster stderr= stdout="Not changed\n" version=5.0.4-0
time="2022-02-09T11:51:35Z" level=debug msg="reconciled instance" file="internal/controller/postgrescluster/instance.go:1094" func="postgrescluster.(*Reconciler).reconcileInstance" instance=eportaldb-main-77fb name=eportaldb namespace=morten reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=debug msg="reconciled instance set" file="internal/controller/postgrescluster/instance.go:988" func="postgrescluster.(*Reconciler).scaleUpInstances" instance-set=main name=eportaldb namespace=morten reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=debug msg="reconciled cluster" file="internal/controller/postgrescluster/controller.go:299" func="postgrescluster.(*Reconciler).Reconcile" name=eportaldb namespace=morten reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=maja reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=richard reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=maja reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=richard reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=maja reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=richard reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=maja reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0
time="2022-02-09T11:51:35Z" level=error msg="Reconciler error" error="unable to find instance name for pgBackRest restore Job" file="internal/controller/postgrescluster/pgbackrest.go:1367" func="postgrescluster.(*Reconciler).reconcilePostgresClusterDataSource" name=eportaldb namespace=richard reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.4-0

Additional Information

The cluster definition:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: {...}
  creationTimestamp: '2021-12-17T16:54:08Z'
  finalizers:
    - postgres-operator.crunchydata.com/finalizer
  generation: 6
  managedFields: ...
  name: eportaldb
  namespace: richard
  resourceVersion: '571875251'
  uid: 726145b6-c57c-4892-8796-2142a04633a8
  selfLink: >-
    /apis/postgres-operator.crunchydata.com/v1beta1/namespaces/richard/postgresclusters/eportaldb
status:
  conditions:
    - lastTransitionTime: '2022-01-06T20:33:46Z'
      message: pgBackRest restore failed
      observedGeneration: 1
      reason: PGBackRestRestoreFailed
      status: 'False'
      type: PostgresDataInitialized
    - lastTransitionTime: '2021-12-17T16:55:50Z'
      message: pgBackRest dedicated repository host is ready
      observedGeneration: 1
      reason: RepoHostReady
      status: 'True'
      type: PGBackRestRepoHostReady
    - lastTransitionTime: '2021-12-17T16:55:52Z'
      message: pgBackRest replica create repo is ready for backups
      observedGeneration: 1
      reason: StanzaCreated
      status: 'True'
      type: PGBackRestReplicaRepoReady
    - lastTransitionTime: '2021-12-17T16:56:41Z'
      message: pgBackRest replica creation is now possible
      observedGeneration: 1
      reason: RepoBackupComplete
      status: 'True'
      type: PGBackRestReplicaCreate
    - lastTransitionTime: '2021-12-17T16:55:36Z'
      message: Deployment has minimum availability.
      observedGeneration: 1
      reason: MinimumReplicasAvailable
      status: 'True'
      type: ProxyAvailable
  databaseRevision: 59ddcd657
  instances:
    - name: main
      readyReplicas: 1
      replicas: 1
      updatedReplicas: 1
  monitoring:
    exporterConfiguration: 559c4c97d6
  observedGeneration: 1
  patroni:
    systemIdentifier: '7034087418294739021'
  pgbackrest:
    repoHost:
      apiVersion: apps/v1
      kind: StatefulSet
      ready: true
    repos:
      - bound: true
        name: repo1
        replicaCreateBackupComplete: true
        stanzaCreated: true
        volume: pvc-ac433c51-01a9-48be-9713-f30eaf439a65
    restore:
      active: 1
      finished: true
      id: ~pgo-bootstrap-eportaldb
      startTime: '2022-01-07T12:02:22Z'
  proxy:
    pgBouncer:
      postgresRevision: b7675cf8b
      readyReplicas: 1
      replicas: 1
  usersRevision: 64b84479fd
spec:
  backups:
    pgbackrest:
      image: >-
        registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.35-0
      repos:
        - name: repo1
          schedules:
            incremental: 0 * * * *
          volume:
            volumeClaimSpec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
  dataSource:
    postgresCluster:
      clusterName: eportaldb
      clusterNamespace: eportal
      repoName: repo1
  image: >-
    registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.1-0
  instances:
    - affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    io.kompose.service: eportal
                topologyKey: kubernetes.io/hostname
              weight: 100
      dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
      name: main
      replicas: 2
      resources:
        limits:
          cpu: 500m
          memory: 500Mi
        requests:
          cpu: 500m
          memory: 500Mi
  port: 5432
  postgresVersion: 14
  proxy:
    pgBouncer:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    postgres-operator.crunchydata.com/cluster: eportaldb
                    postgres-operator.crunchydata.com/role: pgbouncer
                topologyKey: kubernetes.io/hostname
              weight: 1
      image: >-
        registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-1.16-0
      port: 5432
      replicas: 1
tjmoore4 commented 2 years ago

@Richard87 could you provide the specific steps, commands, etc. that you used to create the cluster in question? It appears you may have been attempting a clone across namespaces, and the errors may be related to an initial failure of the restore process.

Richard87 commented 2 years ago

Hi @tjmoore4! Yes, we are cloning across namespaces, and that works great! (But maybe it shouldn't, for security purposes?)

To create a cluster, we apply this YAML, often many times, to reset the staging cluster as a clone of production:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: eportaldb
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.0-0
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:centos8-2.35-0
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
  instances:
    - dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    io.kompose.service: eportal
                topologyKey: kubernetes.io/hostname
              weight: 100
      name: main
      resources:
        limits:
          memory: 500Mi
          cpu: 500m
        requests:
          memory: 500Mi
          cpu: 500m
      replicas: 1
  postgresVersion: 14
  dataSource:
    postgresCluster:
      repoName: repo1
      clusterName: eportaldb
      clusterNamespace: eportal
  proxy:
    pgBouncer:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:centos8-1.16-0
      port: 5432
      replicas: 1
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    postgres-operator.crunchydata.com/cluster: eportaldb
                    postgres-operator.crunchydata.com/role: pgbouncer
                topologyKey: kubernetes.io/hostname
              weight: 1
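Assuming the manifest above is saved as `eportaldb-clone.yaml` (a hypothetical filename), the repeated staging-reset workflow described here would look something like:

```shell
# Tear down the previous staging clone (namespace taken from this report)
kubectl -n richard delete postgrescluster eportaldb

# Re-create it; spec.dataSource bootstraps it from the production repo1 backups
kubectl -n richard apply -f eportaldb-clone.yaml

# Watch the bootstrap restore and instance pods come up
kubectl -n richard get pods -w
```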
jmckulk commented 2 years ago

Hey @Richard87, it sounds like you are leaving the DataSource in your spec after you complete the clone. Is this correct?

We recommend removing that section from your spec after the clone has completed. If you are leaving it in, try to perform a clone, remove or comment out the DataSource section, then try to replicate this issue. Make sure to check that the cluster is in a healthy state between steps. Please try this and let us know if you continue to run into this issue.
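In YAML terms, the suggestion is to drop (or comment out) the `dataSource` block once the clone has finished. A sketch, based on the spec earlier in this issue:

```yaml
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:centos8-14.1-0
  postgresVersion: 14
  # Remove (or comment out) this section after the clone completes, so later
  # reconciles don't try to re-run the bootstrap restore:
  # dataSource:
  #   postgresCluster:
  #     repoName: repo1
  #     clusterName: eportaldb
  #     clusterNamespace: eportal
```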

Richard87 commented 2 years ago

Thanks, something seems to have changed, and the operator does trigger the requested changes now (even with the data source left in).

I managed to upgrade the staging cluster to 14.1, and it seems scheduled backups work as expected as well!

I will run some tests on the production cluster next weekend and see if it works just as well!

jmckulk commented 2 years ago

Sounds good! I'll go ahead and close this issue but feel free to re-open if you run into any issues.

Richard87 commented 2 years ago

Hi @jmckulk I had the same error again today and wonder if this issue should be reopened.

When starting to update the cluster to v5.1 (changing the image versions in use), reconciliation failed with the same error as before. I also had the same dataSource active in the YAML, but I don't think that was the reason for the failure.

I think it was because the cluster didn't have any backups (no schedule set up for repo1), and it therefore failed to reconcile the updated image, with the error unable to find instance name for pgBackRest restore Job.

My solution was to delete the cluster and re-create it with the same name (this time with a backup schedule configured!).

So, I think the operator should create a one-off backup if the error occurs, and use that newly created backup to continue setting up the new cluster.

Also, if possible, that failure should not stop other changes from being completed, but it might be incredibly complex to cover all cases!
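For reference, PGO v5 can already take a one-off backup on demand via the manual backup spec plus a trigger annotation (a sketch; the annotation value just needs to change each time a new backup is wanted, so a timestamp is convenient):

```yaml
spec:
  backups:
    pgbackrest:
      manual:
        repoName: repo1
        options:
          - --type=full
```

With that in the spec, the backup is triggered by annotating the cluster, e.g. `kubectl -n richard annotate postgrescluster eportaldb postgres-operator.crunchydata.com/pgbackrest-backup="$(date)" --overwrite`.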