CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

2 replicas fail when using s3 with pgbackrest #3639

Closed: joyartoun closed this issue 8 months ago

joyartoun commented 1 year ago

Overview

Hello, I am using the operator on OpenShift and noticed some issues when deploying a PostgresCluster with pgBackRest configured for S3.

  1. If I deploy 2 replicas, one replica never reaches the ready state and repeatedly logs errors (see below).
  2. The pgBackRest pods and containers are never created.

Environment

Steps to Reproduce

Apply the CR below in an OpenShift 4.11 cluster running postgres-operator v5.3.0 with S3 storage enabled.

Problem occurs when using the following postgresCluster definition:

---
# Source: applications/templates/postgresInstance.yaml
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: joy
  namespace: joy
  finalizers:
    - postgres-operator.crunchydata.com/finalizer
spec:
  instances:
    - dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi
      metadata:
        annotations:
          sidecar.istio.io/inject: 'false'
        labels:
          postgresInstanceName: instance1
      name: instance1
      replicas: 2
  postgresVersion: 14
  supplementalGroups:
    - 65534
  port: 5432
  users:
  - databases:
    - <redacted>
    name: <redacted>
  - databases:
    - <redacted>
    name: <redacted>
  - databases:
    - testdb
    name: testuser
  - name: <redacted>
    options: SUPERUSER
  monitoring:
    pgmonitor:
      exporter:
        image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.3.0-0
  userInterface:
    pgAdmin:
      dataVolumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
      image: >-
        registry.developers.crunchydata.com/crunchydata/crunchy-pgadmin4:ubi8-4.30-10
      metadata:
        annotations:
          sidecar.istio.io/inject: 'false'
        labels:
          postgresInstanceName: instance1
      replicas: 1
  backups:
    pgbackrest:
      global:
        archive-push-queue-max: 4G
        repo1-path: /pgbackrest/instance1/repo1
        repo1-s3-uri-style: path
        repo1-storage-verify-tls: 'n'
        repo1-storage-port: '9000'
      configuration:
      - secret:
          name: pgo-s3-creds-instance1-repo1
      manual:
        options:
          - '--type=full'
        repoName: repo1
      metadata:
        annotations:
          sidecar.istio.io/inject: 'false'
        labels:
          postgresInstanceName: instance1
      repos:
        - name: repo1
          schedules: 
            full: 0 6 * * *
            incremental: 0 */4 * * *
          s3:
            bucket: pgo-bucket
            endpoint: minio.example.com
            region: minio

REPRO

Apply the CR above in an OpenShift 4.11 cluster running postgres-operator v5.3.0 with S3 storage enabled.

EXPECTED

  1. Both replicas come up and backups are sent to the S3 bucket.

ACTUAL

  1. One replica repeatedly logs errors and never comes up. Neither manual nor scheduled backups work.

Logs

2023-04-26 15:29:33,332 INFO: no action. I am (joy-instance1-qzjw-0), a secondary, and following a leader (joy-instance1-ctbc-0)
/tmp/postgres:5432 - rejecting connections
2023-04-26 15:29:43,230 INFO: Lock owner: joy-instance1-ctbc-0; I am joy-instance1-qzjw-0
2023-04-26 15:29:43,230 INFO: Still starting up as a standby.
2023-04-26 15:29:43,230 INFO: Lock owner: joy-instance1-ctbc-0; I am joy-instance1-qzjw-0
2023-04-26 15:29:43,231 INFO: establishing a new patroni connection to the postgres cluster
2023-04-26 15:29:43,983 INFO: establishing a new patroni connection to the postgres cluster
2023-04-26 15:29:43,985 WARNING: Retry got exception: 'connection problems'
2023-04-26 15:29:43,985 WARNING: Failed to determine PostgreSQL state from the connection, falling back to cached role
2023-04-26 15:29:43,986 INFO: no action. I am (joy-instance1-qzjw-0), a secondary, and following a leader (joy-instance1-ctbc-0)

In pgdata/pg14/log/ I see the following:

2023-04-26 15:30:36.298 UTC [527] FATAL:  the database system is starting up
2023-04-26 15:30:38.309 UTC [536] FATAL:  the database system is starting up
2023-04-26 15:30:40.320 UTC [539] FATAL:  the database system is starting up
2023-04-26 15:30:42.331 UTC [544] FATAL:  the database system is starting up
2023-04-26 15:30:43.204 UTC [547] FATAL:  the database system is starting up
2023-04-26 15:30:43.207 UTC [548] FATAL:  the database system is starting up

The following is written in the operator log regarding this postgres cluster:

time="2023-04-26T14:29:42Z" level=debug msg="skipping SSH reconciliation, no repo hosts configured" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster name=joy namespace=joy postgresCluster=joy/joy reconcileID=d560333f-476f-4781-a370-6791eb301393 reconcileResource=repoConfig version=5.3.0-0
time="2023-04-26T14:31:43Z" level=error msg="unable to create stanza" controller=postgrescluster controllerGroup=postgres-operator.crunchydata.com controllerKind=PostgresCluster error="command terminated with exit code 101: ERROR: [101]: TLS error [1:336130315] wrong version number\n" file="internal/controller/postgrescluster/pgbackrest.go:2584" func="postgrescluster.(*Reconciler).reconcileStanzaCreate" name=joy namespace=joy postgresCluster=joy/joy reconcileID=d560333f-476f-4781-a370-6791eb301393 reconciler=pgBackRest version=5.3.0-0
time="2023-04-26T14:31:43Z" level=debug msg=Warning message="command terminated with exit code 101: ERROR: [101]: TLS error [1:336130315] wrong version number\n" object="{PostgresCluster joy joy 35f135f2-2cc8-4a83-af00-fed881edba40 postgres-operator.crunchydata.com/v1beta1 335482190 }" reason=UnableToCreateStanzas version=5.3.0-0

Please advise.

joyartoun commented 1 year ago

Hi, I saw another issue suggesting the problem was related to MinIO not having SSL enabled, and that was indeed the cause. Is it possible to configure pgBackRest to connect to the S3 instance without SSL?

tony-landreth commented 1 year ago

Hi @joyartoun! I don't think it's possible to run pgBackRest without TLS (see here). If you don't mind me asking, what's preventing the MinIO instance from having TLS enabled?
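If enabling TLS on the MinIO side is an option, even a self-signed certificate should be enough for pgBackRest to connect, since the CR above already sets `repo1-storage-verify-tls: 'n'`. A minimal sketch (the paths and the `minio.example.com` CN are assumptions; a default MinIO install reads `public.crt`/`private.key` from its certs directory, e.g. `~/.minio/certs`):

```shell
# Generate a self-signed cert/key pair for the MinIO endpoint.
CERT_DIR="$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=minio.example.com" \
  -keyout "$CERT_DIR/private.key" \
  -out "$CERT_DIR/public.crt"

# Copy the pair into MinIO's certs directory (path depends on your
# deployment) and restart MinIO so it serves HTTPS on port 9000.
ls "$CERT_DIR"
```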

tjmoore4 commented 8 months ago

Since we haven't heard back on this issue for some time, I am closing this issue. If you need further assistance, feel free to re-open this issue or ask a question in our Discord server.