
[bitnami/postgresql-ha] pgpool unable to get the available postgres primary node #30656

Open sahanasreenath opened 12 hours ago

sahanasreenath commented 12 hours ago

Name and Version

bitnami/postgresql-repmgr:12.20.0-debian-12-r29

What architecture are you using?

None

What steps will reproduce the bug?

In GKE, I have pgpool with 1 replica and a PostgreSQL StatefulSet with 3 replicas.

Here are my Helm chart values:

postgresql-ha:
  # -- Set the name to indicate this is part of PCS (note release name will be added)
  nameOverride: pcs
  # -- PostgreSQL persistence configuration
  persistence:
    # -- PVC Storage Request for PostgreSQL volume
    size: 250Gi

  # -- (map) Parameters for `postgresql`
  # @default -- i-documented-this
  postgresql:
    # -- PostgreSQL username
    username: datarobot
    # -- PostgreSQL database
    database: datarobot
    # password: # left blank, will be auto-generated
    extraEnvVarsSecret: pcs-postgresql-initdb-cfg
    # -- PostgreSQL password using existing secret
    existingSecret: pcs-postgresql
    # -- Set number of replicas.
    replicaCount: 3
    # Additional pg extensions
    sharedPreloadLibraries: "pgcrypto, pgaudit, repmgr"
    # -- Set repmgr extension upgrade to false
    upgradeRepmgrExtension: false
    repmgrConnectTimeout: 3
    repmgrFenceOldPrimary: true
    # Override to use custom image / registry
    image:
      # -- Enable debug logs
      debug: true
      # -- PostgreSQL image registry
      registry: docker.io
      # -- PostgreSQL image repository
      repository: bitnami/postgresql-repmgr
      # -- PostgreSQL image tag
      tag: 12.20.0-debian-12-r29

    # -- (map) Set resource [requests and limits](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) for postgresql containers
    # @default -- i-documented-this
    resources:
      requests:
        cpu: "4"
        memory: "4Gi"
      limits:
        cpu: "6"
        memory: "8Gi"

    pdb:
      # -- create a Pod disruption budget for PostgreSQL with Repmgr
      create: false

    # -- Make repmgr use password files instead of environment variables
    repmgrUsePassfile: true

    # -- Set maximum total connections
    maxConnections: 600
    # -- (map) Set [pod affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/)
    # Prefer to separate pods having the "postgresql" label.
    # @default -- i-documented-this
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 90
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                - postgresql
            topologyKey: topology.kubernetes.io/zone
        - weight: 10
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                - postgresql
            topologyKey: kubernetes.io/hostname

  # -- (map) Parameters for `pgpool`
  # @default -- i-documented-this
  pgpool:
    # -- Config parameters to get the right primary when there is a switchover
    configuration: |-
      failover_on_backend_error = 'on'
      search_primary_node_timeout = 0
    # -- Use Pgpool Load-Balancing
    useLoadBalancing: false
    # -- Time in seconds to disconnect a client if it remains idle since the last query (PGPOOL_CLIENT_IDLE_LIMIT).
    clientIdleLimit: 30000
    # -- Maximum number of client connections in each child process (PGPOOL_CHILD_MAX_CONNECTIONS).
    childMaxConnections: 0
    # -- Time in seconds to terminate the cached connections to the PostgreSQL backend (PGPOOL_CONNECTION_LIFE_TIME).
    connectionLifeTime: 0
    # -- The number of preforked Pgpool-II server processes. It is also the concurrent connections limit to Pgpool-II from clients.
    # Must be a positive integer. (PGPOOL_NUM_INIT_CHILDREN)
    numInitChildren: 96
    # -- The maximum number of cached connections in each child process (PGPOOL_MAX_POOL).
    maxPool: 2
    # -- Name of a secret containing the usernames and passwords of accounts that will be added to pgpool_passwd.
    customUsersSecret: pcs-pgpool-custom-users
    # -- Pgpool admin password using existing secret
    existingSecret: pcs-pgpool
    image:
      # Override to use custom image / registry
      registry: docker.io
      repository: bitnami/pgpool
      tag: 4.5.4-debian-12-r7
    pdb:
      # -- Set to create a pod disruption budget
      create: false
    # -- Set the number of replicas
    replicaCount: 3
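
For reference, a quick way to confirm that the pgpool configuration overrides above actually land in the running pod is to grep the rendered config. This is a diagnostic sketch: the deploy/pcs-pgpool name is inferred from the pod names further down, and the config path is the standard location in the Bitnami pgpool image.

# Verify the failover-related overrides in the rendered pgpool.conf
kubectl exec deploy/pcs-pgpool -- \
  grep -E 'failover_on_backend_error|search_primary_node_timeout' \
  /opt/bitnami/pgpool/conf/pgpool.conf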

During a helm upgrade, new pods come up and an old pod should only go away once its replacement is healthy. Instead, the upgrade simply fails with an error:

liveness probe errored: command healthcheck.sh timed out unhealthy

The new pgpool pod never becomes healthy:

pcs-pgpool-65d957b987-bj7zt                                       1/1     Running            1 (5h12m ago)    5h14m
pcs-pgpool-689fdfb8f4-5zhlx                                       0/1     Running      

After more than an hour, the non-running pod vanished and a new pod spun up that was healthy.
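
While a pod is stuck like this, the probe failures and pgpool's own errors are usually visible directly; a minimal sketch, using the pod name from the listing above:

# Liveness probe failures are recorded as events on the pod
kubectl describe pod pcs-pgpool-689fdfb8f4-5zhlx | tail -n 20
# pgpool's own log output around the failing health checks
kubectl logs pcs-pgpool-689fdfb8f4-5zhlx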

PostgreSQL was up within 12 minutes of the helm upgrade, but pgpool was unable to identify the primary. Meanwhile, the repmgr logs on the PostgreSQL side show the cluster agreeing on a primary:

DEBUG: clear_node_info_list() - closing open connections
DEBUG: clear_node_info_list() - unlinking
[2024-11-27 14:06:00] [NOTICE] node 1001 has recovered, reconnecting
[2024-11-27 14:06:00] [NOTICE] notifying node "pcs-postgresql-2" (ID: 1002) to follow node 1001
INFO:  node 1002 received notification to follow node 1001
[2024-11-27 14:06:00] [NOTICE] monitoring cluster primary "pcs-postgresql-1" (ID: 1001)
[2024-11-27 14:06:06] [NOTICE] new standby "pcs-postgresql-2" (ID: 1002) has connected
[2024-11-27 14:06:13] [NOTICE] new standby "pcs-postgresql-0" (ID: 1000) has connected
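
So the repmgr side has settled on "pcs-postgresql-1" (ID: 1001) as primary. One way to compare that against pgpool's own view is SHOW POOL_NODES; a diagnostic sketch, assuming the Bitnami image's PGPOOL_POSTGRES_USERNAME/PGPOOL_POSTGRES_PASSWORD admin credentials are set in the container:

# Ask pgpool which backend it currently treats as primary
kubectl exec pcs-pgpool-689fdfb8f4-5zhlx -- bash -c \
  'PGPASSWORD="$PGPOOL_POSTGRES_PASSWORD" psql -h 127.0.0.1 -p 5432 \
   -U "$PGPOOL_POSTGRES_USERNAME" -c "show pool_nodes"'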

Expected:

I expect that when a new pgpool pod is spun up during a helm upgrade, it picks up the currently running PostgreSQL primary, which it fails to do today. The old pgpool pod is running, healthy, and able to connect to the PostgreSQL primary, whereas the new pod cannot. (See the sanity check below.)
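
For context on how a fresh pgpool pod learns about the backends: the chart passes them in via environment variables, so a quick sanity check on the new pod is to confirm it sees all three backends. A sketch, assuming the Bitnami image's PGPOOL_BACKEND_NODES variable:

# The backend list pgpool is started with (format: index:host:port,...)
kubectl exec pcs-pgpool-689fdfb8f4-5zhlx -- env | grep PGPOOL_BACKEND_NODES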

What do you see instead?

Described above.

carrodher commented 10 hours ago

Hi, the issue may not be directly related to the Bitnami container image/Helm chart, but rather to how the application is being used or configured in your specific environment, or tied to a particular scenario that is not easy to reproduce on our side.

If you think that's not the case and want to contribute a solution, we'd like to invite you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Please feel free to contact us if you have any questions or need assistance.

If you have any questions about the application, customizing its content, or technology and infrastructure usage, we highly recommend that you refer to the forums and user guides provided by the project responsible for the application or technology.

With that said, we'll keep this ticket open until the stale bot automatically closes it, in case someone from the community contributes valuable insights.