[bitnami/postgresql-ha] pod status CrashLoopBackOff

isFxh commented 2 months ago

Name and Version

bitnami/postgresql-ha 16.3.0

What architecture are you using?

amd64

What steps will reproduce the bug?

After I deployed the postgresql-ha chart with Helm, I restarted a machine in the k8s cluster and observed that one or more instance pods were in the "CrashLoopBackOff" state. I checked the logs of the crashed pod and found "/tmp/repmgr.pid exists" in them, as shown in the figure below. I had to manually delete the affected pod before it could recover.

1. helm -n dev install --create-namespace postgresql . (My environment is deployed with an adjusted values.yaml file.)
2. kubectl -n dev get pods --selector app.kubernetes.io/name=postgresql-ha -owide -w
3. Shut down a server in the cluster. (This is the scenario under test: we need to verify high availability and automatic service recovery.)
4. Start the stopped server.
5. kubectl -n dev get pods --selector app.kubernetes.io/name=postgresql-ha -owide -w

The steps above reproduce the problem in my environment. It does not occur on every attempt, but it recurs with high probability.
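The check in the last step can be scripted. The following is a minimal sketch, not something taken from the chart: the function names `crashlooping_pods` and `report_stale_pid_pods` are hypothetical, and it assumes the STATUS value is the third column of `kubectl get pods --no-headers` output and that the pid message appears in the logs of the previously terminated container.

```shell
# Print the names of pods whose STATUS column is CrashLoopBackOff.
# Expects `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE ...).
crashlooping_pods() {
    awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Report which crash-looping pods mention the repmgr pid file in the logs of
# their previously terminated container. Needs kubectl and cluster access.
report_stale_pid_pods() {
    namespace="$1"
    kubectl -n "$namespace" get pods \
        --selector app.kubernetes.io/name=postgresql-ha --no-headers |
        crashlooping_pods |
        while read -r pod; do
            if kubectl -n "$namespace" logs "$pod" --previous 2>/dev/null |
                grep -q 'repmgr.pid'; then
                echo "$pod: log mentions repmgr.pid"
            fi
        done
}
```

With the namespace used above, the call would be `report_stale_pid_pods dev`. The filter is kept separate from the kubectl calls so it can be tried on saved output without a cluster.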

values.yaml

global:
  imageRegistry: "my_repository"
  imagePullSecrets: []
  storageClass: ""
  postgresql:
    username: ""
    password: ""
    database: ""
    repmgrUsername: ""
    repmgrPassword: ""
    repmgrDatabase: ""
    existingSecret: ""
  ldap:
    bindpw: ""
    existingSecret: ""
  pgpool:
    adminUsername: ""
    adminPassword: ""
    existingSecret: ""
  compatibility:
    openshift:
      adaptSecurityContext: auto
kubeVersion: ""
nameOverride: ""
fullnameOverride: ""
namespaceOverride: ""
commonLabels: {}
commonAnnotations: {}
clusterDomain: cluster.local
extraDeploy: []
diagnosticMode:
  enabled: false
  command:
    - sleep
  args:
    - infinity
postgresql:
  image:
    registry: docker.io
    repository: postgresql-repmgr
    tag: 16.3.0-debian-12-r6
    digest: ""
    pullPolicy: IfNotPresent
    pullSecrets: []
    debug: true
  labels: {}
  podLabels: {}
  serviceAnnotations: {}
  replicaCount: 3
  updateStrategy:
    type: RollingUpdate
  containerPorts:
    postgresql: 5432
  automountServiceAccountToken: false
  hostAliases: []
  hostNetwork: false
  hostIPC: false
  podAnnotations: {}
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    type: ""
    key: ""
    values: []
  affinity: {}
  nodeSelector: {}
  tolerations: []
  topologySpreadConstraints: []
  priorityClassName: ""
  schedulerName: ""
  terminationGracePeriodSeconds: ""
  podSecurityContext:
    enabled: true
    fsGroupChangePolicy: Always
    sysctls: []
    supplementalGroups: []
    fsGroup: 1001
  containerSecurityContext:
    enabled: true
    seLinuxOptions: null
    runAsUser: 1001
    runAsGroup: 1001
    runAsNonRoot: true
    privileged: false
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: "RuntimeDefault"
  command: []
  args: []
  lifecycleHooks: {}
  extraEnvVars: []
  extraEnvVarsCM: ""
  extraEnvVarsSecret: ""
  extraVolumes: []
  extraVolumeMounts: []
  initContainers: []
  sidecars: []
  resourcesPreset: "micro"
  resources: {}
  podManagementPolicy: Parallel
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6
  startupProbe:
    enabled: false
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 10
  customLivenessProbe: {}
  customReadinessProbe: {}
  customStartupProbe: {}
  networkPolicy:
    enabled: true
    allowExternal: true
    allowExternalEgress: true
    extraIngress: []
    extraEgress: []
    ingressNSMatchLabels: {}
    ingressNSPodMatchLabels: {}
  pdb:
    create: false
    minAvailable: 1
    maxUnavailable: ""
  username: dbapp
  database: "ailpha"
  existingSecret: ""
  usePasswordFile: ""
  repmgrUsePassfile: ""
  repmgrPassfilePath: ""
  upgradeRepmgrExtension: false
  pgHbaTrustAll: false
  syncReplication: false
  syncReplicationMode: ""
  repmgrUsername: repmgr
  repmgrDatabase: repmgr
  repmgrLogLevel: NOTICE
  repmgrConnectTimeout: 5
  repmgrReconnectAttempts: 2
  repmgrReconnectInterval: 3
  repmgrFenceOldPrimary: false
  repmgrChildNodesCheckInterval: 5
  repmgrChildNodesConnectedMinCount: 1
  repmgrChildNodesDisconnectTimeout: 30
  usePgRewind: false
  audit:
    logHostname: true
    logConnections: false
    logDisconnections: false
    pgAuditLog: ""
    pgAuditLogCatalog: "off"
    clientMinMessages: error
    logLinePrefix: ""
    logTimezone: ""
  sharedPreloadLibraries: "pgaudit, repmgr"
  maxConnections: ""
  postgresConnectionLimit: ""
  dbUserConnectionLimit: ""
  tcpKeepalivesInterval: ""
  tcpKeepalivesIdle: ""
  tcpKeepalivesCount: ""
  statementTimeout: ""
  pghbaRemoveFilters: ""
  extraInitContainers: []
  repmgrConfiguration: ""
  configuration: ""
  pgHbaConfiguration: ""
  configurationCM: ""
  extendedConf: ""
  extendedConfCM: ""
  initdbScripts: {}
  initdbScriptsCM: ""
  initdbScriptsSecret: ""
  tls:
    enabled: false
    preferServerCiphers: true
    certificatesSecret: ""
    certFilename: ""
    certKeyFilename: ""
  preStopDelayAfterPgStopSeconds: 25
  headlessWithNotReadyAddresses: false
witness:
  create: true
  labels: {}
  podLabels: {}
  replicaCount: 2
  updateStrategy:
    type: RollingUpdate
  containerPorts:
    postgresql: 5432
  automountServiceAccountToken: false
  hostAliases: []
  hostNetwork: false
  hostIPC: false
  podAnnotations: {}
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    type: ""
    key: ""
    values: []
  affinity: {}
  nodeSelector: {}
  tolerations: []
  topologySpreadConstraints: []
  priorityClassName: ""
  schedulerName: ""
  terminationGracePeriodSeconds: ""
  podSecurityContext:
    enabled: true
    fsGroupChangePolicy: Always
    sysctls: []
    supplementalGroups: []
    fsGroup: 1001
  containerSecurityContext:
    enabled: true
    seLinuxOptions: null
    runAsUser: 1001
    runAsGroup: 1001
    runAsNonRoot: true
    privileged: false
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: "RuntimeDefault"
  command: []
  args: []
  lifecycleHooks: {}
  extraEnvVars: []
  extraEnvVarsCM: ""
  extraEnvVarsSecret: ""
  extraVolumes: []
  extraVolumeMounts: []
  initContainers: []
  sidecars: []
  resourcesPreset: "micro"
  resources: {}
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6
  startupProbe:
    enabled: false
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 10
  customLivenessProbe: {}
  customReadinessProbe: {}
  customStartupProbe: {}
  pdb:
    create: false
    minAvailable: 1
    maxUnavailable: ""
  upgradeRepmgrExtension: false
  pgHbaTrustAll: false
  repmgrLogLevel: NOTICE
  repmgrConnectTimeout: 5
  repmgrReconnectAttempts: 2
  repmgrReconnectInterval: 3
  audit:
    logHostname: true
    logConnections: false
    logDisconnections: false
    pgAuditLog: ""
    pgAuditLogCatalog: "off"
    clientMinMessages: error
    logLinePrefix: ""
    logTimezone: ""
  maxConnections: ""
  postgresConnectionLimit: ""
  dbUserConnectionLimit: ""
  tcpKeepalivesInterval: ""
  tcpKeepalivesIdle: ""
  tcpKeepalivesCount: ""
  statementTimeout: ""
  pghbaRemoveFilters: ""
  extraInitContainers: []
  repmgrConfiguration: ""
  configuration: ""
  pgHbaConfiguration: ""
  configurationCM: ""
  extendedConf: ""
  extendedConfCM: ""
  initdbScripts: {}
  initdbScriptsCM: ""
  initdbScriptsSecret: ""
pgpool:
  image:
    registry: docker.io
    repository: pgpool
    tag: 4.5.1-debian-12-r5
    digest: ""
    pullPolicy: IfNotPresent
    pullSecrets: []
    debug: true
  customUsers:
    usernames: "postgres"
  automountServiceAccountToken: false
  hostAliases: []
  customUsersSecret: ""
  existingSecret: ""
  srCheckDatabase: postgres
  labels: {}
  podLabels: {}
  serviceLabels: {}
  serviceAnnotations: {}
  customLivenessProbe: {}
  customReadinessProbe: {}
  customStartupProbe: {}
  command: []
  args: []
  lifecycleHooks: {}
  extraEnvVars: []
  extraEnvVarsCM: ""
  extraEnvVarsSecret: ""
  extraVolumes: []
  extraVolumeMounts: []
  initContainers: []
  sidecars: []
  replicaCount: 2
  podAnnotations: {}
  priorityClassName: ""
  schedulerName: ""
  terminationGracePeriodSeconds: ""
  topologySpreadConstraints: []
  podAffinityPreset: ""
  podAntiAffinityPreset: soft
  nodeAffinityPreset:
    type: ""
    key: ""
    values: []
  affinity: {}
  nodeSelector: {}
  tolerations: []
  podSecurityContext:
    enabled: true
    fsGroupChangePolicy: Always
    sysctls: []
    supplementalGroups: []
    fsGroup: 1001
  containerSecurityContext:
    enabled: true
    seLinuxOptions: null
    runAsUser: 1001
    runAsGroup: 1001
    runAsNonRoot: true
    privileged: false
    readOnlyRootFilesystem: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: "RuntimeDefault"
  resourcesPreset: "micro"
  resources: {}
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 5
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 5
  startupProbe:
    enabled: false
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 10
  networkPolicy:
    enabled: true
    allowExternal: true
    allowExternalEgress: true
    extraIngress: []
    extraEgress: []
    ingressNSMatchLabels: {}
    ingressNSPodMatchLabels: {}
  pdb:
    create: false
    minAvailable: 1
    maxUnavailable: ""
  updateStrategy: {}
  containerPorts:
    postgresql: 5432
  minReadySeconds: ""
  adminUsername: admin
  adminPassword: ""
  usePasswordFile: ""
  authenticationMethod: scram-sha-256
  logConnections: false
  logHostname: true
  logPerNodeStatement: false
  logLinePrefix: ""
  clientMinMessages: error
  numInitChildren: ""
  reservedConnections: 1
  maxPool: ""
  childMaxConnections: ""
  childLifeTime: ""
  clientIdleLimit: ""
  connectionLifeTime: ""
  useLoadBalancing: true
  disableLoadBalancingOnWrite: transaction
  configuration: ""
  configurationCM: ""
  initdbScripts: {}
  initdbScriptsCM: ""
  initdbScriptsSecret: ""
  tls:
    enabled: false
    autoGenerated: false
    preferServerCiphers: true
    certificatesSecret: ""
    certFilename: ""
    certKeyFilename: ""
    certCAFilename: ""
ldap:
  enabled: false
  existingSecret: ""
  uri: ""
  basedn: ""
  binddn: ""
  bindpw: ""
  bslookup: ""
  scope: ""
  tlsReqcert: ""
  nssInitgroupsIgnoreusers: root,nslcd
rbac:
  create: false
  rules: []
serviceAccount:
  create: true
  name: ""
  annotations: {}
  automountServiceAccountToken: false
psp:
  create: false
metrics:
  enabled: false
  image:
    registry: docker.io
    repository: bitnami/postgres-exporter
    tag: 0.15.0-debian-12-r31
    digest: ""
    pullPolicy: IfNotPresent
    pullSecrets: []
    debug: true
  podSecurityContext:
    enabled: true
    seLinuxOptions: null
    runAsUser: 1001
    runAsGroup: 1001
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  resourcesPreset: "nano"
  resources: {}
  containerPorts:
    http: 9187
  livenessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6
  readinessProbe:
    enabled: true
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 6
  startupProbe:
    enabled: false
    initialDelaySeconds: 5
    periodSeconds: 10
    timeoutSeconds: 5
    successThreshold: 1
    failureThreshold: 10
  customLivenessProbe: {}
  customReadinessProbe: {}
  customStartupProbe: {}
  service:
    enabled: true
    type: ClusterIP
    ports:
      metrics: 9187
    nodePorts:
      metrics: ""
    clusterIP: ""
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    externalTrafficPolicy: Cluster
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9187"
  customMetrics: {}
  extraEnvVars: []
  extraEnvVarsCM: ""
  extraEnvVarsSecret: ""
  serviceMonitor:
    enabled: false
    namespace: ""
    interval: ""
    scrapeTimeout: ""
    annotations: {}
    labels: {}
    selector:
      prometheus: kube-prometheus
    relabelings: []
    metricRelabelings: []
    honorLabels: false
    jobLabel: ""
volumePermissions:
  enabled: false
  image:
    registry: docker.io
    repository: bitnami/os-shell
    tag: 12-debian-12-r21
    digest: ""
    pullPolicy: IfNotPresent
    pullSecrets: []
  podSecurityContext:
    enabled: true
    seLinuxOptions: null
    runAsUser: 0
    runAsGroup: 0
    runAsNonRoot: false
    seccompProfile:
      type: RuntimeDefault
  resourcesPreset: "nano"
  resources: {}
persistence:
  enabled: true
  existingClaim: ""
  storageClass: ""
  mountPath: /bitnami/postgresql
  accessModes:
    - ReadWriteOnce
  size: 8Gi
  annotations: {}
  labels: {}
  selector: {}
persistentVolumeClaimRetentionPolicy:
  enabled: false
  whenScaled: Retain
  whenDeleted: Retain
service:
  type: NodePort
  ports:
    postgresql: 5432
  portName: postgresql
  nodePorts:
    postgresql: ""
  loadBalancerIP: ""
  loadBalancerSourceRanges: []
  clusterIP: ""
  externalTrafficPolicy: Cluster
  extraPorts: []
  sessionAffinity: "None"
  sessionAffinityConfig: {}
  annotations: {}
  serviceLabels: {}
  headless:
    annotations: {}
backup:
  enabled: false
  cronjob:
    schedule: "@daily"
    timeZone: ""
    concurrencyPolicy: Allow
    failedJobsHistoryLimit: 1
    successfulJobsHistoryLimit: 3
    startingDeadlineSeconds: ""
    ttlSecondsAfterFinished: ""
    restartPolicy: OnFailure
    podSecurityContext:
      enabled: true
      fsGroupChangePolicy: Always
      sysctls: []
      supplementalGroups: []
      fsGroup: 1001
    containerSecurityContext:
      enabled: true
      seLinuxOptions: null
      runAsUser: 1001
      runAsGroup: 1001
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
          - ALL
    command:
      - /bin/sh
      - -c
      - "pg_dumpall --clean --if-exists --load-via-partition-root --quote-all-identifiers --no-password --file=${PGDUMP_DIR}/pg_dumpall-$(date '+%Y-%m-%d-%H-%M').pgdump"
    labels: {}
    annotations: {}
    nodeSelector: {}
    storage:
      existingClaim: ""
      resourcePolicy: ""
      storageClass: ""
      accessModes:
        - ReadWriteOnce
      size: 8Gi
      annotations: {}
      mountPath: /backup/pgdump
      subPath: ""
      volumeClaimTemplates:

Are you using any custom parameters or values?

No response

What is the expected behavior?

No response

What do you see instead?

Additional information

No response

DCx14 commented 2 months ago

Hello,

I have the same problem ...

DCx14 commented 2 months ago

can you test this command?

kubectl debug -it -n=cluster-postgresql postgresql-ha-1721639657-pgpool-9bfd9cf75-l226d --image=nicolaka/netshoot

and tell me what you have when you do a netstat -an?

isFxh commented 2 months ago

> can you test this command?
>
> kubectl debug -it -n=cluster-postgresql postgresql-ha-1721639657-pgpool-9bfd9cf75-l226d --image=nicolaka/netshoot
>
> and tell me what you have when you do a netstat -an?

Sorry, I can't run that command at the moment because my environment has already been restored. I modified entrypoint.sh to check for the repmgr.pid file whenever the container is restarted or rebuilt, and then rebuilt the postgresql-repmgr image.

#!/bin/bash
# Copyright Broadcom, Inc. All Rights Reserved.
# SPDX-License-Identifier: APACHE-2.0

# shellcheck disable=SC1091

set -o errexit
set -o nounset
set -o pipefail
#set -o xtrace

# Load libraries
. /opt/bitnami/scripts/liblog.sh
. /opt/bitnami/scripts/libbitnami.sh
. /opt/bitnami/scripts/libpostgresql.sh
. /opt/bitnami/scripts/librepmgr.sh

# Load PostgreSQL & repmgr environment variables
. /opt/bitnami/scripts/postgresql-env.sh
export MODULE=postgresql-repmgr

print_welcome_page

# Enable the nss_wrapper settings
postgresql_enable_nss_wrapper

# We add the copy from default config in the entrypoint to not break users
# bypassing the setup.sh logic. If the file already exists do not overwrite (in
# case someone mounts a configuration file in /opt/bitnami/postgresql/conf)
debug "Copying files from $POSTGRESQL_DEFAULT_CONF_DIR to $POSTGRESQL_CONF_DIR"
cp -nr "$POSTGRESQL_DEFAULT_CONF_DIR"/. "$POSTGRESQL_CONF_DIR"

# Remove a repmgr pid file left over from a previous run of the container.
info "Checking for a leftover repmgr pid file."
if [[ -f /tmp/repmgr.pid ]]; then
  info "Found /tmp/repmgr.pid, removing it."
  rm -f /tmp/repmgr.pid
else
  info "No repmgr pid file found, skipping."
fi

if [[ "$*" = *"/opt/bitnami/scripts/postgresql-repmgr/run.sh"* ]]; then
    info "** Starting PostgreSQL with Replication Manager setup **"
    /opt/bitnami/scripts/postgresql-repmgr/setup.sh
    touch "$POSTGRESQL_TMP_DIR"/.initialized
    info "** PostgreSQL with Replication Manager setup finished! **"
fi

echo ""
exec "$@"
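
The script above removes the pid file unconditionally. A more conservative variant would remove it only when the recorded PID no longer belongs to a running process. The following is a minimal sketch under that assumption. `remove_stale_pid_file` is a hypothetical helper, not part of the Bitnami scripts, and it has not been tested against the postgresql-repmgr image.

```shell
# Hypothetical helper (not part of the Bitnami scripts): remove a pid file
# only when the PID recorded in it does not belong to a running process.
remove_stale_pid_file() {
    pid_file="$1"

    # Nothing to do when the file is absent.
    [ -f "$pid_file" ] || return 0

    # Read the recorded PID, dropping whitespace and newlines.
    pid="$(tr -d '[:space:]' < "$pid_file")"

    case "$pid" in
        '' | *[!0-9]*)
            # Empty or non-numeric content cannot name a live process.
            ;;
        *)
            # kill -0 delivers no signal; it only tests whether the PID exists
            # and may be signalled by the current user.
            if kill -0 "$pid" 2>/dev/null; then
                echo "process $pid is still running, keeping $pid_file"
                return 1
            fi
            ;;
    esac

    echo "removing stale pid file $pid_file"
    rm -f "$pid_file"
}
```

Because the entrypoint runs with `set -o errexit`, a call there would need a guard such as `remove_stale_pid_file /tmp/repmgr.pid || true`. One caveat: after a container restart the recorded PID may coincide with an unrelated new process, in which case this check keeps the file and the unconditional removal above is the simpler choice.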

javsalgar commented 2 months ago

Hi!

Thanks for sharing the fix! Would you like to submit a PR in bitnami/containers?

isFxh commented 2 months ago

> Hi!
>
> Thanks for sharing the fix! Would you like to submit a PR in bitnami/containers?

Hi, the adjustment above has temporarily fixed the problem in my environment, but I am not sure whether it causes other problems. The check may simply be masking other errors.

DCx14 commented 2 months ago

Hi,

thank you for sharing :)

github-actions[bot] commented 2 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

gtjarks commented 2 months ago

Hi, we have the same problem. Have you found the reason for the problem?

github-actions[bot] commented 1 month ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

github-actions[bot] commented 1 month ago

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
