Hi @HenriqueLBorges
Did you perform any upgrade of your PostgreSQL chart, or did it happen after installing the chart without any human intervention at all? Sometimes these errors are related to credentials that change after upgrading the chart.
Hi @juan131
No, I didn't. It happened after installing the chart, without any human intervention.
That's weird...
Do you have any monitoring system in place (e.g. Prometheus + Grafana)? If so, you could try to identify whether there is a correlation between these restarts and high CPU/memory usage on the PostgreSQL container. Do you have any log collection system (e.g. ELK or EFK)? You could also take a look at the container logs just before the restarts to identify what's going on (please share them with us as well so we can take a look).
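For example, something like this (standard kubectl commands; replace POSTGRESQL_POD and NAMESPACE with your actual pod name and namespace):
# Logs of the current container
kubectl logs POSTGRESQL_POD -n NAMESPACE
# Logs of the previous (crashed) container, useful to see what happened right before the restart
kubectl logs POSTGRESQL_POD -n NAMESPACE --previous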
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.
Hi @juan131, I'm also encountering a similar problem where the postgres pod goes into a restart loop.
Chart: postgresql-8.9.4
kubectl describe pod postgresql-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 30m (x162 over 12h) kubelet, gke-node Container image "docker.io/bitnami/postgresql:11.7.0" already present on machine
Warning Unhealthy 20m (x1661 over 12h) kubelet, gke-node Readiness probe failed: 127.0.0.1:5432 - no response
Warning Unhealthy 11m (x996 over 12h) kubelet, gke-node Liveness probe failed: 127.0.0.1:5432 - no response
Warning BackOff 51s (x1990 over 12h) kubelet, gke-node Back-off restarting failed container
kubectl logs postgresql-pod
postgresql 08:48:37.08
postgresql 08:48:38.60 Welcome to the Bitnami postgresql container
postgresql 08:48:39.47 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql
postgresql 08:48:40.27 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql/issues
postgresql 08:48:41.09
postgresql 08:48:49.91 INFO ==> ** Starting PostgreSQL setup **
postgresql 08:48:59.37 INFO ==> Validating settings in POSTGRESQL_* env vars..
postgresql 08:49:03.28 INFO ==> Loading custom pre-init scripts...
postgresql 08:49:05.17 INFO ==> Loading user's custom files from /docker-entrypoint-preinitdb.d ...
postgresql 08:49:09.58 INFO ==> Initializing PostgreSQL database...
postgresql 08:49:11.38 INFO ==> Cleaning stale /bitnami/postgresql/data/postmaster.pid file
postgresql 08:49:19.69 INFO ==> postgresql.conf file not detected. Generating it...
postgresql 08:49:23.26 INFO ==> pg_hba.conf file not detected. Generating it...
postgresql 08:49:25.18 INFO ==> Generating local authentication configuration
postgresql 08:49:29.68 INFO ==> Deploying PostgreSQL with persisted data...
postgresql 08:49:33.17 INFO ==> Configuring replication parameters
postgresql 08:49:41.68 INFO ==> Configuring fsync
postgresql 08:49:45.37 INFO ==> Loading custom scripts...
postgresql 08:49:48.89 INFO ==> Enabling remote connections
postgresql 08:49:51.56 INFO ==> Stopping PostgreSQL...
postgresql 08:49:53.30 INFO ==> ** PostgreSQL setup finished! **
postgresql 08:50:03.49 INFO ==> ** Starting PostgreSQL **
2020-07-08 08:50:05.215 GMT [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-07-08 08:50:05.215 GMT [1] LOG: listening on IPv6 address "::", port 5432
2020-07-08 08:50:05.221 GMT [1] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-07-08 08:50:05.275 GMT [261] LOG: database system was interrupted; last known up at 2020-07-08 08:43:01 GMT
2020-07-08 08:50:05.490 GMT [261] LOG: database system was not properly shut down; automatic recovery in progress
2020-07-08 08:50:05.495 GMT [261] LOG: redo starts at 0/2F101E0
2020-07-08 08:50:05.499 GMT [261] LOG: invalid record length at 0/2F35A30: wanted 24, got 0
2020-07-08 08:50:05.499 GMT [261] LOG: redo done at 0/2F35A08
2020-07-08 08:50:05.499 GMT [261] LOG: last completed transaction was at log time 2020-07-08 08:43:23.928121+00
2020-07-08 08:50:05.551 GMT [1] LOG: database system is ready to accept connections
2020-07-08 08:50:26.015 GMT [293] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.015 GMT [294] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.015 GMT [295] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.015 GMT [289] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.015 GMT [292] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.015 GMT [291] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.015 GMT [275] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.015 GMT [303] FATAL: terminating connection due to unexpected postmaster exit
2020-07-08 08:50:26.017 GMT [290] FATAL: terminating connection due to unexpected postmaster exit
Hello @caalberts
Did you perform any upgrade of your PostgreSQL release recently? A very common issue is upgrading the chart without indicating the original passwords that were generated the first time you installed it. When this happens, the password in your secrets (the one used by the readiness/liveness probes) is regenerated and gets out of sync with the one stored in your PostgreSQL data.
This is documented in the link below:
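In short, the idea is to reuse the password generated on the first installation when upgrading, e.g. (a sketch; replace RELEASE_NAME and NAMESPACE with yours, and note the secret name/key may differ in your release):
# Retrieve the password generated on the first install
export POSTGRES_PASSWORD=$(kubectl get secret RELEASE_NAME-postgresql -n NAMESPACE -o jsonpath="{.data.postgresql-password}" | base64 --decode)
# Reuse it when upgrading so the probes and the persisted data stay in sync
helm upgrade RELEASE_NAME bitnami/postgresql --set postgresqlPassword=$POSTGRES_PASSWORD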
Hello @juan131
This is not my case; the readiness and liveness probes usually start to fail at times when I'm not using it heavily. It almost feels random.
Hi @HenriqueLBorges
Could you share the output of running kubectl describe pod POSTGRESQL_POD, where POSTGRESQL_POD is a placeholder for your actual PostgreSQL pod?
I'd like to see if the probes are providing information about the reason why they're failing.
Hi @juan131
Here is the output below:
Name: postgres-postgresql-0
Namespace: prd
Priority: 0
Node: worker-k8s-4/172.20.8.6
Start Time: Thu, 25 Jun 2020 21:29:08 +0000
Labels: app=postgresql
chart=postgresql-8.4.0
controller-revision-hash=postgres-postgresql-7878b964b8
heritage=Tiller
release=postgres
role=master
security.istio.io/tlsMode=istio
statefulset.kubernetes.io/pod-name=postgres-postgresql-0
Annotations: sidecar.istio.io/status:
{"version":"805d5a8f492b8fa20c7d92aac6e0cda9fe0f1fe63e5073b929d39ea721788f25","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status: Running
IP: 10.244.4.252
IPs:
IP: 10.244.4.252
Controlled By: StatefulSet/postgres-postgresql
Init Containers:
init-chmod-data:
Container ID: docker://3f13e1e74ae2d74a908c709416dc636cdac67cfb71f631fd3b476c42d1d7f4c5
Image: docker.io/bitnami/minideb:buster
Image ID: docker-pullable://bitnami/minideb@sha256:b3623f09926bd762482eba7c0cf4dd65801fb0a649af3ca4fa2c6ee3f2866da0
Port: <none>
Host Port: <none>
Command:
/bin/sh
-cx
mkdir -p /bitnami/postgresql/data
chmod 700 /bitnami/postgresql/data
find /bitnami/postgresql -mindepth 1 -maxdepth 1 -not -name ".snapshot" -not -name "lost+found" | \
xargs chown -R 1001:1001
chmod -R 777 /dev/shm
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 25 Jun 2020 21:29:11 +0000
Finished: Thu, 25 Jun 2020 21:29:15 +0000
Ready: True
Restart Count: 0
Requests:
cpu: 999m
memory: 8Gi
Environment: <none>
Mounts:
/bitnami/postgresql from data (rw)
/dev/shm from dshm (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-77twx (ro)
istio-init:
Container ID: docker://fc3f84ef0c3a0800d47e04b09d9aea1da79af2a07ffed8b75332120de2622627
Image: docker.io/istio/proxyv2:1.4.5
Image ID: docker-pullable://istio/proxyv2@sha256:fc09ea0f969147a4843a564c5b677fbf3a6f94b56627d00b313b4c30d5fef094
Port: <none>
Host Port: <none>
Command:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 25 Jun 2020 21:29:16 +0000
Finished: Thu, 25 Jun 2020 21:29:16 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 10m
memory: 10Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-77twx (ro)
Containers:
postgres-postgresql:
Container ID: docker://01c43a4ed85905c6d9599674118be8039406758dfcf7befc9bad4f6b956e4338
Image: docker.io/bitnami/postgresql:11.7.0-debian-10-r0
Image ID: docker-pullable://bitnami/postgresql@sha256:f946f10bdff3b1bc3617536a342da0f0b684623f5409f9c162be72df4155f384
Port: 5432/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 11 Jul 2020 02:43:45 +0000
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Sat, 11 Jul 2020 01:51:26 +0000
Finished: Sat, 11 Jul 2020 02:43:45 +0000
Ready: True
Restart Count: 534
Requests:
cpu: 900m
memory: 8Gi
Liveness: exec [/bin/sh -c exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432] delay=30s timeout=5s period=10s #success=1 #failure=6
Readiness: exec [/bin/sh -c -e exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432
[ -f /opt/bitnami/postgresql/tmp/.initialized ] || [ -f /bitnami/postgresql/.initialized ]
] delay=5s timeout=5s period=10s #success=1 #failure=6
Environment:
BITNAMI_DEBUG: false
POSTGRESQL_PORT_NUMBER: 5432
POSTGRESQL_VOLUME_DIR: /bitnami/postgresql
PGDATA: /bitnami/postgresql/data
POSTGRES_USER: postgres
POSTGRES_PASSWORD: <set to the key 'postgresql-password' in secret 'postgres-postgresql'> Optional: false
POSTGRESQL_ENABLE_LDAP: no
Mounts:
/bitnami/postgresql from data (rw)
/dev/shm from dshm (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-77twx (ro)
istio-proxy:
Container ID: docker://a2199bde3d640eacaf1269d8417580e6997cf70b01c760f8e8700e7d0dd30e97
Image: docker.io/istio/proxyv2:1.4.5
Image ID: docker-pullable://istio/proxyv2@sha256:fc09ea0f969147a4843a564c5b677fbf3a6f94b56627d00b313b4c30d5fef094
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
postgresql.$(POD_NAMESPACE)
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system:15010
--zipkinAddress
zipkin.istio-system:9411
--dnsRefreshRate
300s
--connectTimeout
10s
--proxyAdminPort
15000
--concurrency
2
--controlPlaneAuthPolicy
NONE
--statusPort
15020
--applicationPorts
5432
State: Running
Started: Thu, 25 Jun 2020 21:29:17 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 100m
memory: 128Mi
Readiness: http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
Environment:
POD_NAME: postgres-postgresql-0 (v1:metadata.name)
ISTIO_META_POD_PORTS: [
{"name":"tcp-postgresql","containerPort":5432,"protocol":"TCP"}
]
ISTIO_META_CLUSTER_ID: Kubernetes
POD_NAMESPACE: prd (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
ISTIO_META_POD_NAME: postgres-postgresql-0 (v1:metadata.name)
ISTIO_META_CONFIG_NAMESPACE: prd (v1:metadata.namespace)
SDS_ENABLED: false
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_META_INCLUDE_INBOUND_PORTS: 5432
ISTIO_METAJSON_LABELS: {"app":"postgresql","chart":"postgresql-8.4.0","controller-revision-hash":"postgres-postgresql-7878b964b8","heritage":"Tiller","release":"postgres","role":"master","statefulset.kubernetes.io/pod-name":"postgres-postgresql-0"}
ISTIO_META_WORKLOAD_NAME: postgres-postgresql
ISTIO_META_OWNER: kubernetes://apis/apps/v1/namespaces/prd/statefulsets/postgres-postgresql
Mounts:
/etc/certs/ from istio-certs (ro)
/etc/istio/proxy from istio-envoy (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-77twx (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 10Gi
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pvc0004
ReadOnly: false
default-token-77twx:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-77twx
Optional: false
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.default
Optional: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 30m (x6601 over 15d) kubelet, worker-k8s-4 Liveness probe failed: 127.0.0.1:5432 - no response
Warning Unhealthy 30m (x7835 over 15d) kubelet, worker-k8s-4 Readiness probe failed: 127.0.0.1:5432 - no response
Hi @HenriqueLBorges
There is not much information about the reasons why the readiness probes are failing there. As you can see, these are the probes:
Liveness: exec [/bin/sh -c exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432] delay=30s timeout=5s period=10s #success=1 #failure=6
Readiness: exec [/bin/sh -c -e exec pg_isready -U "postgres" -h 127.0.0.1 -p 5432
Could you please access your PostgreSQL container and manually run the command below?
pg_isready -U "postgres" -h 127.0.0.1 -p 5432
Hello @juan131
Here is the output:
root@master-k8s:~# kubectl exec -it postgres-postgresql-0 --namespace prd -- /bin/bash
Defaulting container name to postgres-postgresql.
Use 'kubectl describe pod/postgres-postgresql-0 -n prd' to see all of the containers in this pod.
I have no name!@postgres-postgresql-0:/$ pg_isready -U "postgres" -h 127.0.0.1 -p 5432
127.0.0.1:5432 - accepting connections
Hi @HenriqueLBorges
It seems your probes are working (at least at the very moment you tried them), but it's hard to debug since you don't have the logs of the probes when they fail.
The output from the probes is swallowed by the kubelet component on the node. If a probe fails, its output is recorded as an event associated with the pod. However, we didn't obtain any relevant information when you ran the "kubectl describe" command. It only says "127.0.0.1:5432 - no response",
which is not very descriptive.
Maybe you can try editing the probes to use a different command that provides more information about what's going on (instead of pg_isready).
Did you perform any upgrade of your PostgreSQL release recently?
Hi @juan131, no it was not an upgrade to an existing PostgreSQL. It's a new PostgreSQL deployment.
Hi @caalberts
The "no response" answer on pg_isready
means the PostgreSQL server is not responding See https://www.postgresql.org/docs/12/app-pg-isready.html
That said, your logs didn't show any warn/error describing the reasons why it's not responding:
2020-07-08 08:50:05.499 GMT [261] LOG: last completed transaction was at log time 2020-07-08 08:43:23.928121+00
2020-07-08 08:50:05.551 GMT [1] LOG: database system is ready to accept connections
You can try increasing the log verbosity by setting log_error_verbosity in the postgresql.conf configuration file. To do so, install the chart using the postgresqlExtendedConf parameter, e.g. with the values.yaml below:
postgresqlExtendedConf:
  log_error_verbosity: verbose
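If the verbose error output is still not enough, you could also enable connection logging to correlate the probe failures with what the server sees. This is just a sketch using standard PostgreSQL parameters, assuming the chart renders each key as a key = value line in the generated configuration:
postgresqlExtendedConf:
  log_error_verbosity: verbose
  log_connections: "on"       # log every successful connection attempt
  log_disconnections: "on"    # log session end, including duration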
Hi @juan131,
Do you have a different command to recommend in the probes?
Thanks in advance
Hi @HenriqueLBorges
I would first try to increase the log verbosity as I mentioned in my previous comment. That said, you can replace the pg_isready probe with some query against the database (e.g. listing databases or something like that).
One important thing you can do is relax the frequency of the probes. The default "periodSeconds" value for the readiness & liveness probes is 10 seconds. You can relax it to 30 seconds to avoid overloading your PostgreSQL server.
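For example, something like the values below could work as a starting point. This is an untested sketch: it assumes your chart version exposes livenessProbe.enabled/readinessProbe.enabled together with customLivenessProbe/customReadinessProbe (check your chart's values), and it reuses the POSTGRES_PASSWORD env var that is already set in the container:
livenessProbe:
  enabled: false          # disable the default pg_isready-based probe
customLivenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - PGPASSWORD="$POSTGRES_PASSWORD" psql -U "postgres" -h 127.0.0.1 -p 5432 -c "SELECT 1"
  initialDelaySeconds: 30
  periodSeconds: 30       # relaxed from the default 10s
  timeoutSeconds: 5
  failureThreshold: 6
readinessProbe:
  enabled: false
customReadinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - PGPASSWORD="$POSTGRES_PASSWORD" psql -U "postgres" -h 127.0.0.1 -p 5432 -c "SELECT 1"
  initialDelaySeconds: 5
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 6
With an exec probe like this, a failure should record the actual psql error message in the pod events instead of the generic "no response".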
@HenriqueLBorges have you found out what the issue was?
I am also interested to know.
A possibility (although it's not the best alternative) is to use "tcpSocket" to simply ensure Pgpool is listening on the expected port. You can try it using the values below:
pgpool:
  livenessProbe:
    enabled: false
  customLivenessProbe:
    tcpSocket:
      port: postgresql
  readinessProbe:
    enabled: false
  customReadinessProbe:
    tcpSocket:
      port: postgresql
@juan131 please elaborate on why it's not the best alternative and what the drawbacks are.
In some other threads I read that an HTTP request may also work (I haven't tested it so far), something like this (maybe someone can complete it):
readinessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 60
  periodSeconds: 15
  timeoutSeconds: 10
livenessProbe:
  httpGet:
    path: /
    port: http
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
Hi @vishrantgupta
IMHO using tcpSocket is not optimal because it simply ensures there's a process listening on a certain port. However, it doesn't check the health of the application nor whether the app is ready to accept connections.
@exocode I don't think the http port is exposed in postgres; does it need any change on the postgres pod side?
Using "httpGet" probes on PostgreSQL won't work since it doesn't expose any web endpoint
ok, didn't know that :-)
Well, what is the solution?
Also looking for what the solution was
Guys, take a look at Patroni.
I would consider VMware SQL with Postgres for Kubernetes https://network.pivotal.io/products/tanzu-sql-postgres/#/releases/1450456/artifact_references
@emoxam what exactly are you proposing with Patroni?
I was able to remedy this issue by downgrading to chart version 14.3.3 from 15.2.2. This was on a fresh install of Postgres.
From @HenriqueLBorges: https://github.com/bitnami/bitnami-docker-postgresql/issues/222
Description
Describe the bug: Hello, I have a Kubernetes cluster running PostgreSQL. There are no resource limitations, but at a random moment the readiness/liveness probes fail and then my container is restarted.
Steps to reproduce the issue:
Describe the results you received:
Additional information you deem important (e.g. issue happens only occasionally):
I ran pg_isready inside my container numerous times and every time I got the same response ("accepting connections"). I tried to execute big SQL statements and exceed the connection limit, but I wasn't able to force a container restart. These restarts all happen when my cluster isn't being heavily used.
Version
docker version:
docker info:
Additional environment details (AWS, VirtualBox, Docker for MAC, physical, etc.):