CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Liveness probe for postgres-operator #3593

Open polikeiji opened 1 year ago

polikeiji commented 1 year ago

Does the postgres-operator have an exposed port for configuring liveness and readiness probes?

Based on the following code, I expected that we could use an exposed port for the probes offered by controller-runtime.

I confirmed that we can access the metrics on port 8080, but I couldn't connect to port 8081 for the probes. I checked this with PGO 5.3.0 running on Kubernetes 1.21.14.
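For context, this is roughly how a controller-runtime manager serves those endpoints when it is configured to do so. This is only a minimal sketch on my side; the `:8081` address, the check names, and whether PGO actually wires any of this up are assumptions, not something I've confirmed in the PGO code.

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// A manager only serves /healthz and /readyz when a bind address is set
	// and checks are registered; ":8081" here is an assumption, not a PGO default.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		panic(err)
	}

	// healthz.Ping is a trivial check that succeeds as long as the HTTP
	// server is serving; an operator could register richer checks here.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		panic(err)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```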

We haven't reported it here yet because we couldn't gather enough logs to do so, but we ran into an issue where the operator suddenly stopped its reconciliation process. At the time, the operator process itself didn't show any bad status, so Kubernetes never triggered a restart of the operator pod. To guard against that situation, we'd like to configure readiness and liveness probes for the operator.

jmckulk commented 1 year ago

Hey @polikeiji, we aren't doing anything to expose ports on the operator container. If there is anything wrong with the operator process, the pod should become not ready and restart.

Can you see the logs from the operator container? I would expect any issues to show up there.

If you do find any logs, feel free to share them here. You can also share your cluster spec, and we'd be happy to look for anything out of place.

polikeiji commented 1 year ago

@jmckulk Thank you for the reply.

> If there is anything wrong with the operator process, the pod should become not ready and restart.

I think the operator container will only become "not ready" and be restarted when the container process exits, unless a liveness probe is configured for the container. So I believe every operator basically needs an appropriate liveness probe to correctly detect that something has gone wrong, including the situation where the process keeps running but is stuck in a bad state.
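To illustrate what I mean, this is roughly the kind of probe I'd like to attach to the operator container, assuming the manager exposed `/healthz` on port 8081 as in the earlier sketch. The path, port, and timings are all assumptions on my part, not PGO defaults; the snippet just builds the probe and prints the equivalent Pod-spec YAML.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical liveness probe for the operator container; /healthz and
	// port 8081 match the controller-runtime sketch above, and the timings
	// are just plausible starting values.
	probe := corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",
				Port: intstr.FromInt(8081),
			},
		},
		InitialDelaySeconds: 10,
		PeriodSeconds:       15,
		FailureThreshold:    3,
	}

	// Print the YAML that would go under the container's livenessProbe field.
	out, err := yaml.Marshal(probe)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```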

> Can you see the logs from the operator container?

We checked the operator's logs when we hit the issue, but we couldn't find any error logs that looked relevant. We were running the PGO operator with debug logging enabled. The reconciliation debug messages disappeared from the logs, and only debug messages from the PGO version check process kept being recorded. If we run into the issue again and figure out how to reproduce it, we'll report it here :)

tony-landreth commented 1 year ago

Hi @polikeiji! Great question about readiness and liveness probes. We'll add a story about probes for the operator to our backlog. Thanks!

baptman21 commented 2 months ago

Hello,

I just wanted to give this issue a small bump.

We experienced a similar problem recently where the operator just stopped processing events. There was no noticeable increase in CPU or RAM beforehand, but out of the blue it stopped responding and processing reconcile events, and its CPU and RAM usage dropped.

There were also no logs at all from the operator itself (running in debug mode), and we haven't been able to reproduce it so far, so I can't really open a separate issue for now, but we are trying to reproduce it.

A liveness probe could really help :pray:. In the meantime, is there any other way to check whether the operator is alive? From what we gathered, the main process was not killed, so my guess is that one of its child processes was, but apart from checking the logs I don't see a way to detect this problem.
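For what it's worth, one heuristic we may try in the meantime is watching whether the controller-runtime reconcile counters on the metrics endpoint (port 8080, which was confirmed reachable earlier in this thread) keep moving. A rough sketch of the idea; the metric name comes from controller-runtime, and a quiet counter only means no reconciles were counted in the window, not necessarily that the operator is stuck:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// sumReconcileTotal scrapes the operator's Prometheus metrics page and adds up
// every controller_runtime_reconcile_total sample it finds.
func sumReconcileTotal(url string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "controller_runtime_reconcile_total") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += v
	}
	return total, scanner.Err()
}

func main() {
	// Reach the metrics port e.g. via `kubectl port-forward` to the operator pod.
	const url = "http://localhost:8080/metrics"

	before, err := sumReconcileTotal(url)
	if err != nil {
		panic(err)
	}
	time.Sleep(10 * time.Minute)
	after, err := sumReconcileTotal(url)
	if err != nil {
		panic(err)
	}

	if after == before {
		fmt.Println("no reconciles counted in 10 minutes; the operator may be stuck")
	} else {
		fmt.Printf("reconcile counter advanced by %.0f; the operator looks alive\n", after-before)
	}
}
```

Again, this is only a workaround idea on our side; real /healthz and /readyz endpoints from the manager would be much better.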