Open polikeiji opened 1 year ago
Hey @polikeiji, we aren't doing anything to expose ports on the operator container. If there is anything wrong with the operator process, the pod should become not ready and restart.
Can you see the logs from the operator container? I would expect any issues to show up there.
If you do find any logs, feel free to share them here. You can also share your cluster spec, and we'd be happy to look for anything out of place.
@jmckulk Thank you for the reply.
> If there is anything wrong with the operator process, the pod should become not ready and restart.
I think that, unless a liveness probe is configured for the container, the operator container will become "not ready" and be restarted only when the container process exits. So I believe every operator needs an appropriate liveness probe to correctly detect that something has gone wrong, including the case where the process keeps running in a broken state.
> Can you see the logs from the operator container?
We checked the operator's logs when we hit the issue, but we couldn't find any error logs that might be relevant to it. We were running the PGO operator with debug logging enabled. The reconciliation debug messages disappeared from the logs, and only debug messages related to the PGO version check were recorded. If we hit the issue again and figure out how to reproduce it, we'll report it here :)
Hi @polikeiji! Great question about readiness and liveness probes. We'll add a story about probes for the operator to our backlog. Thanks!
Hello,
I just wanted to give this issue a small bump.
We experienced a similar problem recently where the operator just stopped processing events. There was no noticeable increase in CPU or RAM usage beforehand, but out of the blue it stopped responding and processing reconcile events, and CPU and RAM usage went down.
There were also no logs at all in the operator itself (in debug mode), and we have not been able to reproduce it so far, so I can't really open a separate issue for now, but we are still trying to reproduce it.
A liveness probe would really help :pray:. In the meantime, is there any other way to check whether the operator is alive? From what we gathered, the main process was not killed, so my guess is that one of the child processes was, but apart from checking the logs I don't see a way to detect this problem.
Does the postgres-operator have an exposed port for configuring the liveness and the readiness probes?
Based on the following code, I expected that we could use the port that controller-runtime exposes for the probes.
I confirmed that we can access the metrics on port `8080`, but I couldn't connect to port `8081` for the probes. I checked this with PGO `5.3.0` running on K8S `1.21.14`.

We haven't reported it here yet because we couldn't collect enough logs, but we hit an issue where the operator suddenly stopped its reconciliation process. At that time, the operator process itself didn't show any bad status, so K8S couldn't trigger a restart of the operator pod. To prevent this situation, we'd like to configure readiness and liveness probes for the operator.