CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

On network loss for any one of the statefulsets, the Operator throws an error. #2620

Open oumkale opened 3 years ago

oumkale commented 3 years ago

Overview

I was injecting network-loss chaos on the primary (master) database pod. After the network loss, the operator is stuck on the error below and is unable to create a new cluster or delete the current one.

Environment

Operator image details (also tried with a couple of other tags):

images:
- name: postgres-operator
  newName: registry.developers.crunchydata.com/crunchydata/postgres-operator
  newTag: ubi8-5.0.0-0
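
(This images: override lives in the kustomization used to install the operator; in postgres-operator-examples it is typically applied with something like kubectl apply -k kustomize/install, though the exact path may vary between versions.)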

EXPECTED

  1. The operator should not stay stuck on the error; it should return to a healthy state once the network recovers.

ACTUAL

  1. The operator is stuck on the error below and is unable to create a new cluster or delete the current one.

Logs

raise KubernetesError('Kubernetes API is not responding properly')\npatroni.dcs.kubernetes.KubernetesError: 'Kubernetes API is not responding properly'\n" stdout= version=5.0.1-0
time="2021-08-19T11:21:58Z" level=debug msg="reconciled cluster" file="internal/controller/postgrescluster/controller.go:275" func="postgrescluster.(*Reconciler).Reconcile" name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.1-0
time="2021-08-19T11:21:58Z" level=error msg="Reconciler error" error="command terminated with exit code 1" file="internal/controller/postgrescluster/patroni.go:147" func="postgrescluster.(*Reconciler).reconcilePatroniDynamicConfiguration" name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.1-0
cbandy commented 3 years ago

When encountering an error like this, the controller retries with backoff. I suspect you're seeing it in one of the longer waiting periods. (We use the default which maxes out at 16m40s.)
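
For reference, that ceiling comes from controller-runtime's default per-item rate limiter. Here is a minimal sketch of that default (not the operator's actual wiring, just the client-go API it builds on), showing how the retry delay grows and where 16m40s comes from:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Equivalent to workqueue.DefaultControllerRateLimiter(): per-item exponential
	// backoff starting at 5ms and capped at 1000s (= 16m40s), combined with an
	// overall 10 qps / burst 100 token bucket.
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)

	// Simulate repeated reconcile failures for one PostgresCluster key.
	key := "postgres/keycloakdb"
	for i := 0; i < 20; i++ {
		d := limiter.When(key) // delay before the next retry of this key
		fmt.Printf("failure %2d -> wait %v\n", i+1, d)
	}
	// Around the 17th consecutive failure the delay is 5ms*2^16, roughly 5.5 minutes,
	// and from the 19th failure on it stays at the 1000s (16m40s) cap.
}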

Do you have some more timestamps for those error="command terminated with exit code 1"? How long did it stay in this state?

oumkale commented 3 years ago

Hi @cbandy, Thank you for your reply.

I am only injecting network loss for a couple of seconds against one of the statefulsets. The operator is in a running state (not backoff); only the logs are showing errors. I tried both manager settings: with CRUNCHY_DEBUG: "true" it stays waiting at the error above, and with "false" the logs continue with the same error with some time gap.

I tried scaling replicas from 2 to 3 about 4-5 minutes after the operator got stuck, but it was not getting scaled; it only shows configured but not actually scaling. At the same time, I am not able to delete or create the PostgresCluster object.

cbandy commented 3 years ago

The operator is in a running state (not backoff); only the logs are showing errors.

After showing level=error msg="Reconciler error" in the logs, the controller will wait (backoff) before trying to process that PostgresCluster again. What do you mean by "running state" here?

I tried scaling replicas from 2 to 3 about 4-5 minutes after the operator got stuck, but it was not getting scaled; it only shows configured but not actually scaling.

After 5 minutes of level=error msg="Reconciler error" the time between retries is ~5.5 minutes. It is normal to see no messages logged when there is a single PostgresCluster.

🤔 Even so, it should respond to changes to the PostgresCluster spec. What do you mean by "showing configured but not actually"? Did the operator log any messages?

At the same time, I am not able to delete or create the PostgresCluster object.

Same here. The operator should respond when that PostgresCluster is deleted. How did you delete it, kubectl delete?

If you still have logs, I'd like to see the exact messages and timestamps. The ones from running with CRUNCHY_DEBUG: "true" will be more informative, if you have those.
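
For reference, deleting the cluster object directly and turning on debug logging would look roughly like this (assuming the resource, deployment, and namespace names that appear in the logs and pod listing in this thread; adjust to your setup):

kubectl delete postgrescluster keycloakdb -n postgres
kubectl -n postgres set env deployment/postgres-operator CRUNCHY_DEBUG=true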

oumkale commented 3 years ago

Operator Logs:

time="2021-09-06T10:30:51Z" level=debug msg="replaced configuration" file="internal/patroni/api.go:85" func=patroni.Executor.ReplaceConfiguration name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster stderr="2021-09-06 10:30:43,941 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67ea2b2e8>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/postgres/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:43,942 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5b6d30>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/postgres/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:46,446 - ERROR - Request to server https://10.100.0.1:443 failed: MaxRetryError(\"HTTPSConnectionPool(host='10.100.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/postgres/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5b6da0>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)'))\",)\n2021-09-06 10:30:46,447 - ERROR - Request to server https://10.100.0.1:443 failed: MaxRetryError(\"HTTPSConnectionPool(host='10.100.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/postgres/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5b6f28>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)'))\",)\n2021-09-06 10:30:47,448 - ERROR - ObjectCache.run TypeError(\"unsupported operand type(s) for -: 'NoneType' and 'float'\",)\n2021-09-06 10:30:47,449 - ERROR - ObjectCache.run TypeError(\"unsupported operand type(s) for -: 'NoneType' and 'float'\",)\n2021-09-06 10:30:49,952 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5d1198>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/postgres/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:49,953 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5d16d8>, 'Connection to 10.100.0.1 timed out. 
(connect timeout=2.5)')': /api/v1/namespaces/postgres/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:51,441 - ERROR - get_cluster\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 701, in _load_cluster\n    self._wait_caches(stop_time)\n  File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 693, in _wait_caches\n    raise RetryFailedError('Exceeded retry deadline')\npatroni.utils.RetryFailedError: 'Exceeded retry deadline'\nTraceback (most recent call last):\n  File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 701, in _load_cluster\n    self._wait_caches(stop_time)\n  File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 693, in _wait_caches\n    raise RetryFailedError('Exceeded retry deadline')\npatroni.utils.RetryFailedError: 'Exceeded retry deadline'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/usr/local/bin/patronictl\", line 11, in <module>\n    sys.exit(ctl())\n  File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1134, in __call__\n    return self.main(*args, **kwargs)\n  File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1059, in main\n    rv = self.invoke(ctx)\n  File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1665, in invoke\n    return _process_result(sub_ctx.command.invoke(sub_ctx))\n  File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1401, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 767, in invoke\n    return __callback(*args, **kwargs)\n  File \"/usr/local/lib/python3.6/site-packages/click/decorators.py\", line 38, in new_func\n    return f(get_current_context().obj, *args, **kwargs)\n  File \"/usr/local/lib/python3.6/site-packages/patroni/ctl.py\", line 1221, in edit_config\n    cluster = dcs.get_cluster()\n  File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/__init__.py\", line 664, in get_cluster\n    cluster = self._load_cluster()\n  File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 766, in _load_cluster\n    raise KubernetesError('Kubernetes API is not responding properly')\npatroni.dcs.kubernetes.KubernetesError: 'Kubernetes API is not responding properly'\n" stdout= version=5.0.0-0
time="2021-09-06T10:30:51Z" level=debug msg="reconciled cluster" file="internal/controller/postgrescluster/controller.go:277" func="postgrescluster.(*Reconciler).Reconcile" name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.0-0
time="2021-09-06T10:30:51Z" level=error msg="Reconciler error" error="command terminated with exit code 1" file="internal/controller/postgrescluster/patroni.go:147" func="postgrescluster.(*Reconciler).reconcilePatroniDynamicConfiguration" name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.0-0
time="2021-09-06T10:30:52Z" level=debug msg="replaced configuration" file="internal/patroni/api.go:85" func=patroni.Executor.ReplaceConfiguration name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster stderr= stdout="Not changed\n" version=5.0.0-0

Operator pod state:

postgres-operator-8696f5676c-psk6k     1/1     Running     0          42m

Now I want to scale down, so I apply:

oumkale@chaosnative:~/demo/postgres-operator-examples$  kubectl apply -k kustomize/keycloak/  (with a single replica this time; earlier it was 2)
deployment.apps/keycloak unchanged
postgrescluster.postgres-operator.crunchydata.com/keycloakdb configured
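
For context, the scale-down here is just the replica count on the instance set in the keycloakdb PostgresCluster manifest. A minimal excerpt of the kind of change applied (field names per the v1beta1 API; the instance set name is inferred from the keycloakdb-00-* pod names below, and all other required fields are omitted):

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: keycloakdb
spec:
  instances:
    - name: "00"
      replicas: 1   # scaled down from 2
  # ...rest of the spec (postgresVersion, storage, backups, etc.) unchanged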

But there is no change in the postgres pods:

Every 2.0s: kubectl get po -n postgres                                                                             chaosnative: Mon Sep  6 16:14:49 2021

NAME                                   READY   STATUS      RESTARTS   AGE
crunchy-alertmanager-d479c967c-4mb86   1/1     Running     0          28m
crunchy-grafana-6f4c98c4bc-sgfsg       1/1     Running     0          28m
crunchy-prometheus-5bff446746-qv9j8    1/1     Running     0          28m
keycloak-c978b8c69-dtndt               1/1     Running     0          40m
keycloakdb-00-dkkf-0                   4/4     Running     0          21m
keycloakdb-00-x6ns-0                   4/4     Running     8          25m
keycloakdb-backup-ws5q-xdmtv           0/1     Completed   0          39m
keycloakdb-repo-host-0                 1/1     Running     0          40m
postgres-operator-8696f5676c-psk6k     1/1     Running     0          41m

This is happening after a network loss of just a couple of seconds on the master. @cbandy @jkatz

importnil commented 1 year ago

@jkatz

We've faced the same issue in our managed cluster when our provider performed weekly maintenance and a bunch of pods moved to new nodes. The DB instances are stuck trying to reach the k8s API:

2023-10-30 13:06:33,217 ERROR: Request to server https://10.233.0.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.233.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/custom-ns/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dname_of_cluster%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dname_of_cluster-ha (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6a41b0a208>: Failed to establish a new connection: [Errno 113] No route to host',))",)  

Meanwhile, the operator itself can't reach the API at the same address (its logs are similar to the ones above), regardless of which node it's running on and regardless of restarts, leading to an infinite error loop.
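
A quick way to confirm whether this is pod-level networking rather than an operator bug is to probe the API server address directly from an affected instance pod. A sketch, assuming curl is available in the database container and using the service address from the log above (replace <instance-pod> with an actual pod name):

kubectl -n custom-ns exec <instance-pod> -c database -- curl -vk --max-time 5 https://10.233.0.1:443/healthz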