oumkale opened this issue 3 years ago (status: Open)
When encountering an error like this, the controller retries with backoff. I suspect you're seeing it in one of the longer waiting periods. (We use the default which maxes out at 16m40s.)
Do you have some more timestamps for those error="command terminated with exit code 1"? How long did it stay in this state?
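For context, here is a minimal sketch of where that 16m40s ceiling comes from. It assumes only what is stated above: controller-runtime's default requeue behavior, which uses client-go's exponential per-item rate limiter (base 5ms, cap 1000s = 16m40s). The item key "postgres/keycloakdb" is just an illustrative placeholder.

```go
// Illustrative sketch: print the requeue delays produced by client-go's
// exponential per-item rate limiter, the one controller-runtime uses by
// default when a Reconcile call returns an error.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Base delay 5ms, maximum delay 1000s (16m40s) -- the defaults mentioned above.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)

	item := "postgres/keycloakdb" // placeholder key for the failing PostgresCluster
	for failure := 1; failure <= 20; failure++ {
		// Each consecutive failure doubles the delay until it hits the cap.
		fmt.Printf("failure %2d -> requeued after %v\n", failure, limiter.When(item))
	}
}
```

The delay doubles from 5ms and reaches the 1000s ceiling around the 19th consecutive failure, after which every retry is 16m40s apart.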
Hi @cbandy, thank you for your reply.
I was only injecting network loss for a couple of seconds against one of the StatefulSets. The operator is in a running state (not backoff); only the logs are showing errors. I tried both manager cases: with CRUNCHY_DEBUG : "true" it keeps waiting at the above error, and with "false" the logs continue with the same error after some time gap.
After 4-5 minutes (after the operator got stuck) I tried scaling replicas from 2 to 3, but it was not getting scaled; it is only showing "configured" but not actually changing. At the same time I am not able to delete or create a PostgresCluster object.
The operator is in a running state (not backoff); only the logs are showing errors.
After showing level=error msg="Reconciler error" in the logs, the controller will wait (backoff) before trying to process that PostgresCluster again. What do you mean by "running state" here?
After 4-5 minutes (after the operator got stuck) I tried scaling replicas from 2 to 3, but it was not getting scaled; it is only showing "configured" but not actually changing.
After 5 minutes of level=error msg="Reconciler error", the time between retries is ~5.5 minutes. It is normal to see no messages logged when there is a single PostgresCluster.
🤔 Even so, it should respond to changes to the PostgresCluster spec. What do you mean by "showing configured but not actually"? Did the operator log any messages?
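As a rough sanity check of that ~5.5 minute figure (again assuming the default 5ms-base, doubling backoff described above), summing the successive delays shows that once about five minutes of backoff has accumulated, the next retry interval is itself roughly 5.5 minutes:

```go
// Back-of-the-envelope check: with a 5ms base delay that doubles on every
// failure, ~5 minutes of accumulated waiting puts the next interval at ~5.5m.
package main

import (
	"fmt"
	"time"
)

func main() {
	delay := 5 * time.Millisecond // assumed base delay of the default limiter
	var elapsed time.Duration

	for retry := 1; elapsed < 5*time.Minute; retry++ {
		elapsed += delay
		fmt.Printf("retry %2d: %v spent in backoff, next delay %v\n", retry, elapsed, 2*delay)
		delay *= 2
	}
}
```

The last line printed is roughly "retry 16: 5m27s spent in backoff, next delay 5m27s", which matches the ~5.5 minutes between retries mentioned above.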
At the same time I am not able to delete or create a PostgresCluster object.
Same here. The operator should respond when that PostgresCluster is deleted. How did you delete it, kubectl delete?
If you still have logs, I'd like to see the exact messages and timestamps. The ones from running with CRUNCHY_DEBUG : "true" will be more informative, if you have those.
Operator Logs:
time="2021-09-06T10:30:51Z" level=debug msg="replaced configuration" file="internal/patroni/api.go:85" func=patroni.Executor.ReplaceConfiguration name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster stderr="2021-09-06 10:30:43,941 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67ea2b2e8>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/postgres/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:43,942 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5b6d30>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/postgres/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:46,446 - ERROR - Request to server https://10.100.0.1:443 failed: MaxRetryError(\"HTTPSConnectionPool(host='10.100.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/postgres/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5b6da0>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)'))\",)\n2021-09-06 10:30:46,447 - ERROR - Request to server https://10.100.0.1:443 failed: MaxRetryError(\"HTTPSConnectionPool(host='10.100.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/postgres/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5b6f28>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)'))\",)\n2021-09-06 10:30:47,448 - ERROR - ObjectCache.run TypeError(\"unsupported operand type(s) for -: 'NoneType' and 'float'\",)\n2021-09-06 10:30:47,449 - ERROR - ObjectCache.run TypeError(\"unsupported operand type(s) for -: 'NoneType' and 'float'\",)\n2021-09-06 10:30:49,952 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5d1198>, 'Connection to 10.100.0.1 timed out. (connect timeout=2.5)')': /api/v1/namespaces/postgres/endpoints?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:49,953 - WARNING - Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa67d5d16d8>, 'Connection to 10.100.0.1 timed out. 
(connect timeout=2.5)')': /api/v1/namespaces/postgres/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dkeycloakdb%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dkeycloakdb-ha\n2021-09-06 10:30:51,441 - ERROR - get_cluster\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 701, in _load_cluster\n self._wait_caches(stop_time)\n File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 693, in _wait_caches\n raise RetryFailedError('Exceeded retry deadline')\npatroni.utils.RetryFailedError: 'Exceeded retry deadline'\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 701, in _load_cluster\n self._wait_caches(stop_time)\n File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 693, in _wait_caches\n raise RetryFailedError('Exceeded retry deadline')\npatroni.utils.RetryFailedError: 'Exceeded retry deadline'\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/local/bin/patronictl\", line 11, in <module>\n sys.exit(ctl())\n File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1134, in __call__\n return self.main(*args, **kwargs)\n File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1059, in main\n rv = self.invoke(ctx)\n File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1665, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 1401, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/usr/local/lib/python3.6/site-packages/click/core.py\", line 767, in invoke\n return __callback(*args, **kwargs)\n File \"/usr/local/lib/python3.6/site-packages/click/decorators.py\", line 38, in new_func\n return f(get_current_context().obj, *args, **kwargs)\n File \"/usr/local/lib/python3.6/site-packages/patroni/ctl.py\", line 1221, in edit_config\n cluster = dcs.get_cluster()\n File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/__init__.py\", line 664, in get_cluster\n cluster = self._load_cluster()\n File \"/usr/local/lib/python3.6/site-packages/patroni/dcs/kubernetes.py\", line 766, in _load_cluster\n raise KubernetesError('Kubernetes API is not responding properly')\npatroni.dcs.kubernetes.KubernetesError: 'Kubernetes API is not responding properly'\n" stdout= version=5.0.0-0
time="2021-09-06T10:30:51Z" level=debug msg="reconciled cluster" file="internal/controller/postgrescluster/controller.go:277" func="postgrescluster.(*Reconciler).Reconcile" name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.0-0
time="2021-09-06T10:30:51Z" level=error msg="Reconciler error" error="command terminated with exit code 1" file="internal/controller/postgrescluster/patroni.go:147" func="postgrescluster.(*Reconciler).reconcilePatroniDynamicConfiguration" name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.0-0
time="2021-09-06T10:30:52Z" level=debug msg="replaced configuration" file="internal/patroni/api.go:85" func=patroni.Executor.ReplaceConfiguration name=keycloakdb namespace=postgres reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster stderr= stdout="Not changed\n" version=5.0.0-0
Operator pod state:
postgres-operator-8696f5676c-psk6k 1/1 Running 0 42m
Now I want to scale down, so I apply the manifest again (with a single replica this time; earlier it was 2):
oumkale@chaosnative:~/demo/postgres-operator-examples$ kubectl apply -k kustomize/keycloak/
deployment.apps/keycloak unchanged
postgrescluster.postgres-operator.crunchydata.com/keycloakdb configured
But there is no change in the Postgres pods:
Every 2.0s: kubectl get po -n postgres chaosnative: Mon Sep 6 16:14:49 2021
NAME READY STATUS RESTARTS AGE
crunchy-alertmanager-d479c967c-4mb86 1/1 Running 0 28m
crunchy-grafana-6f4c98c4bc-sgfsg 1/1 Running 0 28m
crunchy-prometheus-5bff446746-qv9j8 1/1 Running 0 28m
keycloak-c978b8c69-dtndt 1/1 Running 0 40m
keycloakdb-00-dkkf-0 4/4 Running 0 21m
keycloakdb-00-x6ns-0 4/4 Running 8 25m
keycloakdb-backup-ws5q-xdmtv 0/1 Completed 0 39m
keycloakdb-repo-host-0 1/1 Running 0 40m
postgres-operator-8696f5676c-psk6k 1/1 Running 0 41m
This is happening after a network loss of a couple of seconds on the master. @cbandy @jkatz
We've faced the same issue in our managed cluster when our provider performed weekly maintenance and a bunch of pods moved to new nodes. The DB instances are stuck trying to reach the k8s API:
2023-10-30 13:06:33,217 ERROR: Request to server https://10.233.0.1:443 failed: MaxRetryError("HTTPSConnectionPool(host='10.233.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/custom-ns/pods?labelSelector=postgres-operator.crunchydata.com%2Fcluster%3Dname_of_cluster%2Cpostgres-operator.crunchydata.com%2Fpatroni%3Dname_of_cluster-ha (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6a41b0a208>: Failed to establish a new connection: [Errno 113] No route to host',))",)
Meanwhile, the operator itself can't reach the API at the same address (its logs are similar to the ones above), regardless of the node it's running on and/or restarts, leading to an infinite error loop.
Overview
I was injecting network loss chaos on the master database. After the network loss, the operator is stuck at the error shown in the logs above and is unable to create a new cluster or delete the current one.
Environment
Please provide the following details:
Platform: Kubernetes, EKS
Platform Version: 1.20.3, 4.7.0
PGO Image Tag: ubi8-5.0.0-0
Postgres Version: 13
Storage: hostpath, nfs, or the name of your storage class
Operator image details: Tried with a couple of other tags also
EXPECTED
The operator recovers once the brief network loss ends and continues reconciling the PostgresCluster.
ACTUAL
The operator keeps logging level=error msg="Reconciler error" error="command terminated with exit code 1" and does not act on spec changes, deletions, or new PostgresCluster objects.
Logs
See the operator logs posted above.