Closed: SeanZicari closed this issue 4 years ago.
I tried upgrading to chart version 3.5.4 but the issue was still there. Maybe I just don't know how to reset the installation properly without losing data?
I just learned about the Parallel Pod Management policy. Maybe that would be a better option than the default OrderedReady policy? Seems like it would prevent the problem I ran into from happening, because all replicas would come up at the same time and be able to find each other.
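If I understand correctly, the change would amount to something like this on the StatefulSet. This is just a sketch: the field comes from the Kubernetes StatefulSet API, I haven't checked whether the chart exposes it as a value, and the names and image below are placeholders based on this thread, not the chart's actual rendered manifest.

```yaml
# Sketch: a StatefulSet using Parallel pod management instead of the default
# OrderedReady policy, so all replicas are created at the same time.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kappa-postgresql-ha-postgresql          # hypothetical release name
spec:
  podManagementPolicy: Parallel                 # default is OrderedReady
  serviceName: kappa-postgresql-ha-postgresql-headless
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: postgresql-ha
  template:
    metadata:
      labels:
        app.kubernetes.io/name: postgresql-ha
    spec:
      containers:
        - name: postgresql
          image: docker.io/bitnami/postgresql-repmgr:latest   # image name assumed
```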
Hi @SeanZicari !
As this issue seems to be related to the postgresql-ha chart itself, I have tried reproducing it there. I am also using a GKE cluster as you specified, but unfortunately I have been unable to reproduce this problem. I have put the cluster through several scaling rounds and it seems to be working for me. Maybe I am not reproducing the steps correctly; here is my workflow:
1- Set a custom password for both postgresql and repmgr
password: mypassword
repmgrPassword: mypassword
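For context, one way these values could be supplied is via a values file passed to the install below; placing the keys under the chart's postgresql block is my assumption and is not confirmed against the chart's values.yaml:

```yaml
# my-values.yaml (sketch; key placement under "postgresql" is assumed)
postgresql:
  password: mypassword
  repmgrPassword: mypassword
```

The file would then be passed with `helm install kappa bitnami/postgresql-ha -f my-values.yaml`.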
2- Create a brand new release with these values
$ helm install kappa bitnami/postgresql-ha
NAME: kappa
LAST DEPLOYED: Tue Sep 8 10:52:24 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
...
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
kappa-postgresql-ha-pgpool-79449bf9b6-fzx9k 1/1 Running 0 7m13s
kappa-postgresql-ha-postgresql-0 1/1 Running 0 7m13s
kappa-postgresql-ha-postgresql-1 1/1 Running 0 6m39s
3- Set the replicaCount values for both postgresql and pgpool to zero:
pgpool:
  replicaCount: 0
postgresql:
  replicaCount: 0
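(Side note: the same override could also be passed inline when performing the upgrade in the next step, using standard Helm flags; kappa is the release from above.)

```console
$ helm upgrade kappa bitnami/postgresql-ha \
    --set postgresql.replicaCount=0 \
    --set pgpool.replicaCount=0
```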
4- Perform an upgrade
$ helm upgrade kappa bitnami/postgresql-ha
Release "kappa" has been upgraded. Happy Helming!
NAME: kappa
LAST DEPLOYED: Tue Sep 8 11:00:16 2020
NAMESPACE: default
STATUS: deployed
REVISION: 2
TEST SUITE: None
...
$ kubectl get pods
No resources found in default namespace.
5- Restore the replicaCount values and perform another upgrade
$ helm upgrade kappa bitnami/postgresql-ha
Release "kappa" has been upgraded. Happy Helming!
NAME: kappa
LAST DEPLOYED: Tue Sep 8 11:02:04 2020
NAMESPACE: default
STATUS: deployed
REVISION: 3
...
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
kappa-postgresql-ha-pgpool-79449bf9b6-jqjvn 1/1 Running 0 92s
kappa-postgresql-ha-postgresql-0 1/1 Running 0 91s
kappa-postgresql-ha-postgresql-1 1/1 Running 0 66s
I have done steps 3-5 three times and on every occasion the pods were able to start up normally.
Thanks!
I appreciate you trying to reproduce the problem! There are a couple of things that are different about my situation. I am not changing the replicaCount; I scaled the entire cluster up and down using GKE's node scaling. I probably did that at least 6 or so times before the problem occurred.
You also may need to put some data into the database. I don't know if the problem will occur without actual data to replicate.
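Something along these lines, with gcloud, is what I mean by GKE's node scaling (cluster name, node pool and zone are placeholders; some rounds may have been done through the Cloud Console instead):

```console
$ gcloud container clusters resize my-cluster --node-pool default-pool \
    --num-nodes 0 --zone us-central1-a
# ...and later back up:
$ gcloud container clusters resize my-cluster --node-pool default-pool \
    --num-nodes 3 --zone us-central1-a
```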
Hi @SeanZicari
I am not very familiar with the scaling capabilities of GKE clusters. Are you using the autoscaler option, or are you deleting the nodes from the cluster and then adding them back?
In any case, would you mind trying to scale the cluster using the provided parameters and Helm? I don't really know how GKE performs the scaling operation, but this could be related to these comments: https://github.com/bitnami/charts/issues/3431#issuecomment-674836237 and https://github.com/bitnami/charts/issues/3431#issuecomment-679947772
Regards
Hi @joancafom, an easy way to reproduce the issue is to deploy the chart.
Since it is a StatefulSet, postgresql-1 won't start until postgresql-0 is running.
However, postgresql-0 won't start because it is not the primary node.
I agree with @SeanZicari: the solution might be Parallel pod management, but we need to make sure that there are no side effects.
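To double-check which policy the rendered StatefulSet currently uses, something like this should work (the StatefulSet name depends on the release; kappa is reused from the example above):

```console
$ kubectl get statefulset kappa-postgresql-ha-postgresql \
    -o jsonpath='{.spec.podManagementPolicy}{"\n"}'
# prints the policy in effect (OrderedReady is the StatefulSet default)
```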
Is it possible that setting a higher postgresql.repmgrConnectTimeout would allow the first pod to stay up long enough for Kubernetes to bring up the second pod, or is the pod not considered healthy until repmgr has fully started up? I hadn't thought about increasing the connect timeout before.
Though I do think Parallel pod management more closely matches traditional deployments, in which both instances are available at the same time.
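If that turns out to help, I imagine the override would look roughly like this; the key name is taken from the parameter mentioned above, its placement under the postgresql block is an assumption, and the value shown is arbitrary:

```yaml
# sketch: raise the repmgr connection timeout via chart values
postgresql:
  repmgrConnectTimeout: 30   # arbitrary value; the chart's default was not checked here
```

It could then be applied with `helm upgrade kappa bitnami/postgresql-ha -f my-values.yaml`.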
Hi all, so I understand the issue is more related to how the pods are created than to the scaling itself, right? I will test the scenario @jp-gouin is describing, with and without Parallel pod management, to check whether that solves the issue and whether that solution has any side effects. Thank you for the suggestions, guys!
Hi guys! We have tested the changes suggested by @jp-gouin and it seems that the cluster recovers properly. You can see the test steps in the PR above. Regards.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
We are still looking into the issue. More info at the PR #3681
@miguelaeh Awesome to see you taking action on this. Thanks again for that!
Hi @SeanZicari, could you check if the new chart version fixes this issue for you?
Which chart: postgresql-ha 3.4.3
Describe the bug
After running a GKE cluster for a while in which postgresql-ha was a subchart supporting a Django application, I scaled the cluster down for a few days. After scaling back up, there were issues logging into the Django site: I kept immediately being logged out with no error message. I suspect the session information wasn't being written to the database, so Django kept "forgetting" I was logged in. It seems the scale-up didn't bring all the services up correctly. I tried scaling down and then up once more (right before I was supposed to use the Django site for a presentation) and at that point postgresql-ha would not come online.
To Reproduce
Steps to reproduce the behavior: run postgresql-ha on GKE with data in the database, then scale the cluster's nodes down and back up several times.
Expected behavior
postgresql-ha would come back online correctly, without the apparent race condition in which the first replica waits for the second replica while the StatefulSet will not start the second replica until the first one is up.
Version of Helm and Kubernetes:
helm version:
kubectl version:
Additional context
Here is the log output from postgresql-ha-postgresql-0 when it started up and failed because it was waiting for postgresql-ha-postgresql-1:
postgresql.log