ClusterLabs / resource-agents

Combined repository of OCF agents from the RHCS and Linux-HA projects
GNU General Public License v2.0

pgsql not deleting PGSQL.lock file when another node gets master role #699

Open playmobil77d opened 8 years ago

playmobil77d commented 8 years ago

Hi,

I have a 2-node cluster using PostgreSQL synchronous streaming replication. I don't have a preference for which node holds the master role. I followed this documentation: http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster, and the replication works great.

I just have a question about the PGSQL.lock file.

When the master crashes (node A), the resources switch to the hot standby slave (node B), which is fine: I can create records in the DB on node B. But when the old master (node A) restarts, I get the error "My data may be inconsistent" because the PGSQL.lock file is still present on this node. I don't understand why this file is not deleted on start when the master role is already on another node of the cluster.

In my mind, I'd like it to work like this: when A crashes, the master role switches to the second node (B), which has the latest data because the replication is synchronous, so it becomes the reference. Then, when node A restarts, Pacemaker puts it in the slave (hot standby) role, activates the replication, and that's all. Later, if B crashes, the master role switches back to A, and so on.

When I delete the PGSQL.lock file manually, everything works: the slave synchronizes with the master.
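
(For reference, the manual workaround looks roughly like this; a sketch assuming the lock file sits in the RA's default tmpdir /var/lib/pgsql/tmp and the resource is named pgsql, both of which may differ in your setup:)

rm -f /var/lib/pgsql/tmp/PGSQL.lock      # on node A, the old master
pcs resource cleanup pgsql               # let Pacemaker retry the start on node A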

Is this the normal operation? Is there a way to do this automatically? Is there a function in the pgsql RA through which Pacemaker can delete this lock file when the master role is already on another node of the cluster?

kskmori commented 8 years ago

Hi,

Is this the normal operation?

Yes, it is intended to be done manually, at least in the current implementation.

There are two reasons for this:

  1. To prevent a node from starting solely with outdated data.
  2. To prompt users to make sure the proper recovery steps for the database have been done.

You've got a really good point, and it would probably resolve the former issue, but the latter one is a bit more complicated.

In the case of your scenario,

Then, when node A restarts, Pacemaker puts it in the slave (hot standby) role, activates the replication, and that's all.

that replication may fail after startup, depending on the internal state of PostgreSQL (and on the version), if you have not done the proper recovery steps on node A. You can notice such a failed state by "pgsql-status: HS:alone", but it is really confusing because it "looks" as if the resource and the replication started successfully, yet it eventually goes to the failed state after a certain period of time.

The only way to operate the cluster stably is to always take a basebackup from the master node and restore it onto node A before starting it.

So the lock file indicates that human intervention is always necessary at this point.
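
(In practice the intended manual flow on node A looks roughly like this; a sketch only, where nodeB, the paths /var/lib/pgsql/data and /var/lib/pgsql/tmp, and the resource name pgsql are example values from a typical setup:)

# rebuild node A's data from the current master before rejoining the cluster
su - postgres -c 'rm -rf /var/lib/pgsql/data/*'
su - postgres -c 'pg_basebackup -h nodeB -D /var/lib/pgsql/data -X stream -P'
rm -f /var/lib/pgsql/tmp/PGSQL.lock      # only now is it safe to drop the lock
pcs resource cleanup pgsql               # clear the failure so the node starts as hot standby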

Is there a way to do this automatically? Is there a function in the pgsql RA through which Pacemaker can delete this lock file when the master role is already on another node of the cluster?

Not right now, but of course suggestions are always welcome if you have a better idea to resolve the issue above.

Recent versions of PostgreSQL have a "replication slot" feature, which makes it possible to omit taking the basebackup when recovering. It would be great if we could use those features for more stable and less manual operation.
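
(For reference, a rough sketch of that feature, assuming PostgreSQL 9.4 or later; the slot name node_a_slot is arbitrary and the current RA does not set this up for you:)

-- on the master: reserve WAL for the standby so it can catch up without a new basebackup
SELECT pg_create_physical_replication_slot('node_a_slot');

# in recovery.conf on the standby
primary_slot_name = 'node_a_slot'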

Thank you,

furynick commented 5 years ago

I encountered the same problem with a two-node cluster. When the resource is moved from A to B, the promote is done so B becomes master, but A cannot be started because of the lock file. I added a test to demote: if the replication state is "streaming", the lock is removed, since the data can be considered consistent. I'll post the patch file after some tests.

furynick commented 5 years ago

Here is the patch to the pgsql resource agent I'm using:

*** /usr/lib/ocf/resource.d/heartbeat/pgsql     2019-01-15 12:42:42.262279097 +0100
--- /home/tools/postgres/pgsql   2019-01-16 10:04:35.384597000 +0100
***************
*** 820,825 ****
--- 820,833 ----
          ocf_log info "stop_escalate(or stop_escalate_in_slave) time is adjusted to ${stop_escalate} based on the configured timeout."
      fi

+     # Check streaming status
+     count=`exec_sql "${CHECK_SYNC_SQL}"`
+     if [ $count -eq 0 -a "$1" = "master" ]; then
+         ocf_log info "In sync, removing $PGSQL_LOCK."
+         rm -f $PGSQL_LOCK
+     fi
+
+
      # Stop PostgreSQL, do not wait for clients to disconnect
      if [ $stop_escalate -gt 0 ]; then
              runasowner "$OCF_RESKEY_pgctl -W -D $OCF_RESKEY_pgdata stop -m fast"
***************
*** 1907,1912 ****
--- 1917,1923 ----
          PROMOTE_ME="1000"

          CHECK_MS_SQL="select pg_is_in_recovery()"
+         CHECK_SYNC_SQL="select count(state) from pg_stat_replication where state != 'streaming'"
          CHECK_SYNCHRONOUS_STANDBY_NAMES_SQL="show synchronous_standby_names"
          ocf_version_cmp "$version" "10"
          rc=$?
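
(To check by hand what the added test sees, the same query can be run on the current master; a usage sketch, adjust the user and port to your setup. It should print 0 when every connected standby is streaming:)

psql -U postgres -At -c "select count(state) from pg_stat_replication where state != 'streaming'"
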
BrightUser2019 commented 5 years ago

Hello @furynick

Did you test and verify the above changes with streaming replication? Will they work in async replication mode?

Thanks !

furynick commented 5 years ago

Hello @BrightUser2019, these changes are intended for streaming replication and work with both sync and async modes. However, I had some issues and no time yet to investigate. A manual move almost works and the previous master becomes a slave, but sometimes the resources don't move at all, probably a misconfiguration or bad usage of pcs. Automatic failover occurs frequently; monitoring seems to fail in some unclear situations, in which case the demote mostly doesn't work and the slave doesn't start. Sometimes the promote fails too and the master doesn't start, leading to no database at all.

BrightUser2019 commented 5 years ago

Thanks for your reply @furynick. Yes, the behaviour is not uniform or the same every time. We noticed that deleting the lock file and executing pcs resource cleanup works most of the time but sometimes fails.

We need to know the behaviour for sure so that we can use this in production, but unfortunately there is no guidance/manual available for setting up the cluster.

@t-matsuo, any advice? We are trying to use Pacemaker + Postgres for an active-passive setup. The failover works fine, but to get the old node back into the cluster we need steps that will work every time.

The PGSQL.lock file gets created on the new primary; however, as per some help links, it should get created on the failed primary. Why is that?

furynick commented 5 years ago

@BrightUser2019: the lock file is not automatically removed, so it remains on the failed primary.

Did you try the guidance from https://clusterlabs.github.io/PAF/ to set up automatic failover? Given the results of this solution in a non-production environment, we abandoned automatic failover, so I didn't check PAF; perhaps you'll have more luck with it.

BrightUser2019 commented 5 years ago

@furynick, regarding the PGSQL.lock file, the behaviour I am seeing is that the lock file is created on the node that is being promoted, not on the failed node. I am confused about this.

furynick commented 5 years ago

Exactly, but the failed node, before failing, was also promoted, so the lock file was created at that time. The lock is not removed by the current resource agent; that's one point of my patch, to allow the slave to reconnect automatically.

mankel08 commented 5 years ago

@furynick Facing the same problem with a 2-node cluster. Did you get a chance to test your patch? Did you have to make additional changes to pcs?

furynick commented 5 years ago

I tested my patch in a non-production environment with some success.

Manual failover gives good results within a couple of seconds; the problem comes from the monitor mechanism, which fails under some unclear conditions. This results in a stopped slave with luck, or all nodes stopped without.

mankel08 commented 5 years ago

Thanks for the quick response @furynick. Do you think this has something to do with the migration-threshold property that we set to 1? If the fail count is not reset, Pacemaker will not attempt to start the pgsql resource when fail-count >= migration-threshold.
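
(If it helps, the fail count can be inspected and cleared by hand; a sketch assuming the resource is named pgsql:)

pcs resource failcount show pgsql     # show current fail counts per node
pcs resource failcount reset pgsql    # reset them so Pacemaker will try the start again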

furynick commented 5 years ago

I'm not able to be sure of that, but probably. I also suspect a connectivity loss between nodes or high server load.

knakahira commented 5 years ago

@furynick, excuse me for cutting in. (Matsuo has transferred to other work and our team is now taking over his work.)

I think that removing PGSQL.lock should be done by the start operation, not the stop (demote) operation. It is necessary to prevent a node from starting solely with outdated data, because the old master cannot guarantee that its data is up to date. If you want to restore the old master automatically, you need to implement a process that deletes PGSQL.lock after determining, at startup, whether the node can be synchronized with the new master. (There are many issues to consider and I have no good ideas at the moment.)

jairamgauns commented 5 years ago

Well, I am also setting up a 2-node HA cluster with failover and failback, and I am following the same steps. So @knakahira, as @furynick stated earlier, he is using sync replication, which means the data should be the same between the failed master and the slave being promoted to master. Deleting PGSQL.lock should not be a problem, right? Or do you see that the failed master could be ahead in data, which could prevent it from starting as a hot standby even though replication is in sync mode?

knakahira commented 5 years ago

@jairamgauns, PGSQL.lock is necessary to prevent a failed master from starting up as the master again. Preventing it from starting as a hot standby is not its purpose.

The failure case that causes the problem is as follows.

  1. Master hostA crashes and its system is rebooting.
  2. Hot standby hostB is promoted as the new master.
  3. New master hostB crashes, just like the old master hostA.
  4. The rebooted old master hostA is promoted as the new master, and the DB data updated between steps 2 and 3 is lost.

Of course, this can be prevented by disabling automatic startup. If the operator is careful, they will not accidentally start the failed master on its own. However, the pgsql RA needs to handle these problems itself.
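
(For example, with pcs the cluster stack can be kept from starting automatically at boot on the failed node; a sketch, exact syntax may vary with the pcs version:)

pcs cluster disable nodeA    # corosync/pacemaker will no longer autostart on nodeA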

I think that if a new master that the node can synchronize with is alive at step 4, the pgsql RA can remove the PGSQL.lock file. This is why I think the lock file should be removed during the start operation.
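
(A rough conceptual sketch of that idea, not actual RA code; the lock path is an example, and a real implementation would also have to verify that this node can actually catch up with the new master:)

# run during the start operation on the rebooted old master
PGSQL_LOCK=/var/lib/pgsql/tmp/PGSQL.lock
if crm_mon -1 | grep -q "Masters:"; then
    # another node already holds the master role, so this node will not be
    # started solely with possibly outdated data; drop the lock and let it
    # come up as a hot standby
    rm -f "$PGSQL_LOCK"
fi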