ClusterLabs / PAF

PostgreSQL Automatic Failover: High-Availability for Postgres, based on Pacemaker and Corosync.
http://clusterlabs.github.io/PAF/

Add parameter to track highest reached timeline #57

Open YanChii opened 7 years ago

YanChii commented 7 years ago

Hi,

I have a proposal for an additional data consistency measure that will help prevent wrong promotion decisions. The proposal is to create a permanent parameter that stores the highest timeline number ever reached in this database cluster. The parameter is saved in the post-promote phase and consulted in pre-promote. It will ensure that a failed master is never promoted.

Details:

post-promote: save the new timeline value to the crm_config database. Why crm_config and not a private attribute: a crm parameter is permanent across reboots/crashes, it is node independent, and it is consistently reachable from any node within the quorum partition. Format:

    crm_attribute --lifetime forever --type crm_config --name "$name" --update "$val"

pre-promote: get the timeline value of the local database and compare it to the global highest timeline value. If the local timeline is lower than the global highest, abort the promotion (set the attribute to abort).
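To make the two steps concrete, here is a minimal sketch in shell; the attribute name postgres-highest-timeline and the use of pg_controldata to read the local TL are illustrative assumptions, not the actual patch:

```bash
# Sketch only: "postgres-highest-timeline" and PGDATA are illustrative.

# post-promote: persist the timeline of the freshly promoted instance.
local_tl=$(pg_controldata "$PGDATA" \
           | sed -n "s/^Latest checkpoint's TimeLineID: *//p")
crm_attribute --lifetime forever --type crm_config \
              --name "postgres-highest-timeline" --update "$local_tl"

# pre-promote: refuse to promote a node whose timeline lags behind the
# highest timeline ever recorded for the cluster.
highest_tl=$(crm_attribute --type crm_config \
             --name "postgres-highest-timeline" --query --quiet)
if [ "${local_tl:-0}" -lt "${highest_tl:-0}" ]; then
    exit 1   # abort the promotion on this node
fi
```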

Why it is needed:

I'm halfway through implementing the global timeline check, and I've opened this issue to ask whether this sounds desirable to you (my aim is to integrate as many changes as possible back into your project).

Jan

ioguix commented 7 years ago

Hello @YanChii,

This is something I had in mind for a while as well.

However, setting attributes in the CIB during a transition breaks it. That means we lose the recovery and move detection we get thanks to the notify variables. We should be fairly safe with your patch though, as you set the global TL in post-promote, after most of the job has been done. A new transition should have almost nothing left to do but constraint- and colocation-related tasks, I guess.

Did you try your patch with a recovery/move scenario? What did you find in the logs?

I had some other approaches in mind to achieve this goal. I'll try to work on this tomorrow.

@YanChii, sorry for my last patch conflicting with #59 and the long wait for my answers :/

Cheers,

ioguix commented 7 years ago

@YanChii,

About your auto-failback project, the best way to detect that a failed master is not able to catch up with the new master is to compare its TL+LSN with the TL history. If the old master's TL+LSN is greater than the TL fork LSN, then you must use pg_rewind on it.
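For illustration, a hedged sketch of this comparison, assuming the new master forked to TL 32 and the old master ran on TL 31; the lsn_to_int helper is made up for the example, this is not PAF code:

```bash
# Sketch only: assumes the new master forked to TL 32 (history file
# 00000020.history) and the old master was on TL 31.
lsn_to_int() {
    # LSNs print as "hi/lo" in hex; fold them into one comparable integer.
    local hi lo
    IFS=/ read -r hi lo <<<"$1"
    echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

# Fork point: the LSN at which TL 31 was abandoned, read from the new
# master's timeline history file (lines: parentTL, switch LSN, reason).
fork_lsn=$(awk '$1 == 31 {print $2}' 00000020.history)

# Last write position of the old master, from its control data.
old_lsn=$(pg_controldata "$PGDATA" \
          | sed -n 's/^Latest checkpoint location: *//p')

if [ "$(lsn_to_int "$old_lsn")" -gt "$(lsn_to_int "$fork_lsn")" ]; then
    echo "old master wrote past the fork point: pg_rewind required"
fi
```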

YanChii commented 7 years ago

Hi @ioguix, thank you for your answer. I have tested the failover scenario and I don't see any problems with failover or switchover. I have not seen any transition problems, and it successfully forbids the election of a server with a lower timeline. It is true that I'm seeing this in the logs:

Jan 23 10:30:45 cluster6 crmd[5566]: notice: Transition aborted by cib-bootstrap-options-postgres-6366165775826891940-highest-timeline doing modify postgres-6366165775826891940-highest-timeline=31: Non-status change

But it doesn't affect the functioning. What do you think about it?

Actually, I think the CIB is not the best place for this parameter, but I don't know of any other location that is persistent across reboots and also across node disconnect/rejoin (and thus can be queried by any node at any time).

Regarding the TL+LSN (you probably meant automatic rewind+rejoin of the old master to the cluster, not auto-failback): I'm not sure I understand. Before you start the DB on a node, you simply check whether the local TL is lower than the TL written in the CIB. You don't have access to other nodes' LSNs if you are not in the promote phase. pg_rewind should be run in the start or pre-start phase.

Jan

ioguix commented 7 years ago

It is true that I'm seeing this in the logs:

Jan 23 10:30:45 cluster6 crmd[5566]: notice: Transition aborted by cib-bootstrap-options-postgres-6366165775826891940-highest-timeline doing modify postgres-6366165775826891940-highest-timeline=31: Non-status change

But it doesn't affect the functioning. What do you think about it?

As I wrote, this happens during the post-promote action, so most of the work has already been done; it shouldn't be a problem.

Actually, I think the CIB is not the best place for this parameter, but I don't know of any other location that is persistent across reboots and also across node disconnect/rejoin (and thus can be queried by any node at any time).

This is kind of a hack, but Ken Gaillot gave a solution to create private attributes using a node that will never exist in the cluster, see:

https://www.mail-archive.com/users@clusterlabs.org/msg03683.html

I suppose you could use a non-existent node name known by all your nodes, where you could set this private attribute. Because this attribute will not survive a reboot, you will have to create it upon cluster startup. Did you study this solution?
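For illustration, a sketch of that trick with attrd_updater (the node name paf-shared and the attribute name are made up):

```bash
# "paf-shared" must never match a real node name; attribute name is made up.
attrd_updater --node paf-shared --name highest-timeline \
              --update 31 --private

# Any node in the quorum partition can read it back while attrd is running:
attrd_updater --node paf-shared --name highest-timeline --query
```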

Another way would be to use the "Storable" module to store this information locally in a file on each node. This was used in the v1.1 release; if you need a code sample, see: 5741b74e54b7aa5df00c8bf4f60b45fdea8db3ac

Regarding the TL+LSN (you probably meant automatic rewind+rejoin of the old master to the cluster, not auto-failback):

OK, I didn't mean an auto-failback all the way to getting the old master back as master, just getting it back into the cluster as a standby. You are right.

I'm not sure I understand. Before you start the DB on a node, you simply check whether the local TL is lower than the TL written in the CIB. You don't have access to other nodes' LSNs if you are not in the promote phase. pg_rewind should be run in the start or pre-start phase.

Checking whether the TL of the old master is lower than the current "cluster TL" is not enough to decide if we need pg_rewind or not. For instance, during a switchover, ALL other standbys have a TL lower than the new master's, and we don't need to run pg_rewind on them because, before the promotion, they were all lagging behind or in sync with the new master.

But anyway, I suppose you could always call pg_rewind whenever the master changes, and remove the dead code in PAF regarding the switchover logic, to keep things clean and simple. I suppose pg_rewind doesn't hurt if the node getting back into replication with the new master was already lagging behind it before the TL fork.

YanChii commented 7 years ago

Hi @ioguix,

This is kind of a hack, but Ken Gaillot gave a solution to create private attributes using a node that will never exist in the cluster, see:

https://www.mail-archive.com/users@clusterlabs.org/msg03683.html

I suppose you could use a non-existent node name known by all your nodes, where you could set this private attribute. Because this attribute will not survive a reboot, you will have to create it upon cluster startup. Did you study this solution?

Yes, I've seen this conversation. From your proposal, I see these approaches:

1) A variable will be stored under the nonexistent node name, and if it doesn't exist during the start of a resource, it will be created by querying pg_controldata (see the sketch after this list).
2) A variable will be stored only in a local file on all nodes. It will be updated everywhere in the post-promote phase.
3) A variable will be stored under the nonexistent node name and also in a local file on all nodes. The local file will be loaded at cluster startup to restore the lost variable.
4) A variable will be stored in crm_config, as implemented in #59.
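For illustration, a rough sketch of approach 1), reusing the hypothetical paf-shared node and attribute name from the earlier example:

```bash
# Sketch of approach 1), reusing the made-up names from the earlier example:
# if a full cluster restart wiped the attribute, recreate it at resource
# start from the local control data (attrd_updater --query should fail
# when the attribute does not exist).
if ! attrd_updater --node paf-shared --name highest-timeline --query 2>/dev/null; then
    local_tl=$(pg_controldata "$PGDATA" \
               | sed -n "s/^Latest checkpoint's TimeLineID: *//p")
    attrd_updater --node paf-shared --name highest-timeline \
                  --update "$local_tl" --private
fi
```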

A few thoughts on this, correct me if I'm wrong:

Conclusion: using a nonexistent node name also seems a bit hacky to me (and a bit fragile), and I still consider a permanently stored CRM attribute the safer and easier solution. But it can be done. Even approach 1) will probably be sufficient.

What's your opinion?

But anyway, I suppose you could always call pg_rewind whenever the master changes, and remove the dead code in PAF regarding the switchover logic, to keep things clean and simple.

You don't need to call pg_rewind on a master change. The only place to call pg_rewind is in the pre-start phase of the resource on the node that has a lower TL. Not in the check/probe phase, not after a master change. After the pre-start -> pg_rewind -> start sequence, you are sure the slave will catch up, and you don't need to do anything else until you are about to start the resource again.

ioguix commented 7 years ago

Hi,

I started a new branch on my repo to work on the TL check during failover, for safer election code.

You can find a diff of my current code there: https://github.com/dalibo/PAF/compare/master...ioguix:check_tl_during_election

As you will notice, I don't track the highest TL existing in the cluster; I only compare the TLs of the remaining standbies during election to make sure the one selected for promotion has the highest existing TL. It seems to me we don't need to track the highest TL in PAF to achieve this goal.
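Not the actual branch code, but a rough sketch of the idea, with placeholder attribute and node names:

```bash
# Sketch only, not the branch code; "pg-tl" and the node names are placeholders.
# Each standby publishes its TL as a transient attribute during pre-promote:
my_tl=$(pg_controldata "$PGDATA" \
        | sed -n "s/^Latest checkpoint's TimeLineID: *//p")
attrd_updater --name pg-tl --update "$my_tl"

# The node chosen for promotion then compares its TL against the others':
for node in srv1 srv2 srv3; do
    other_tl=$(attrd_updater --name pg-tl --query --node "$node" \
               | sed -n 's/.*value="\([0-9]*\)".*/\1/p')
    if [ "${other_tl:-0}" -gt "$my_tl" ]; then
        exit 1   # a standby with a higher TL exists: abort this promotion
    fi
done
```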

Considering your point about making sure the old master cannot be promoted by mistake, I'm still on the fence about whether it's a real threat or not.

Last, you wrote:

if the local timeline of the DB is lower than the global highest timeline during the start of a local resource, we have successfully identified a failed master (or greatly lagging slave) that needs a rewind or basebackup

This is not true. A clean standby can have a lower TL and will not need pg_rewind to catch up with the new master.

You don't need to call pg_rewind on a master change.

Yes, I know, that's how PAF works today :)

The only place to call pg_rewind is in the pre-start phase of the resource on the node that has a lower TL.

Wrong. After a switchover, all standbies have a lower TL until they catch up with the new master (either by WAL shipping or streaming replication).

Not in the check/probe phase, not after a master change. After the pre-start -> pg_rewind -> start sequence, you are sure the slave will catch up, and you don't need to do anything else until you are about to start the resource again.

I'm not sure I understand you. But what I was trying to point out is that maybe you can call pg_rewind every time, even when everything looks clean. I guess it will not hurt if there is no rewind work to do, and your code will just be simpler. Mind you, maybe pg_rewind can actually check whether the local instance needs a rewind using a dry run?
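For what it's worth, pg_rewind does ship a --dry-run (-n) option; a sketch, with a placeholder connection string:

```bash
# "new-master" is a placeholder host. With --dry-run, pg_rewind stops
# before modifying anything.
pg_rewind --target-pgdata="$PGDATA" \
          --source-server="host=new-master user=postgres" \
          --dry-run
# Its output says "no rewind required" when the target never diverged from
# the source, and "servers diverged at ..." when a real rewind is needed.
```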

Anyway, I'll try to be on IRC this week, in the #clusterlabs chatroom on freenode.net, if you need to discuss this.

ioguix commented 7 years ago

Hi @YanChii,

Following our discussion on IRC, I started a branch in my repository to allow more than one PAF resource in the same cluster. My limited tests so far look good. I plan to merge in a few days.

See: https://github.com/dalibo/PAF/compare/master...ioguix:rsc_name_in_attr

Regards,

YanChii commented 7 years ago

Your solution to rename the variables is quite elegant and simple. I'm currently building a new virtual environment to test your changes. After the test I'll put it into production, as I already have a deployment with multiple PAF resources. Thanks for the quick fix.

Jan