dyoakley opened 6 years ago
Hi,
Master scores are delivered by the current master of your cluster based on its local data from pg_stat_replication. The scores are sorted by the highest write_lsn (or write_location for pg < v10) it has received from the standbys. The write location is one step after pg_last_xlog_receive_location.
Of course, there can be some lag between what the master knows and what the standby has really received. Moreover, this knowledge might be wrong an instant later. That's why, during failover, the designated master rechecks its last LSN against the other existing standbys to make sure it is the best one.
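For illustration only (exact column names depend on the major version), this is roughly the data on the master that the scoring is derived from:

```sql
-- Illustration: the master's local view of its standbys, ranked by the WAL
-- position each one has written. On pg >= 10 the column is write_lsn;
-- on 9.x it is write_location.
SELECT application_name, state, write_lsn, sync_priority, sync_state
FROM pg_stat_replication
ORDER BY write_lsn DESC;
```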
Yes, that is what I was assuming: that the failover as designed is doing the right thing. The reason I filed the bug is the bid cycling and the noisy INFO-level spam it produces. Certainly, in the prioritized sync commit case, the highest-priority standby will always be the best choice, so any other scoring is just a distraction. Perhaps that log message is no longer needed, or alternatively it should just be lowered to debug level. Thoughts?
In fact, there are a lot of asynchronous pieces in this puzzle:
- the standbys' feedback is not simultaneous
- the standbys' syncs of WAL to disk are not simultaneous
- the loads on the standbys are not the same
All in all, you could perfectly well have a non-synchronous standby that has synced all WAL entries sent before the one declared synchronous...
Maybe you could try to specify that the master must be in sync with both of your standbys? https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html#GUC-SYNCHRONOUS-STANDBY-NAMES
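For example (the names are just placeholders for whatever application_name your standbys connect with), on 9.6 that could look something like:

```sql
-- Assumption: the standbys connect with application_name pghost1 / pghost2.
-- Require acknowledgement from both standbys before a commit returns (9.6 syntax).
ALTER SYSTEM SET synchronous_standby_names = '2 (pghost1, pghost2)';
SELECT pg_reload_conf();
```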
Thanks for the insight.
Keep in mind I am interested in supporting minimal data loss; performance is secondary to that. I perceive the way to do that with postgres is with sync commit. Bringing in an agent to dynamically pick a slave in any general sort of way is problematic, but it seems to me that by using priority commit we can bring some order to our chaotic distributed world if we are willing to trade off some performance for it, which I am (and I would think anyone seriously considering sync commit would be too).
So let me frame this again. It's true there are lots of places for async behavior due to the nature of how the pieces fit together. However, if I enable sync commit with priority, then it seems like we know exactly which online standby should be cut over to, because we have told the master which standby must be synced before a commit will complete (if that standby is online). I am suggesting that 1) this is a reasonable and helpful configuration option, and 2) an agent like PAF could take advantage of these constraints, since the information is all in postgresql.conf, and behave accordingly. This means all pg agents can reach consensus quickly on which standby to attempt to promote, without a lot of querying, because the priority of the standbys is static.
I think this is what it means for an agent that wants to support priority-sync-commit: 1) no need to do lag-based score updates; the score is determined by online state plus the priority order of the standby hosts/apps, and 2) on master failure, attempt to cut over to the highest-priority online standby.
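As a sketch of what I mean (just an illustration of what the master already exposes, not PAF code), the static priority and current sync state are already visible on the master:

```sql
-- Sketch: everything a priority-based score would need is already in the
-- master's view of its standbys; no lag measurement required.
SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication
ORDER BY sync_priority;
```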
To digress a little, and I apologize if I'm stating the obvious: load balancing/performance versus minimal data loss are competing requirements; you can't have both. But it seems like we can choose which one we want and let the agent act accordingly. Or maybe we need a different agent altogether, but it seems like there is a lot in common with the current PAF agent, and it seems like it is attempting to handle valid postgres configurations.
If we can agree on this use case and a general approach, and I can get a little guidance since I'm not yet an agent or PAF guru, I'm happy to implement and contribute back.
Comments on your notes are below.
In fact, there are a lot of asynchronous pieces in this puzzle:
- the standbys' feedback is not simultaneous
Yes, but sync commit will not complete until the standby has received the data and, optionally, written it to disk (see the next item). So this is another good reason, in the priority-sync-commit configuration, not to depend on the standbys' reported replication state but only on the master's view of the standbys.
- the standbys' syncs of WAL to disk are not simultaneous
I assume you mean "instantaneous". If so, this is configurable. With sync commit you can decide whether you want the commit to return when the data is on the standby node (synchronous_commit = remote_write) OR when the data is also written to disk on the standby (synchronous_commit = on); see the example after this list. For separation-of-concerns reasons this is a good thing: it gives a reliability/performance trade-off. I might be happy that my master's commit has just made it to the first standby, or I might require waiting longer so it is safely on disk.
- the loads on the standbys are not the same
True, but I'm not sure this is relevant, given that my use case prefers to minimize data loss rather than maximize performance.
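To make the remote_write vs. on distinction above concrete, here is a rough illustration (shown via ALTER SYSTEM, though synchronous_commit can also be set per session or per transaction):

```sql
-- Alternative 1: a commit returns once the sync standby has received the WAL
-- and handed it to its OS (not necessarily flushed to the standby's disk).
ALTER SYSTEM SET synchronous_commit = 'remote_write';
-- Alternative 2: a commit also waits for the sync standby to flush that WAL to disk.
-- ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
```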
All in all, you could perfectly well have a non-synchronous standby that has synced all WAL entries sent before the one declared synchronous...
Yes, but again the point is not whether the standby has synced the data it knows about, but rather that the master's "contract" is that it guarantees the standby that is first in the sync standby names list will have any data written to it before a commit completes.
Maybe you could try to specify that the master must be in sync with both of your standbys? https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html#GUC-SYNCHRONOUS-STANDBY-NAMES
Yes, I considered that, but it also seems to complicate things in other ways that worry me. I think it requires two standbys online at all times or a master commit would hang. It also means waiting for each of them to complete before the master's commit completes. That seems like a big hit when really all I require is that each commit be synced to one standby (the first standby), providing reasonable protection from data loss in case of master failover.
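What I'm describing would instead be a single synchronous standby chosen by a fixed priority order, roughly like this (again, the application_name values are assumptions standing in for my hosts):

```sql
-- Illustration: exactly one synchronous standby at a time, chosen in this
-- priority order. On 9.6 a plain list means "1 sync standby, by priority";
-- pg 10+ can spell it explicitly as 'FIRST 1 (pghost1, pghost2, pghost3)'.
ALTER SYSTEM SET synchronous_standby_names = 'pghost1, pghost2, pghost3';
SELECT pg_reload_conf();
```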
Hi,
Sorry for getting back to you so late; I just realized this answer has been sitting on my todo list for 4 weeks now :/
I apologize if I'm stating the obvious: load balancing/performance versus minimal data loss are competing requirements; you can't have both
We agree.
the standbys' syncs of WAL to disk are not simultaneous
I assume you mean "instantaneous".
No, I meant "simultaneous". They do not all sync WAL to disk at the same time.
But anyway, this is the most important point in this discussion, I think:
All in all, you could perfectly well have a non-synchronous standby that has synced all WAL entries sent before the one declared synchronous...
Yes, but again the point is not whether the standby has synced the data it knows about, but rather that the master's "contract" is that it guarantees the standby that is first in the sync standby names list will have any data written to it before a commit completes.
My main concern here is a race condition. If another standby received more WAL from the master than the one with sync priority, you'll have to rebuild or rewind it. No concern about transactions, though, obviously. You would have to deal with this, but I feel like you are ready for it. So let's move on.
Next, the master scores are distributed to the standbys from the current pgsql master during its recurring "monitor" action. If you want to set the highest score on the standby in sync with the best priority, whatever its lag, then should a failover happen you will have to recheck the status across the cluster, because the standbys' priority/sync might have changed in the meantime, between the last monitor action on the master and the master's failure. In short: we'll probably not be able to shortcut the election process.
If we want to pay attention to the priority/status during a failover when synchronous replication is set up, we will have to be able to find out exactly which standby had the best priority and was in sync after the master disappeared. That means the standbys would have to compare their last known status instead of their last received LSN/location.
Right now, I'm not sure this is possible with the existing stats in core. We might have to implement an extension to report this from the standbys. And even if we consider this option, I'm not sure the information is available on the standby side, though :/
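For reference, what a standby can report about itself today is its last received/replayed position, not whether it was the synchronous standby at the moment the master vanished (that state only lives in pg_stat_replication on the, now failed, master):

```sql
-- Run on a standby. Function names are for pg >= 10; on older releases use
-- pg_last_xlog_receive_location() and pg_last_xlog_replay_location().
SELECT pg_last_wal_receive_lsn() AS last_received,
       pg_last_wal_replay_lsn()  AS last_replayed;
```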
Discussion around this issue is appreciated, so do not hesitate if you have more technical ideas about it. Moreover, if you have some time to dig into how we could gather the last known status from the standbys themselves with precision, do not hesitate to share your findings here. I might be able to find some limited time next week to experiment a little, though.
PAF Version: 2.2.0-2, 3 node cluster, sync commit, fixed priority
First, I've implemented sync commit with the intention of reducing or eliminating data loss on master failover to a standby.
I implemented a fixed priority for my standbys (pghost1, pghost2, pghost3), thinking that would make failover a bit more reliable: one standby would always be the most current and become the master, and the second standby could then immediately catch up and become the new standby as soon as it re-connected to the new master. IOW, a failover would lead to only a few seconds where the cluster did not have a functional standby. That is pretty much the case from my testing so far with PAF and sync commit, and it's almost a thing of beauty. Not to digress too far, but I'm pretty sure it's conceptually feasible to have this same behavior without prioritized commit; I'm just trying things to see what actually works, and so far this seems to behave better than plain sync commit with postgres picking which standby to write to arbitrarily. I really think it all comes down to how PAF picks the standby and the timing of that decision (yeah, I know distributed decisions are hard because facts change).
However, I've noticed something in the logging that concerns me. It turns out this happens with or without sync commit: PAF is apparently adjusting resource scores based purely on lag.
Below, the master is pghost1:
It's not yet clear to me that PAF is compatible with prioritized sync write without avoidable data loss (e.g. picking a server that reports lower lag but is in fact farther behind). It seems to me that if I want to minimize data loss, the score should be based on replication state (pg_last_xlog_receive_location?). Please advise. I'm not certain I am right about this; I'm mostly reacting to the log messages above and, of course, to my desire to minimize data loss on failover.