bolthole / zrep

ZREP ZFS based replication and failover script from bolthole.com
Other
251 stars 57 forks source link

WIP: refresh use local snapshot history #157

Closed wxcafe closed 4 years ago

wxcafe commented 4 years ago

rough first draft, untested: zrep_refresh() sends the last snapshot it has as an argument to _refreshpull, and that's used as the source of the incremental send

ppbrown commented 4 years ago

Hi again,

not sure what you are trying to achieve here. You may be confused as to what refreshpull does. And even if you do understand it, I think you're going about things in an unsafe manner.

In my opinion, "lastsent" should always be checked only on the master. zrep_refresh is pulling from the "slave" side. So it isnt safe for it to trust its own "lastsent" copied value. by definition, it is out of date.

Point #1: refreshpull already checks for "last sent" snap on the master.

Point #2: Someone may have chosen to roll back the "lastsent" snap pointer on the master. Dont want the slave to then ignore that.

Master values should always take precedence over slave values.

On Mon, Apr 27, 2020 at 12:17 PM wxcafé notifications@github.com wrote:

rough first draft, untested: zrep_refresh() sends the last snapshot it has as an argument to _refreshpull, and that's used as the source of the incremental send

You can view, comment on, or merge this pull request online at:

https://github.com/bolthole/zrep/pull/157 Commit Summary

  • WIP: refresh use local snapshot history

File Changes

Patch Links:

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bolthole/zrep/pull/157, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANEV6MJTS4KW3SRDXVGNOTROXK5PANCNFSM4MSGCZAQ .

wxcafe commented 4 years ago

Well so

Let's say the primary server is the one currently in prod, that I'm taking a backup of, right? The replica server is the one that's calling zrep refresh, where the backups reside

As I state in my issue #155, if there's an error during a zrep refresh run, not on the primary (that's running zfs send) but on the replica (a receive error), the primary will mark the latest snapshot as sucessfully sent (and it /was/, technically, I guess, it just wasn't successfully received). So next time the replica runs zrep refresh, it'll do _refreshpull on the primary, and the primary will try to send an incremental diff between the previous snapshot that it thinks was successfully sent (i.e. the one that the replica didn't actually sucessfully receive) and the snapshot it just took in the refreshpull call. That will fail, and then no subsequent zrep refresh runs will complete because they'll all try to run from that last snapshot that the primary thinks was synced (but wasn't)

What I'm trying to do here is have the replica specify which snapshot it has, so that the primary generates an incremental diff between that and the snapshot it just did. That prevents the situation I just described where zrep refresh always fail

ppbrown commented 4 years ago

I think a better way to approach this problem is to put in some kind of additional check so that the lastsent property, is only set when it is ACTUALLY received fully.

possibly the most appropriate way to do this, is to remove the 'zfs set' from _refreshpull, and then zrep_refresh() would do an explicit extra ssh call when the receive finishes. As a side effect, I would implement "zrep setlastsent", which may have additional uses for people.

Thoughts?

On Mon, Apr 27, 2020 at 2:07 PM wxcafé notifications@github.com wrote:

Well so

Let's say the primary server is the one currently in prod, that I'm taking a backup of, right? The replica server is the one that's calling zrep refresh, where the backups reside

As I state in my issue #155, if there's an error during a zrep refresh run, not on the primary (that's running zfs send) but on the replica (a receive error), the primary will mark the latest snapshot as sucessfully sent (and it /was/, technically, I guess, it just wasn't successfully received). So next time the replica runs zrep refresh, it'll do _refreshpull on the primary, and the primary will try to send an incremental diff between the previous snapshot that it thinks was successfully sent (i.e. the one that the replica didn't actually sucessfully receive) and the snapshot it just took in the refreshpull call. That will fail, and then no subsequent zrep refresh runs will complete because they'll all try to run from that last snapshot that the primary thinks was synced (but wasn't)

What I'm trying to do here is have the replica specify which snapshot it has, so that the primary generates an incremental diff between that and the snapshot it just did. That prevents the situation I just described where zrep refresh always fail

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

wxcafe commented 4 years ago

well, that was my other suggestion in #155 so that sounds good to me too, this just sounded easier so I did a rough draft to see how it'd work :)

ppbrown commented 4 years ago

Please try out the latest

( 8496e27 )

wxcafe commented 4 years ago

It's always a bit hard to check things like that, that happen when a random receive error is triggered, but I induced one by killing zfs recv mid-receive and it seems to work well. Thanks a lot for this fix, for your very quick replies, and for the script in general!

ppbrown commented 4 years ago

Hopefully you verified that it also works normally for you as well? :)

On Mon, Apr 27, 2020 at 5:54 PM wxcafé notifications@github.com wrote:

Closed #157 https://github.com/bolthole/zrep/pull/157.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bolthole/zrep/pull/157#event-3278139832, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANEV6N54MJH6BIZ5U7CE23ROYSOBANCNFSM4MSGCZAQ .

wxcafe commented 4 years ago

Haha yes, I did that too :) Thanks

ppbrown commented 4 years ago

Thank you! :)

On Mon, Apr 27, 2020 at 6:44 PM wxcafé notifications@github.com wrote:

Haha yes, I did that too :) Thanks

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.