Closed wxcafe closed 4 years ago
Hi again,
not sure what you are trying to achieve here. You may be confused as to what refreshpull does. And even if you do understand it, I think you're going about things in an unsafe manner.
In my opinion, "lastsent" should always be checked only on the master. zrep_refresh is pulling from the "slave" side. So it isnt safe for it to trust its own "lastsent" copied value. by definition, it is out of date.
Point #1: refreshpull already checks for "last sent" snap on the master.
Point #2: Someone may have chosen to roll back the "lastsent" snap pointer on the master. Dont want the slave to then ignore that.
Master values should always take precedence over slave values.
On Mon, Apr 27, 2020 at 12:17 PM wxcafé notifications@github.com wrote:
rough first draft, untested: zrep_refresh() sends the last snapshot it has as an argument to _refreshpull, and that's used as the source of the incremental send
You can view, comment on, or merge this pull request online at:
https://github.com/bolthole/zrep/pull/157 Commit Summary
- WIP: refresh use local snapshot history
File Changes
Patch Links:
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bolthole/zrep/pull/157, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANEV6MJTS4KW3SRDXVGNOTROXK5PANCNFSM4MSGCZAQ .
Well so
Let's say the primary server is the one currently in prod, that I'm taking a backup of, right?
The replica server is the one that's calling zrep refresh
, where the backups reside
As I state in my issue #155, if there's an error during a zrep refresh
run, not on the primary (that's running zfs send
) but on the replica (a receive error), the primary will mark the latest snapshot as sucessfully sent (and it /was/, technically, I guess, it just wasn't successfully received). So next time the replica runs zrep refresh
, it'll do _refreshpull
on the primary, and the primary will try to send an incremental diff between the previous snapshot that it thinks was successfully sent (i.e. the one that the replica didn't actually sucessfully receive) and the snapshot it just took in the refreshpull
call. That will fail, and then no subsequent zrep refresh
runs will complete because they'll all try to run from that last snapshot that the primary thinks was synced (but wasn't)
What I'm trying to do here is have the replica specify which snapshot it has, so that the primary generates an incremental diff between that and the snapshot it just did. That prevents the situation I just described where zrep refresh
always fail
I think a better way to approach this problem is to put in some kind of additional check so that the lastsent property, is only set when it is ACTUALLY received fully.
possibly the most appropriate way to do this, is to remove the 'zfs set' from _refreshpull, and then zrep_refresh() would do an explicit extra ssh call when the receive finishes. As a side effect, I would implement "zrep setlastsent", which may have additional uses for people.
Thoughts?
On Mon, Apr 27, 2020 at 2:07 PM wxcafé notifications@github.com wrote:
Well so
Let's say the primary server is the one currently in prod, that I'm taking a backup of, right? The replica server is the one that's calling zrep refresh, where the backups reside
As I state in my issue #155, if there's an error during a zrep refresh run, not on the primary (that's running zfs send) but on the replica (a receive error), the primary will mark the latest snapshot as sucessfully sent (and it /was/, technically, I guess, it just wasn't successfully received). So next time the replica runs zrep refresh, it'll do _refreshpull on the primary, and the primary will try to send an incremental diff between the previous snapshot that it thinks was successfully sent (i.e. the one that the replica didn't actually sucessfully receive) and the snapshot it just took in the refreshpull call. That will fail, and then no subsequent zrep refresh runs will complete because they'll all try to run from that last snapshot that the primary thinks was synced (but wasn't)
What I'm trying to do here is have the replica specify which snapshot it has, so that the primary generates an incremental diff between that and the snapshot it just did. That prevents the situation I just described where zrep refresh always fail
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.
well, that was my other suggestion in #155 so that sounds good to me too, this just sounded easier so I did a rough draft to see how it'd work :)
Please try out the latest
( 8496e27 )
It's always a bit hard to check things like that, that happen when a random receive error is triggered, but I induced one by killing zfs recv
mid-receive and it seems to work well. Thanks a lot for this fix, for your very quick replies, and for the script in general!
Hopefully you verified that it also works normally for you as well? :)
On Mon, Apr 27, 2020 at 5:54 PM wxcafé notifications@github.com wrote:
Closed #157 https://github.com/bolthole/zrep/pull/157.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/bolthole/zrep/pull/157#event-3278139832, or unsubscribe https://github.com/notifications/unsubscribe-auth/AANEV6N54MJH6BIZ5U7CE23ROYSOBANCNFSM4MSGCZAQ .
Haha yes, I did that too :) Thanks
Thank you! :)
On Mon, Apr 27, 2020 at 6:44 PM wxcafé notifications@github.com wrote:
Haha yes, I did that too :) Thanks
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.
rough first draft, untested:
zrep_refresh()
sends the last snapshot it has as an argument to_refreshpull
, and that's used as the source of the incremental send