allanjude / zxfer

A continuation of development on zxfer, a popular script for managing ZFS snapshot replication
BSD 2-Clause "Simplified" License

Replication fails local-to-local #42

Closed kpfleming closed 5 years ago

kpfleming commented 6 years ago

I have zxfer running on a FreeNAS 11.1 machine, successfully pushing and pulling snapshots to and from a remote FreeBSD 11.1 machine. There are 8 datasets being replicated, with typical weekly/monthly/yearly snapshots being created and replicated (and removed once they are no longer present on the source).

I added an external drive to the FreeNAS machine, created a zpool on it, and created a dataset to hold another copy of all these datasets and snapshots. Using zxfer, I had no trouble replicating the datasets and their snapshots to the new pool. However, when I run the script a second time, expecting that only new snapshots will be transferred (or run it right after a successful run, when there are no snapshots to transfer at all), it fails, because it's trying to transfer all of the snapshots over again. The error from zfs receive is

Sending main/net17-backup/root/default@weekly-2017-12-30_05.17.00--1m to exthd/net17-backup/root/default.
cannot receive new filesystem stream: destination has snapshots (eg. exthd/net17-backup/root/default@daily-2017-12-31_03.50.00--1w)
must destroy them to overwrite it

However, a zfs list shows that the named snapshot is already present in the exthd dataset, so I don't understand why zxfer is trying to transfer it again. It appears that zxfer is unable to determine which datasets/snapshots are already present in the target dataset.

For what it's worth, I'm running this as root in order to avoid any permissions issues.

Any clues as to what I can do to figure out why this is failing?

allanjude commented 6 years ago

Do you actually have a common snapshot? Make sure there is a snapshot that is the same on both sides, otherwise it is not possible to do incremental replication.

Another thing to check: are the snapshots actually the same snapshot, or different snapshots that happen to share a name? Make sure your snapshotting tool is not creating (empty) snapshots on the external drive with the same naming convention.

You can tell by doing: zfs get guid dataset/name@snapshot

If the GUID is not the same for two snapshots with the same name, then they are actually different snapshots, and you might need to adjust the configuration of your snapshot creation tool.
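The comparison can be scripted. In this sketch the two values are hard-coded stand-ins for the output of `zfs get -H -o value guid` on each side, so the logic is runnable anywhere:

```shell
#!/bin/sh
# Sketch of the GUID check. In real use, each value would come from
#   zfs get -H -o value guid pool/dataset@snapshot
# run against the source and destination datasets; the values below are
# hard-coded samples so the comparison itself is visible.
src_guid="12649837480595057350"
dst_guid="12649837480595057350"

if [ "$src_guid" = "$dst_guid" ]; then
    echo "same snapshot: usable as an incremental base"
else
    echo "different snapshots that merely share a name"
fi
```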

kpfleming commented 6 years ago

The external drive was empty before this process was started, I created a pool and dataset on it, and the only thing that has written to it is zxfer. The periodic snapshotting tasks in FreeNAS aren't even aware of the new dataset, so there are no 'extra' snapshots there.

I've actually wiped the drive and recreated the pool/dataset twice, to be certain I hadn't done anything incorrectly, and I get the same result each time.

kpfleming commented 6 years ago

Here's some hopefully useful information:

Source pool:

root@nas:~ # zfs list -Hr -t all -o name,guid main/jails
main/jails 9940571956791302849
main/jails@auto-20171230.0100-4w 12649837480595057350
main/jails@auto-20171231.0100-1y 9079358126266723839
main/jails@auto-20180103.0100-1w 16606647760182677267
main/jails@auto-20180104.0100-1w 9593515035979310257
main/jails@auto-20180105.0100-1w 2261898651655459327
main/jails@auto-20180106.0100-4w 16363371700835728568
main/jails@auto-20180108.0100-1w 1532057881464579405
main/jails@auto-20180109.0100-1w 3245831117032508532

Destination pool:

root@nas:~ # zfs list -Hr -t all -o name,guid exthd/jails
exthd/jails 10332811084693113839
exthd/jails@auto-20171230.0100-4w 12649837480595057350
exthd/jails@auto-20171231.0100-1y 9079358126266723839
exthd/jails@auto-20180101.0100-1w 37215645594354199
exthd/jails@auto-20180102.0100-1w 16759178979886751373
exthd/jails@auto-20180103.0100-1w 16606647760182677267
exthd/jails@auto-20180104.0100-1w 9593515035979310257
exthd/jails@auto-20180105.0100-1w 2261898651655459327
exthd/jails@auto-20180106.0100-4w 16363371700835728568

Attempt to replicate:

root@nas:~ # /mnt/main/backups/zxfer -dFPv -N main/jails exthd
Sending main/jails@auto-20171230.0100-4w to exthd/jails.
cannot receive new filesystem stream: destination has snapshots (eg. exthd/jails@auto-20180101.0100-1w)
must destroy them to overwrite it
warning: cannot send 'main/jails@auto-20171230.0100-4w': signal received
Error when zfs send/receiving.

The source pool has some snapshots that haven't been replicated yet, and the destination pool has some outdated snapshots which haven't been removed yet, but the snapshots they have in common have identical names and GUIDs. In spite of that, an attempt to do a replication results in zxfer trying to send a snapshot that is already present.
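For reference, the newest common snapshot can be found mechanically by intersecting the two listings on GUID rather than name. A sketch in plain sh/awk, using the listings above as canned data (this is illustrative, not how zxfer does it):

```shell
#!/bin/sh
# Sketch: find the newest snapshot common to source and destination by GUID,
# using the `zfs list -Hr -t all -o name,guid` output from this thread as
# sample data. A correct incremental send would start from this snapshot.
tmp_src=$(mktemp); tmp_dst=$(mktemp)
cat > "$tmp_src" <<'EOF'
main/jails@auto-20171230.0100-4w 12649837480595057350
main/jails@auto-20171231.0100-1y 9079358126266723839
main/jails@auto-20180103.0100-1w 16606647760182677267
main/jails@auto-20180104.0100-1w 9593515035979310257
main/jails@auto-20180105.0100-1w 2261898651655459327
main/jails@auto-20180106.0100-4w 16363371700835728568
main/jails@auto-20180108.0100-1w 1532057881464579405
main/jails@auto-20180109.0100-1w 3245831117032508532
EOF
cat > "$tmp_dst" <<'EOF'
exthd/jails@auto-20171230.0100-4w 12649837480595057350
exthd/jails@auto-20171231.0100-1y 9079358126266723839
exthd/jails@auto-20180101.0100-1w 37215645594354199
exthd/jails@auto-20180102.0100-1w 16759178979886751373
exthd/jails@auto-20180103.0100-1w 16606647760182677267
exthd/jails@auto-20180104.0100-1w 9593515035979310257
exthd/jails@auto-20180105.0100-1w 2261898651655459327
exthd/jails@auto-20180106.0100-4w 16363371700835728568
EOF
# First pass records destination GUIDs; second pass keeps the last (newest)
# source snapshot whose GUID also exists on the destination.
newest_common=$(awk 'NR==FNR { seen[$2] = 1; next }
                     $2 in seen { last = $1 }
                     END { print last }' "$tmp_dst" "$tmp_src")
echo "$newest_common"
rm -f "$tmp_src" "$tmp_dst"
```

With the data above this prints main/jails@auto-20180106.0100-4w, confirming the two pools do share a valid incremental base.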

kpfleming commented 6 years ago

I have isolated the problem. The test on line 688, in get_zfs_list, appears to assume that because LZFS and RZFS are identical (this being a local-to-local replication task), the list of snapshots on the destination is identical to the list on the source. In my case, however, the source and destination are different pools on the same machine.

Forcing this test to always fail, so that the list of snapshots on the destination is obtained via RZFS, cures the issue, and my replication job works fine.
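For readers without the zxfer source at hand, the shape of the bug can be sketched like this. The variable names mirror the LZFS/RZFS description above, but the code is illustrative, not the actual zxfer test:

```shell
#!/bin/sh
# Illustrative sketch, not actual zxfer code. LZFS and RZFS are the commands
# used to list snapshots on the source and destination sides. On a
# local-to-local run both are plain "zfs", so a test that compares only the
# commands wrongly concludes the two snapshot lists must be identical.
# Comparing the pools as well exposes the local-to-local two-pool case.
LZFS="zfs"
RZFS="zfs"
SRC="main/jails"     # source dataset from this thread
DST="exthd/jails"    # destination dataset from this thread

src_pool=${SRC%%/*}
dst_pool=${DST%%/*}

if [ "$LZFS" = "$RZFS" ] && [ "$src_pool" = "$dst_pool" ]; then
    echo "safe to reuse the source snapshot list"
else
    echo "must list destination snapshots separately"
fi
```

Since listing the destination separately is correct in every case, simply dropping the optimization (as discussed here) is the simpler fix.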

allanjude commented 6 years ago

Thanks for figuring this out. I've never used zxfer with two pools on the same system. I just inherited this code when the original maintainer abandoned it.

I'll work up a patch for this issue, if you don't submit a pull request first.

kpfleming commented 6 years ago

This appears to be only an optimization, so I'm happy to send a PR which just removes it.