allanjude / zxfer

A continuation of development on zxfer, a popular script for managing ZFS snapshot replication
BSD 2-Clause "Simplified" License

zxfer doesn't transfer old snapshots #29

Open DarwinAwardWinner opened 8 years ago

DarwinAwardWinner commented 8 years ago

I've been testing out zxfer as a way to back up the complete state of all filesystems, including all their snapshots, to an external hard drive. However, zxfer doesn't seem to be transferring old snapshots. It seems to have transferred all snapshots on the filesystem that I specified, but on the child filesystems I think it's only transferring snapshots that were made after the first time I ran zxfer. I am running zxfer as:

zxfer -dFkPv -R rpool/fs backuppool/pools

As it's running, it only mentions snapshots that were created/destroyed since the previous run, and doesn't mention the earlier snapshots at all. To count the snapshots on each filesystem, I use:

# Function to count the snapshots on each filesystem
countsnaps () {
    zfs list -r "$1" -H -o name | while read -r fsname; do
        # -H -o name prints bare snapshot names; grep -c counts only this filesystem's own snapshots
        snapcount=$(zfs list -r "$fsname" -H -o name -t snapshot | grep -c "^$fsname@")
        echo "$fsname: $snapcount"
    done
}

Counting the snapshots on the source (rpool/fs) and destination (backuppool/pools/fs):

$ countsnaps rpool/fs
rpool/fs: 53
rpool/fs/TimeCapsule: 40
rpool/fs/home: 68
rpool/fs/home/ryan: 68
rpool/fs/home/ryan/Downloads: 40
rpool/fs/home/ryan/syncthing: 40
rpool/fs/mneme-root: 61
rpool/fs/opt: 65
rpool/fs/transmission-daemon: 40
$ countsnaps backuppool/pools/fs
backuppool/pools/fs: 53
backuppool/pools/fs/TimeCapsule: 15
backuppool/pools/fs/home: 15
backuppool/pools/fs/home/ryan: 15
backuppool/pools/fs/home/ryan/Downloads: 15
backuppool/pools/fs/home/ryan/syncthing: 15
backuppool/pools/fs/mneme-root: 15
backuppool/pools/fs/opt: 15
backuppool/pools/fs/transmission-daemon: 15

As you can see, all snapshots for rpool/fs are backed up, but only 15 snapshots from each of its child filesystems are backed up, despite the fact that they all have many more. For example, I'll list only the snapshots for rpool/fs/home and its destination backuppool/pools/fs/home using the following function:

listsnaps () {
    fsname="$1"
    zfs list -r "$fsname" -H -o name -t snapshot | grep "${fsname}@"
}
$ listsnaps rpool/fs/home
rpool/fs/home@zfsnap-monthly-2015-02-01_14.52.00--2y
rpool/fs/home@zfsnap-monthly-2015-04-01_13.52.00--2y
rpool/fs/home@zfsnap-monthly-2015-05-01_13.52.00--2y
rpool/fs/home@zfsnap-monthly-2015-06-01_13.52.00--2y
rpool/fs/home@zfsnap-monthly-2015-07-01_13.52.00--2y
rpool/fs/home@zfsnap-monthly-2015-09-01_13.53.00--2y
rpool/fs/home@zfsnap-monthly-2015-09-01_13.56.00--2y
rpool/fs/home@zfsnap-monthly-2015-10-01_13.52.00--2y
rpool/fs/home@zfsnap-monthly-2015-10-01_13.53.00--2y
rpool/fs/home@zfsnap-monthly-2015-11-01_14.52.00--2y
rpool/fs/home@zfsnap-monthly-2015-12-01_14.52.00--2y
rpool/fs/home@zfsnap-monthly-2016-01-01_14.52.00--2y
rpool/fs/home@zfsnap-monthly-2016-01-01_14.55.00--2y
rpool/fs/home@zfsnap-monthly-2016-01-15_23.39.00--2y
rpool/fs/home@zfsnap-yearly-2016-01-15_23.39.00--10y
rpool/fs/home@zfsnap-monthly-2016-02-16_00.17.00--2y
rpool/fs/home@zfsnap-monthly-2016-03-16_01.17.00--2y
rpool/fs/home@zfsnap-monthly-2016-04-16_02.17.00--2y
rpool/fs/home@zfsnap-monthly-2016-05-16_03.17.00--2y
rpool/fs/home@zfsnap-weekly-2016-05-19_07.17.00--2m
rpool/fs/home@zfsnap-weekly-2016-05-26_07.21.00--2m
rpool/fs/home@zfsnap-weekly-2016-06-02_08.17.00--2m
rpool/fs/home@zfsnap-weekly-2016-06-09_09.17.00--2m
rpool/fs/home@zfsnap-monthly-2016-06-16_04.17.00--2y
rpool/fs/home@zfsnap-weekly-2016-06-16_10.17.00--2m
rpool/fs/home@zfsnap-yearly-2016-06-19_00.17.00--10y
rpool/fs/home@zfsnap-weekly-2016-06-23_11.17.00--1m
rpool/fs/home@zfsnap-weekly-2016-06-30_12.17.00--1m
rpool/fs/home@zfsnap-weekly-2016-07-07_13.17.00--1m
rpool/fs/home@zfsnap-daily-2016-07-09_20.17.00--1w
rpool/fs/home@zfsnap-daily-2016-07-10_21.17.00--1w
rpool/fs/home@zfsnap-daily-2016-07-11_22.17.00--1w
rpool/fs/home@zfsnap-daily-2016-07-12_23.17.00--1w
rpool/fs/home@zfsnap-daily-2016-07-14_00.17.00--1w
rpool/fs/home@zfsnap-weekly-2016-07-14_14.17.00--1m
rpool/fs/home@zfsnap-daily-2016-07-15_01.17.00--1w
rpool/fs/home@zfsnap-hourly-2016-07-15_09.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_10.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_11.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_12.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_13.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_14.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_15.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_16.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_17.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_18.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_19.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_20.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_21.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_22.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-15_23.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-16_00.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-16_01.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-16_02.17.00--1d
rpool/fs/home@zfsnap-daily-2016-07-16_02.17.00--1w
rpool/fs/home@zfsnap-hourly-2016-07-16_03.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-16_04.17.00--1d
rpool/fs/home@zfsnap-hourly-2016-07-16_05.17.00--1d
rpool/fs/home@zfsnap-monthly-2016-07-16_05.17.00--1y
rpool/fs/home@zfsnap-hourly-2016-07-16_06.17.00--1d
rpool/fs/home@zfsnap-frequent-2016-07-16_07.15.00--90M
rpool/fs/home@zfsnap-hourly-2016-07-16_07.17.00--1d
rpool/fs/home@zfsnap-frequent-2016-07-16_07.30.00--90M
rpool/fs/home@zfsnap-frequent-2016-07-16_07.45.00--90M
rpool/fs/home@zfsnap-frequent-2016-07-16_08.00.00--90M
rpool/fs/home@zfsnap-frequent-2016-07-16_08.15.00--90M
rpool/fs/home@zfsnap-hourly-2016-07-16_08.17.00--1d
rpool/fs/home@zfsnap-frequent-2016-07-16_08.30.00--90M
$ listsnaps backuppool/pools/fs/home
backuppool/pools/fs/home@zfsnap-hourly-2016-07-16_02.17.00--1d
backuppool/pools/fs/home@zfsnap-daily-2016-07-16_02.17.00--1w
backuppool/pools/fs/home@zfsnap-hourly-2016-07-16_03.17.00--1d
backuppool/pools/fs/home@zfsnap-hourly-2016-07-16_04.17.00--1d
backuppool/pools/fs/home@zfsnap-hourly-2016-07-16_05.17.00--1d
backuppool/pools/fs/home@zfsnap-monthly-2016-07-16_05.17.00--1y
backuppool/pools/fs/home@zfsnap-hourly-2016-07-16_06.17.00--1d
backuppool/pools/fs/home@zfsnap-frequent-2016-07-16_07.15.00--90M
backuppool/pools/fs/home@zfsnap-hourly-2016-07-16_07.17.00--1d
backuppool/pools/fs/home@zfsnap-frequent-2016-07-16_07.30.00--90M
backuppool/pools/fs/home@zfsnap-frequent-2016-07-16_07.45.00--90M
backuppool/pools/fs/home@zfsnap-frequent-2016-07-16_08.00.00--90M
backuppool/pools/fs/home@zfsnap-frequent-2016-07-16_08.15.00--90M
backuppool/pools/fs/home@zfsnap-hourly-2016-07-16_08.17.00--1d
backuppool/pools/fs/home@zfsnap-frequent-2016-07-16_08.30.00--90M

The destination copy only has snapshots created at 2:17 AM on 2016-07-16 or later, which I believe is all the snapshots made after the first time I ran the zxfer command mentioned above. Am I missing something about how to get zxfer to transfer old snapshots, or is this a bug?
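For anyone wanting to reproduce the comparison above, here is a minimal sketch of a helper that prints the snapshots present on the source but absent from the destination. The name missing_snaps is hypothetical (it is not part of zxfer), and it assumes each input file lists the snapshots of a single filesystem, one full name per line, as the listsnaps function above produces:

```shell
# Sketch only: missing_snaps is a hypothetical helper, not part of zxfer.
# $1: file with source snapshot names (pool/fs@snap), one per line
# $2: file with destination snapshot names, one per line
missing_snaps () {
    # The destination file is read first so its short names (the part after
    # "@") can be remembered; source lines whose short name was not seen on
    # the destination are then printed in their original order.
    awk -F@ '
        FILENAME == ARGV[1] { have[$2] = 1; next }
        !($2 in have)       { print $2 }
    ' "$2" "$1"
}
```

Saving the output of `listsnaps rpool/fs/home` and `listsnaps backuppool/pools/fs/home` to two files and passing them to missing_snaps in that order should print the older snapshots that never reached the destination.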

allanjude commented 8 years ago

It depends on your before/after state.

In ZFS it is not possible to 'backfill' snapshots. You cannot transfer a snapshot that is older than the current snapshots on the destination. So on a second run, it can only work forward from where it was last time; it cannot go backwards.

When you do the very first replication, zxfer transfers the oldest snapshot, and then works forward from there.
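To illustrate the forward-only rule, here is a rough sketch that, given a source snapshot list in creation order and the newest snapshot already on the destination, prints the snapshots that can still be sent. The helper name snaps_after is hypothetical and this is not how zxfer itself is implemented; it only shows that anything older than the common snapshot is out of reach:

```shell
# Sketch only: snaps_after is a hypothetical helper, not zxfer's own logic.
# stdin: source snapshot names (pool/fs@snap), oldest first, for example from
#        zfs list -H -o name -t snapshot -s creation <filesystem>
# $1:    short name (the part after "@") of the newest destination snapshot
snaps_after () {
    awk -F@ -v base="$1" '
        found      { print $2 }    # strictly newer than the common snapshot
        $2 == base { found = 1 }   # the common snapshot itself is not re-sent
    '
}
```

If the given snapshot is not found on the source at all, nothing is printed, which mirrors the case where no common snapshot exists and an incremental transfer cannot start.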

You might need to check the 'grandfather' setting, since you do have some quite old snapshots.

So it seems it replicates all of the snapshots of the parent dataset, but then only some for the children? I have not seen this behaviour in my production use of zxfer.

This might be a bug, but more detail will need to be figured out to identify where the bug may be.

DarwinAwardWinner commented 8 years ago

Maybe the first run of zxfer died before it could transfer all the older snapshots, and somehow this prevents subsequent runs from also transferring those snapshots? I'll try starting from scratch and see if it happens again.

DarwinAwardWinner commented 8 years ago

I have a guess as to what might be happening. I have a cronjob that creates a "frequent" snapshot every 15 minutes, and then deletes all but the last 6 such snapshots. If this cronjob runs while zxfer is running, zxfer will fail to find the snapshot that gets deleted, and will crash after syncing all snapshots up to 90 minutes ago. When I run it the next time, it will only look at the root filesystem to determine which snapshots need to be synced. Since all snapshots up to 90 minutes ago are already present for the root fs, it assumes without checking that they are also present for all the child filesystems, which is not true because the previous run crashed before transferring the child filesystems.

So, that's my guess as to what's happening. Assuming I'm right, there's a couple of things that could potentially mitigate this issue:

  1. Have an option to skip snapshots that have disappeared since zxfer started running without crashing.
  2. Check for snapshots recursively instead of just checking the root.
  3. On my end, I can use GNU parallel's "sem" command to make sure that snapshot deletion never occurs while zxfer is running.
  4. As a hack to fix this once it's already happened, I think I can delete all the snapshots on the destination root fs (which is empty, only the children have data on them). This will convince zxfer to check for new snapshots in all child filesystems.
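As an illustration of option 3, here is a minimal sketch of a lock wrapper that could be used in place of GNU parallel's "sem". The with_lock name and the lock path are hypothetical; the idea is simply that the zxfer job and the snapshot-deletion job both take the same lock, so they never overlap:

```shell
# Sketch only: with_lock and the lock directory path are hypothetical.
# $1:   path of the lock directory shared by all cooperating jobs
# rest: the command to run while holding the lock
with_lock () {
    lockdir="$1"; shift
    # mkdir is atomic: it fails while another process holds the lock,
    # so keep retrying until the directory can be created.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    "$@"
    status=$?
    rmdir "$lockdir"      # release the lock
    return "$status"      # pass the wrapped command's exit status through
}
```

Both cron jobs would then be wrapped the same way, for example `with_lock /var/run/zxfer.lock zxfer -dFkPv -R rpool/fs backuppool/pools`. Note that a job killed while holding the lock leaves the directory behind, so a stale lock has to be removed by hand in this simple form.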

DarwinAwardWinner commented 8 years ago

Another possibility: transfer snapshots of child filesystems first.

DarwinAwardWinner commented 8 years ago

For what it's worth, I think you could probably reproduce this bug by first running a zxfer in non-recursive mode and then running again in recursive mode.

SnapshotCiTy commented 3 years ago

I am seeing a strange behaviour that could be coming from the same problem as in this post:

I am replicating an entire dataset and its children (Backup0001/ABCH029/Global) to a remote machine (Pool) with:

zxfer -vdF -T "-p60220 root@abch043.dlsa.ch" -R Backup0001/ABCH029/Global Pool

Most filesystems get replicated with all their snapshots. But some, at the destination, only have a few of the latest snapshots.

Source snapshots for filesystem Backup0001/ABCH029/Global/Storage:

Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-23-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-24-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-25-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-26-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-27-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-28-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-29-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-30-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-10-31-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-01-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-02-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-03-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-04-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-05-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-06-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-07-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-08-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-09-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-10-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-11-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-12-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-13-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-14-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-15-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-16-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-17-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-18-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-19-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-20-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-21-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-22-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-23-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-24-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-25-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-26-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-27-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-28-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-29-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-11-30-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-01-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-02-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-03-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-04-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-05-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-06-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-07-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-08-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-09-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-10-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-11-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-12-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-13-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-14-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-15-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-16-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-17-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-18-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-19-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-20-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_daily-2020-12-21-23h15 
Backup0001/ABCH029/Global/Storage@zfs-auto-snap_hourly-2020-12-22-09h00

Snapshots at the destination after running the zxfer command several times (no errors):

Pool/Global/Storage@zfs-auto-snap_daily-2020-12-19-23h15 
Pool/Global/Storage@zfs-auto-snap_daily-2020-12-20-23h15 
Pool/Global/Storage@zfs-auto-snap_daily-2020-12-21-23h15 
Pool/Global/Storage@zfs-auto-snap_hourly-2020-12-22-09h00

allanjude commented 3 years ago

> I am seeing a strange behaviour that could be coming from the same problem as in this post:

In the past, I've seen issues like this when a snapshot with the same name is being created on the destination.

Is zfs-auto-snapshot running on the Pool/Global machine too?

Try:

zfs list -o name,guid -t snapshot -r Backup0001/ABCH029/Global/Storage
zfs list -o name,guid -t snapshot -r Pool/Global/Storage

The GUIDs should be the same. If any are different, those snapshots were likely created on the destination and have munged things up on you.

You should not create snapshots on the destination of the zfs replication.
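The GUID check can be sketched as a small script rather than done by eye. The helper name guid_mismatches is hypothetical, and it assumes each input file holds the output of the commands above run with an added -H flag (tab-separated, no header) for a single filesystem without children:

```shell
# Sketch only: guid_mismatches is a hypothetical helper.
# $1: file with "name<TAB>guid" lines from the source
# $2: file with "name<TAB>guid" lines from the destination
guid_mismatches () {
    awk -F'\t' '
        { split($1, parts, "@"); snap = parts[2] }     # short name after "@"
        FILENAME == ARGV[1] { src_guid[snap] = $2; next }
        # Appending "" forces a string comparison, because 64-bit GUIDs do
        # not fit exactly in the floating-point numbers awk would otherwise use.
        (snap in src_guid) && (src_guid[snap] "") != ($2 "") { print snap }
    ' "$1" "$2"
}
```

Any snapshot name it prints exists on both sides with different GUIDs, which would point to a snapshot that was created independently on the destination.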