bolthole / zrep

ZREP ZFS based replication and failover script from bolthole.com

Failed init with resume fails and partial send is destroyed #149

Open darkpixel opened 4 years ago

darkpixel commented 4 years ago
root@usrbgofnas01:~# zrep -t zrep-remote init tank/pdata uswuxsdrtr01--redacted-- tank/backups/usrbgof/pdata
Setting zrep properties on tank/pdata
Creating snapshot tank/pdata@zrep-remote_000000
Sending initial replication stream to uswuxsdrtr01--redacted--:tank/backups/usrbgof/pdata
 169GiB 90:26:04 [ 705KiB/s] [ <=> ]
packet_write_wait: Connection to my.ip.add.res port 221: Broken pipe
aaron  ~  255  ssh root@usrbgofrtr01.--redacted-- -p 221

root@usrbgofnas01:~# cat /usr/local/etc/zrep.env
export SSH="ssh -p 225"
export ZREP_SEND_FLAGS="--raw"
export ZREP_RESUME=yes
export ZREP_R=-R
export ZREP_INC_FLAG=-i
export ZREP_OUTFILTER="pv -eIrabL 800K"
root@usrbgofnas01:~# source /usr/local/etc/zrep.env
root@usrbgofnas01:~# zrep -t zrep-remote init tank/pdata uswuxsdrtr01--redacted-- tank/backups/usrbgof/pdata
tank/pdata is at least partially configured by zrep
Partially complete init detected. Attempting to resume send
cannot receive incremental stream: incompatible embedded data stream feature with encrypted receive.
3.05MiB [ 395KiB/s] [ 395KiB/s] 
Error: resume send of zrep init tank/pdata failed
root@usrbgofnas01:~# 

I wasn't able to see the command it ran, but my guess based on the error message is that it was missing the --raw flag from the zfs send command.

ppbrown commented 4 years ago

Huh. Only just saw this one.

Erm.. isn't the resume thing supposed to set all the flags automatically, below the zrep level? Is this actually a ZFS bug?

darkpixel commented 4 years ago

ZFS doesn't transfer the children "in bulk". It transfers them individually and the resume flag is set for each individual transfer. So if you have:

tank
tank/virt
tank/virt/vm-100-disk-0
tank/virt/vm-100-disk-1
tank/virt/vm-101-disk-0

And it sends vm-100-disk-0 and then bombs out on vm-100-disk-1, the next time zrep runs against tank/virt it will bomb out again, because it doesn't check whether the previous transfers completed successfully and it has no idea about the remote state.
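To illustrate what checking the remote state per child could look like, here is a rough sketch. It is not zrep code: `pending_resumes` is a hypothetical helper, and `backuphost` and the dataset path in the usage comment are placeholders. It only relies on `zfs get -H` printing tab-separated fields and on `receive_resume_token` showing `-` when no partial receive is saved.

```shell
#!/bin/sh
# Hypothetical helper (not part of zrep): read "name<TAB>value" lines, as
# printed by `zfs get -H -o name,value receive_resume_token`, and print the
# names of datasets that have a saved resume token (value is not "-").
pending_resumes() {
    awk -F '\t' '$2 != "-" && $2 != "" { print $1 }'
}

# Assumed usage against the destination (placeholder host and dataset):
#   ssh backuphost zfs get -H -r -o name,value receive_resume_token tank/backups/virt \
#       | pending_resumes
```

Any dataset this prints was interrupted mid-receive and would need a resumed send (or an explicit abort of the partial state) before the next recursive run could succeed.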

Syncoid has a slightly different process. It appears to:

But this bug was more about zrep not sending the resume flag during a resume. I'm not sure if it's fixed in the latest version, but it appears when it detects an interrupted resume, it tries to fix it but fails to send the resume flag. Then when it bombs out (since the resume flag wasn't passed) the remote deletes the entire dataset.
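For reference, a resumed transfer in plain ZFS terms is driven by the token saved on the receiving side: `zfs send -t <token>` on the source, piped into `zfs receive -s` on the destination. The sketch below only prints that pipeline instead of running it. `resume_pipeline` and its arguments are hypothetical names for illustration, not what zrep actually runs.

```shell
#!/bin/sh
# Sketch only, NOT zrep's actual code. Host and dataset are placeholders.
# Prints the pipeline that would resume an interrupted receive, or fails
# if the destination has no saved partial state.
resume_pipeline() {
    token=$1   # value of receive_resume_token read from the destination
    host=$2    # hypothetical destination host
    dest=$3    # hypothetical destination dataset
    case "$token" in
        ''|-)
            echo "no resume token for $dest" >&2
            return 1
            ;;
    esac
    # send -t resumes from the token; receive -s keeps partial state
    # again if the connection drops a second time.
    echo "zfs send -t $token | ssh $host zfs receive -s $dest"
}

# Assumed usage:
#   resume_pipeline "$token" backuphost tank/backups/usrbgof/pdata
```

Failing early when no token exists avoids falling back to a fresh full send, which is the step that seems to end with the partial data on the remote being thrown away.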