JacksonWrath closed this 5 months ago
Was delayed due to bad drives and corruption, which have been resolved (#22).
I was able to replicate new snapshots without re-replicating everything. The hiccup was the new parent dataset: since it never had a snapshot of its own, TrueNAS/ZFS seemingly couldn't reconcile it with the child datasets that did have snapshots; it kept complaining that the destination had no snapshots but contained existing data.
I got past this by manually replicating the snapshots of each child dataset separately, then enabling "Replication from scratch" in the TrueNAS replication job for the parent dataset. It then replicated the effectively empty parent dataset snapshot and discovered the child datasets' snapshots. The job completed in seconds, and I confirmed the expected new data was indeed there.
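For reference, the manual per-child seeding amounts to something like the following. The pool, dataset, and snapshot names here are placeholders, not my actual layout:

```shell
# Send each child dataset's snapshot individually so the target
# has a common base snapshot before the TrueNAS job runs.
# (All names are hypothetical.)
zfs send -v tank/parent/child1@manual-base | ssh asuna zfs recv -F backup/parent/child1
zfs send -v tank/parent/child2@manual-base | ssh asuna zfs recv -F backup/parent/child2

# With the children seeded, the TrueNAS job (with "Replication from
# scratch" enabled) only has to send the near-empty parent snapshot.
```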
Additional hiccup I didn't think about: when I set up VRRP, I modified the replicated VyOS VM without changing its name/zvol. The snapshot was then replicated over that and ended up corrupting the zvol, since the VM was running (I think that's why; whatever the technical reason, it wouldn't boot anymore). Oops.
I was thankfully smart enough to back up the configs, so I just created a new uniquely-named VM on Asuna and restored the config (with the hw-ids of the interfaces updated). Then I did the same for Kirito so they don't conflict in the future, and deleted the original VMs.
Hm, okay, so the permanent corruption from (#22) persists: even though I overwrote the corrupted files with the (good as of May 5th) copies from Kirito, the snapshot replication rolled that back. Fair enough.
Normally, the only 2 [supported] paths forward would be:
There's some discussion I found theorizing that you could use zdb to find the specific corrupted blocks and dd them back to health, but I really don't feel like futzing with that. I see a potentially cleaner path...
I want to make Asuna the primary anyway. Since the data isn't actively changing, I might be able to:
That should leave everything on the same baseline snapshot still. Let's find out.
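Roughly, the idea is an incremental send back from Asuna, which only transfers the blocks that changed since the common snapshot. Dataset and snapshot names here are placeholders:

```shell
# Incremental send from the shared baseline snapshot to the one
# containing the restored files; only the delta goes over the wire.
# (All names are hypothetical.)
zfs send -v -i backup/data@baseline backup/data@restored | ssh kirito zfs recv -F tank/data
```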
Seems to have worked as expected. Job status confirmed it was replicating about 8GB of data, which matches with the corrupt files I backed up.
I stopped short of deleting the old snapshots, and kicked off a scrub on Kirito just to confirm. That takes about 10 hours to complete (unfortunately there's no way to partially scrub a pool, other than rechecking already-known errors).
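For the record, the scrub and the check afterwards are just the following (pool name is a placeholder):

```shell
# Start a full scrub of the pool; it runs in the background.
zpool scrub tank
# Check progress, and list any files with unrecoverable errors.
zpool status -v tank
```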
Scrub on Kirito is clean. Looks like that did the trick. Interestingly, though, it resolved the data errors on Asuna without my having to delete the snapshots. I'm not entirely sure why, honestly; the first time I overwrote the corrupt files, it only fixed the errors on the live dataset, not the snapshots.
I did already delete the original snapshots that were affected, and the “resurfaced” errors I mentioned earlier were on the newer snapshots that were replicated. Maybe that logic of propagating errors through new replicated snapshots is broken? That’s also speculation at best though.
Either way, I’m deleting those snapshots, just to be sure. I don’t need 3 snapshots within 6 hours of each other.
Finally, snapshots and replication are set up, along with email alerting.
Snapshots are daily, immediately replicated, and retained for 6 weeks, since scrubs occur every 5 weeks.
The VMs dataset is excluded, since the only thing still changing is VyOS and there’s already a separate instance of that on each server. All I really need there is the config files.
Post-closure update:
When I set this up, I told it to just do the whole pool, forgetting there are hidden datasets on the pool created by TrueNAS; TrueNAS doesn't exclude them, because TrueNAS isn't that smart. However, the error surfaced was the same as before: there wasn't a base incremental snapshot to work from on the main pool, and replication from scratch wasn't allowed.
In a moment of stupidity, I checked "Replication from scratch", thinking it just needed the first snapshots replicated, but I forgot to uncheck "recursive". That obliterated everything on the Kirito pool, except the VyOS VM (because it was running) and the hidden datasets (because they were in use by TrueNAS*). Oops.
So I set it up to replicate the child datasets separately, and re-copied everything. The process discovered one additional corruption, which thankfully I had a backup of, so I could restore it. I then re-did the song-and-dance of replicating each dataset separately, since the job had gotten through a few of them before hitting that.
In the end, I got everything re-replicated, on the same snapshot, and the snapshot + replication jobs finally succeeded on their own.
*(It's still up in the air whether TrueNAS will be borked when I reboot, but at worst I can just re-install, restore the config, and import the pool.)
Most of the data is still there, but the snapshot structure doesn't match; I discovered that TrueNAS puts some hidden datasets at the root of the pool, which you can't see (and, stupidly, can't filter out in the UI), but replication tasks will still try to replicate them (and fail, because they already exist on the other side).
I had to create a new child dataset on the target that contains all of my original datasets, which I did before I knew you could move/rename a dataset. I assumed those were basically immutable, like most ZFS shit tends to be. I moved the datasets on Kirito to match Asuna after I discovered this.
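The move itself turns out to be a one-liner: renaming a dataset within a pool carries its child datasets and their snapshots along with it. Names here are placeholders:

```shell
# Move a dataset under a new parent within the same pool.
# The new parent (tank/replicated) must already exist.
zfs rename tank/documents tank/replicated/documents
```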
It's unclear whether, when I replicate back, it will recognize the snapshots, or decide that because the name changed, the history is no longer shared.