ewwhite / zfs-ha

ZFS High-Availability NAS

Best practices: unit tests, replication, upgrades, and backups #28

Open sonaro opened 5 years ago

sonaro commented 5 years ago

A few questions came up during a discussion about how to upgrade the software running on the controllers in a way that ensures compatibility and consistency over time. This snowballed into questions about various edge conditions and the best practices for handling them. Some of these edge conditions are unlikely to be encountered, or are only a concern for a system under much higher load or with more critical requirements, but they are listed for completeness.

Unit / stress tests: Has anyone created unit / stress tests to verify end-to-end data consistency under various failover conditions? As an example, consistency verification when a controller hard powers off or crashes without cleanly exporting the pool while write operations are in flight.
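As a starting point, something like the following rough sketch is what I had in mind: keep writing files whose names embed their SHA-256 over NFS, hard power-off the active controller mid-run, then re-verify every surviving file after failover. The mount point /mnt/nfs/stress and the file sizes are placeholders.

```python
#!/usr/bin/env python3
# Consistency stress-test sketch. Assumptions: the HA share is mounted at
# /mnt/nfs/stress on a client, and the failover (hard power-off of the
# active controller) is triggered externally while the write phase runs.
import hashlib
import os
import sys

TARGET = "/mnt/nfs/stress"  # hypothetical NFS mount of the shared pool

def write_phase(count=10000, size=64 * 1024):
    """Write files whose names embed the SHA-256 of their contents."""
    for i in range(count):
        data = os.urandom(size)
        digest = hashlib.sha256(data).hexdigest()
        with open(os.path.join(TARGET, f"{i:06d}-{digest}"), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # only fsync'd writes are guaranteed to survive a crash

def verify_phase():
    """After failover, re-hash every surviving file and compare it to its name."""
    bad = 0
    for name in os.listdir(TARGET):
        with open(os.path.join(TARGET, name), "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != name.split("-", 1)[1]:
                bad += 1
    print(f"corrupt files: {bad}")
    return bad

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "write":
        write_phase()
    else:
        sys.exit(1 if verify_phase() else 0)
```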

Replication: Any best practices for the most efficient replication between the two shelves in the (2 node + 2 shelf) configuration? What happens if there is a failure during replication?
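For context, what I imagine here is incremental zfs send/receive with resumable streams, roughly like the sketch below. It assumes each shelf backs its own pool, hypothetically named shelf1 and shelf2, each with a "data" dataset; the resume token lets an interrupted transfer pick up where it left off.

```python
#!/usr/bin/env python3
# Incremental, resumable replication sketch. Assumptions: pools "shelf1" and
# "shelf2" with a "data" dataset each; snapshots already exist on the source.
import subprocess

SRC, DST = "shelf1/data", "shelf2/data"

def zfs_get(prop, dataset):
    return subprocess.run(["zfs", "get", "-H", "-o", "value", prop, dataset],
                          check=True, capture_output=True, text=True).stdout.strip()

def replicate(prev_snap, new_snap):
    """Pipe an incremental stream into a resumable receive (-s)."""
    send = subprocess.Popen(
        ["zfs", "send", "-i", f"{SRC}@{prev_snap}", f"{SRC}@{new_snap}"],
        stdout=subprocess.PIPE)
    recv = subprocess.run(["zfs", "receive", "-s", "-F", DST], stdin=send.stdout)
    send.wait()
    if send.returncode or recv.returncode:
        # An interrupted receive leaves a resume token on the target instead of
        # a half-applied dataset, so a retry can pick up where it stopped.
        raise RuntimeError("replication interrupted; resume token saved on target")

def resume_if_needed():
    """Restart a previously interrupted receive from its saved token."""
    token = zfs_get("receive_resume_token", DST)
    if token != "-":
        send = subprocess.Popen(["zfs", "send", "-t", token], stdout=subprocess.PIPE)
        subprocess.run(["zfs", "receive", "-s", DST], stdin=send.stdout, check=True)
        send.wait()
```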

Software upgrades: Any best practices for software upgrades on the controllers? For instance: (1) upgrade both controllers at the same time, (2) upgrade the hot standby and test it for a period of time, or (3) run a completely independent backup stack (2 active replicating nodes + 1 backup node) that is read-only and test updates on that second stack first? Any known gotchas with new ZFS versions on the stable branch, or best practices to mitigate possible version conflicts? For example: the backup and/or hot-standby stack receives updates, and then an error (silent or reported) occurs during replication via zfs send. A unit / stress test might help catch these kinds of issues before they reach the active controller.
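One low-tech mitigation I can think of is a pre-upgrade check that warns if the two controllers run different ZFS releases, and that lists the pool's enabled feature flags so nobody runs zpool upgrade before both nodes understand them. A sketch, where the hostname standby-node and the pool name tank are assumptions:

```python
#!/usr/bin/env python3
# Pre-upgrade sanity-check sketch. Assumptions: the standby controller is
# reachable over ssh as "standby-node", the pool is named "tank", and both
# hosts run an OpenZFS release new enough to support `zfs version`.
import subprocess

def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

local_ver = run(["zfs", "version"]).strip()
remote_ver = run(["ssh", "standby-node", "zfs", "version"]).strip()
if local_ver != remote_ver:
    print("WARNING: ZFS versions differ between controllers")
    print(f"  active : {local_ver}")
    print(f"  standby: {remote_ver}")

# List pool features that are already enabled or active. Holding off on
# `zpool upgrade` until both controllers understand every listed feature
# avoids a standby that can no longer import the pool after failover.
for line in run(["zpool", "get", "all", "tank"]).splitlines():
    if "feature@" in line and ("enabled" in line or "active" in line):
        print(line)
```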

Isolated Backups / Fast Restoration: Any best practices for separate, isolated backups (zfs send, out-of-band, etc.) running on a separate stack that is immune to any kind of stack-specific inconsistency or corruption? Ultimately, could we manually make this third stack adopt the floating virtual IP, make it read-write, and take over as master, then rebuild/repair the original master and reassign it as the backup? Is there possibly a better way to handle systemic issues?
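To make the question concrete, the manual promotion I have in mind would look roughly like the sketch below. The pool name backup, the VIP 192.168.10.50/24, and the interface eth0 are all placeholders, and a real deployment would presumably drive this through the cluster manager rather than raw commands.

```python
#!/usr/bin/env python3
# Manual-promotion sketch for the isolated backup stack. All names are
# placeholders: pool "backup", VIP 192.168.10.50/24, interface eth0.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Stop any in-flight replication receives first (left to the operator),
#    otherwise they would conflict with the read-write flip below.

# 2. Make the received datasets writable.
run(["zfs", "set", "readonly=off", "backup"])

# 3. Re-export the shares and adopt the floating virtual IP so clients follow.
run(["exportfs", "-ra"])
run(["ip", "addr", "add", "192.168.10.50/24", "dev", "eth0"])
```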

Other Edge Conditions: Are there other edge conditions not covered here that we should keep in mind while pursuing five-nines availability and eleven-nines durability?

PS- this is a great resource! Keep up the great work!

efschu commented 5 years ago

For your "Unit / stress tests" you can use ATTO Disk Benchmark and let it verify its data, then make your controllers crash in all possible ways - but as we all use sync=always (don't we ;-) ), you will never see any inconsistency.
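Setting and verifying that property is trivial if anyone wants to reproduce the test; a minimal sketch, with the dataset name tank/data as a placeholder:

```python
#!/usr/bin/env python3
# Sketch: force synchronous semantics on the shared dataset so that every
# acknowledged write hits stable storage before the client is told it is
# done. The dataset name "tank/data" is a placeholder.
import subprocess

subprocess.run(["zfs", "set", "sync=always", "tank/data"], check=True)
print(subprocess.run(["zfs", "get", "-H", "-o", "value", "sync", "tank/data"],
                     check=True, capture_output=True, text=True).stdout.strip())
```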

tullis commented 3 years ago

I would also be keen to hear about any recommended ZFS backup strategies for this type of system please.

If performing a zfs send on the active host, would a running backup prevent timely unmounting and failover of the zpool resource to the standby host?

Should I be performing file-level backup from an NFS client instead, for safety?
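My working assumption for the zfs send question above is that the send reads from an immutable snapshot, so a failover would simply kill the sending process and the stream could be re-issued (or the receive resumed) from the new active node; is that right? Something like this rough, killable, snapshot-based push to an off-cluster host is what I had in mind, where the dataset tank/data, the host backupbox, and the target backup/data are all placeholders:

```python
#!/usr/bin/env python3
# Snapshot-based, killable backup sketch. Assumptions: dataset "tank/data",
# off-cluster backup host "backupbox", target dataset "backup/data". The
# stream reads from an immutable snapshot, so if the cluster manager kills
# this pipeline during failover, the snapshot can simply be re-sent (or the
# receive resumed) from the new active node.
import datetime
import subprocess

snap = "tank/data@backup-" + datetime.datetime.now().strftime("%Y%m%d%H%M%S")
subprocess.run(["zfs", "snapshot", snap], check=True)

send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
subprocess.run(["ssh", "backupbox", "zfs", "receive", "-s", "-F", "backup/data"],
               stdin=send.stdout, check=True)
send.wait()
```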