Intensity opened this issue 7 years ago
@Intensity That's an interesting suggestion, and I appreciate the clear formulation. It would help one feel confident about the later restore process.
My simpler take on this is: for a given backup/restore script pair, before backing up actual data, I echo "ok" into the backup script and check that I get "ok" back from the restore script. I trust that if it works with 2 characters, it works with terabytes too.
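A minimal sketch of that smoke test, assuming the backup and restore sides are plain stdin/stdout commands (`backup.sh` and `restore.sh` are placeholder names, not anything scat ships with):

```go
// Round-trip a tiny probe through a backup command and a restore command,
// and check that it comes back unchanged.
package main

import (
	"bytes"
	"fmt"
	"log"
	"os/exec"
)

func roundTrip(input []byte) ([]byte, error) {
	// Feed the probe through the backup side.
	backup := exec.Command("./backup.sh") // placeholder command
	backup.Stdin = bytes.NewReader(input)
	stored, err := backup.Output()
	if err != nil {
		return nil, fmt.Errorf("backup failed: %w", err)
	}

	// Feed the stored form back through the restore side.
	restore := exec.Command("./restore.sh") // placeholder command
	restore.Stdin = bytes.NewReader(stored)
	return restore.Output()
}

func main() {
	probe := []byte("ok")
	got, err := roundTrip(probe)
	if err != nil {
		log.Fatal(err)
	}
	if !bytes.Equal(got, probe) {
		log.Fatalf("round trip mismatch: %q != %q", got, probe)
	}
	fmt.Println("round trip ok")
}
```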
For added peace of mind, your suggestion seems ideal, and it does take the streaming nature (unknown size) into account. However, I see at least two problems:
1) I don't yet see how to reliably infer the reverse from a given proc string.
2) Regarding:

> One method of validating is to "trust" a query to the cloud provider for the checksum of the block (if this information is available through their API)

I guess the only info the provider would give is the checksum of the uploaded file, not the checksum of the chunk that results from processing several uploaded files (such as data + parity shards), meaning we would have to download the files anyway. We could only check the integrity of the uploaded file itself, by comparing the checksum contained in its filename to theirs. But then we would have a special case for the rclone proc (if rclone allows retrieving provider checksums), since cp (local fs), etc. don't have this.
For most common proc strings, make it possible to run a simultaneous "counterpart" process that "proves", inline, that the operation's reverse will succeed.
Taking one operation (compression) in isolation to demonstrate: the idea is to run the reverse process while the data is still being ingested, to confirm that there won't be any issues with a future restore. Suppose a compression algorithm has a bug where some input can't be decompressed properly (maybe it even leads to a crash), and this isn't known until decompression time. This would catch that. In the worst case, one could also keep around a copy of the scat binary and its supporting libraries/programs to ensure that reconstruction remains possible even amid changes and updates to the underlying libraries.
The compression validation would involve decompressing and confirming that the result equals the original input (once validation succeeds, the decompressed block can be discarded and the next block of the stream processed). The validation phase might "lag" behind the input, but arbitrarily sized input streams can be processed practically, without the entire stream having to be written out before validation starts. For example, to confirm that terabytes of input can be successfully decompressed, one should not need to write the entire result first.
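A minimal sketch of this per-block check, using gzip as a stand-in for whichever compression proc is actually in the chain: each block is compressed, immediately decompressed again, and compared to the original before the next block is read, so validation never needs the whole stream written out first.

```go
// Per-block compress-then-verify over a stream read from stdin.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

// verifyBlock compresses one block and proves it decompresses back to the input.
// In a real chain, the returned compressed bytes would be passed downstream.
func verifyBlock(block []byte) ([]byte, error) {
	var compressed bytes.Buffer
	zw := gzip.NewWriter(&compressed)
	if _, err := zw.Write(block); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}

	// Reverse step: decompress and compare against the original block.
	zr, err := gzip.NewReader(bytes.NewReader(compressed.Bytes()))
	if err != nil {
		return nil, err
	}
	restored, err := io.ReadAll(zr)
	if err != nil {
		return nil, err
	}
	if !bytes.Equal(restored, block) {
		return nil, fmt.Errorf("decompressed block differs from input")
	}
	return compressed.Bytes(), nil
}

func main() {
	buf := make([]byte, 4<<20) // 4 MiB blocks; the size is arbitrary here
	for {
		n, err := io.ReadFull(os.Stdin, buf)
		if n > 0 {
			if _, verr := verifyBlock(buf[:n]); verr != nil {
				log.Fatal(verr)
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
	fmt.Fprintln(os.Stderr, "all blocks verified")
}
```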
Similar validation can be put in place for encryption (the reverse being decryption, whether a symmetric or hybrid cryptosystem is in use). Likewise, ECC/parity procs can confirm, when run in reverse, that redundantly expanded data restores to the original. Deduplication can also be validated.
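The same pattern sketched for encryption, assuming AES-GCM as the symmetric cipher (key handling is deliberately simplified here): each block is encrypted, immediately decrypted with the same key, and compared to the plaintext before being passed on.

```go
// Per-block encrypt-then-prove-decryptable using AES-GCM.
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"log"
)

// sealAndProve encrypts block and proves the ciphertext decrypts back to it.
func sealAndProve(aead cipher.AEAD, block []byte) ([]byte, error) {
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce to the ciphertext so the reverse step is self-contained.
	ciphertext := aead.Seal(nonce, nonce, block, nil)

	// Reverse step: decrypt and compare against the original plaintext.
	restored, err := aead.Open(nil, ciphertext[:aead.NonceSize()], ciphertext[aead.NonceSize():], nil)
	if err != nil {
		return nil, fmt.Errorf("decryption check failed: %w", err)
	}
	if !bytes.Equal(restored, block) {
		return nil, fmt.Errorf("decrypted block differs from input")
	}
	return ciphertext, nil
}

func main() {
	key := make([]byte, 32) // demo key; a real chain would derive/manage this properly
	if _, err := rand.Read(key); err != nil {
		log.Fatal(err)
	}
	blockCipher, err := aes.NewCipher(key)
	if err != nil {
		log.Fatal(err)
	}
	aead, err := cipher.NewGCM(blockCipher)
	if err != nil {
		log.Fatal(err)
	}

	if _, err := sealAndProve(aead, []byte("example block contents")); err != nil {
		log.Fatal(err)
	}
	fmt.Println("block encrypts and provably decrypts")
}
```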
I wanted to especially highlight storage/rclone validation. When I write to a remote cloud provider, I'd like a separate thread to reconfirm that the data stored remotely is exactly the same as the blocks that have just been written. One method of validating is to "trust" a query to the cloud provider for the checksum of the block (if this information is available through their API); another approach is to fetch the payload in another thread and validate that it comes back as expected. Making this a separate thread issuing an independent API query could avoid overly "trusting" a local cache. Without validation of this kind, there is potential for silent information loss.
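A minimal sketch of the read-back approach, assuming the block was just uploaded to an rclone remote (the remote path below is a placeholder): a separate goroutine fetches the object again with `rclone cat` and compares its SHA-256 against the digest of what was uploaded, rather than trusting a local cache.

```go
// Independent read-back verification of a freshly uploaded block.
package main

import (
	"crypto/sha256"
	"fmt"
	"os/exec"
)

// verifyUpload re-downloads remotePath via rclone and checks it hashes to wantSum.
func verifyUpload(remotePath string, wantSum [sha256.Size]byte) error {
	out, err := exec.Command("rclone", "cat", remotePath).Output()
	if err != nil {
		return fmt.Errorf("read-back of %s failed: %w", remotePath, err)
	}
	if gotSum := sha256.Sum256(out); gotSum != wantSum {
		return fmt.Errorf("read-back of %s does not match what was uploaded", remotePath)
	}
	return nil
}

func main() {
	// In the real chain, block and remotePath would come from the upload step;
	// they are hard-coded here only to keep the sketch self-contained.
	block := []byte("example block contents")
	remotePath := "myremote:backup/blocks/0001" // placeholder remote path

	results := make(chan error, 1)
	go func() { results <- verifyUpload(remotePath, sha256.Sum256(block)) }()

	if err := <-results; err != nil {
		fmt.Println("validation failed:", err)
		return
	}
	fmt.Println("remote copy matches local block")
}
```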
At the top I said "make it possible", and by that I mean even if it requires the user to custom-tweak the chain construction to include this validation. In principle, the validating mode of operation could be constructed automatically from common pipeline usage patterns. If implemented, this would offer unique assurance that the data can be reconstructed.