darkstar62 / backup

Backup Program
BSD 3-Clause "New" or "Revised" License

Implement file checksumming #6

Open darkstar62 opened 12 years ago

darkstar62 commented 12 years ago

Ideally, file checksumming and chunk checksumming should be common to all storage backends. So there should be helper functions in the base classes that take care of checksumming, so we're not implementing this ten times.
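
A rough sketch of what such a shared helper could look like (the class and struct names here are hypothetical, not taken from the actual codebase):

```cpp
// Hypothetical sketch: a checksumming helper shared by all storage backends.
// Names (StorageBackend, ChecksumFile, ChunkChecksum) are illustrative only.
#include <cstdint>
#include <string>
#include <vector>

struct ChunkChecksum {
  uint64_t offset;     // Byte offset of the chunk within the file.
  uint64_t length;     // Chunk length in bytes.
  std::string digest;  // Hex-encoded hash of the chunk contents.
};

class StorageBackend {
 public:
  virtual ~StorageBackend() = default;

 protected:
  // Shared by every backend: hash a file chunk-by-chunk so no backend
  // has to re-implement the checksumming loop itself.
  std::vector<ChunkChecksum> ChecksumFile(const std::string& path,
                                          uint64_t chunk_size);
};
```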

darkstar62 commented 12 years ago

So after thinking about this a bit, I've come to the conclusion that using serialized ICE objects to hold the chunk hashes of every file in the backup is not scalable, mainly because the whole thing would have to be loaded into memory before any of it could be referenced. Large backups will easily run the server out of memory.

This could be mitigated by varying the chunk size, effectively putting an upper bound on the number of chunks per file. This still falls over, though, for backups with a huge number of files.
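
For example, something along these lines (the constants are made up, purely illustrative):

```cpp
// Illustrative: pick a chunk size that caps the number of chunks per file.
// kMinChunkSize and kMaxChunksPerFile are made-up values, not from the code.
#include <algorithm>
#include <cstdint>

constexpr uint64_t kMinChunkSize = 64ULL * 1024;  // 64 KiB floor.
constexpr uint64_t kMaxChunksPerFile = 4096;

uint64_t PickChunkSize(uint64_t file_size) {
  // Smallest chunk size that keeps the chunk count at or below the cap.
  uint64_t needed = (file_size + kMaxChunksPerFile - 1) / kMaxChunksPerFile;
  return std::max(kMinChunkSize, needed);
}
```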

Best would be to do the above, but store it in a SQLite database instead. This way, the hashes can be dealt with on demand, rather than loading the whole thing into memory first. It also keeps memory usage down during backups, since the hashes can be written to disk as the backup progresses.
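
One possible shape for that database, as a sketch only (the table layout is not a committed design):

```cpp
// Sketch of the on-disk hash store using SQLite (table layout is hypothetical).
// Hashes are written as the backup progresses and queried on demand,
// so the full set never has to live in memory.
#include <sqlite3.h>

bool InitHashDatabase(sqlite3* db) {
  const char* kSchema =
      "CREATE TABLE IF NOT EXISTS files ("
      "  file_id INTEGER PRIMARY KEY,"
      "  path TEXT NOT NULL,"
      "  size INTEGER NOT NULL);"
      "CREATE TABLE IF NOT EXISTS chunks ("
      "  file_id INTEGER NOT NULL REFERENCES files(file_id),"
      "  offset INTEGER NOT NULL,"
      "  length INTEGER NOT NULL,"
      "  md5 TEXT NOT NULL,"
      "  sha512 TEXT NOT NULL,"
      "  PRIMARY KEY (file_id, offset));";
  return sqlite3_exec(db, kSchema, nullptr, nullptr, nullptr) == SQLITE_OK;
}
```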

darkstar62 commented 12 years ago

There's the question of which hashing algorithm to use. MD5 has been shown to be pretty weak against collisions, but the SHA-2 algorithms are still considered strong. If the hashes from several different algorithms all match, the data is the same with very high probability. However, the more hashing algorithms involved, the slower things will run.

MD5+SHA512 is probably good enough, plus SHA512 tends to run faster than SHA256 on 64-bit machines.
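
A sketch of computing both digests in a single pass over each file, here using OpenSSL's EVP API purely as an example (the actual code may use a different crypto library):

```cpp
// Illustrative only: stream a file once and compute MD5 and SHA-512 together.
#include <openssl/evp.h>
#include <cstdio>
#include <string>
#include <vector>

bool HashFile(const std::string& path,
              std::vector<unsigned char>* md5_out,
              std::vector<unsigned char>* sha512_out) {
  FILE* f = std::fopen(path.c_str(), "rb");
  if (!f) return false;

  EVP_MD_CTX* md5_ctx = EVP_MD_CTX_new();
  EVP_MD_CTX* sha_ctx = EVP_MD_CTX_new();
  EVP_DigestInit_ex(md5_ctx, EVP_md5(), nullptr);
  EVP_DigestInit_ex(sha_ctx, EVP_sha512(), nullptr);

  char buf[64 * 1024];
  size_t n;
  while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0) {
    // Feed each block to both digests so the file is only read once.
    EVP_DigestUpdate(md5_ctx, buf, n);
    EVP_DigestUpdate(sha_ctx, buf, n);
  }
  std::fclose(f);

  unsigned int len = 0;
  md5_out->resize(EVP_MAX_MD_SIZE);
  EVP_DigestFinal_ex(md5_ctx, md5_out->data(), &len);
  md5_out->resize(len);

  sha512_out->resize(EVP_MAX_MD_SIZE);
  EVP_DigestFinal_ex(sha_ctx, sha512_out->data(), &len);
  sha512_out->resize(len);

  EVP_MD_CTX_free(md5_ctx);
  EVP_MD_CTX_free(sha_ctx);
  return true;
}
```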

darkstar62 commented 12 years ago

Implementing this for BTRFS, according to the design, means at least partially implementing the backup scanner, since the scanner is what computes the checksums for new files in new backups. It will be invoked immediately after a backup completes to checksum the files, keeping downtime minimal. Restores should still be allowed during this time, but no new incremental backups can be made until the checksumming finishes.
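
Roughly, the post-backup pass described above could look like this (all type and function names are hypothetical, not from the actual code):

```cpp
// Hypothetical sketch of the post-backup checksum pass described above.
#include <cstdint>
#include <string>
#include <vector>

enum class BackupState { kRunning, kChecksumming, kReady };

struct FileEntry {
  std::string path;
  uint64_t size;
};

struct BackupSet {
  BackupState state;
  std::vector<FileEntry> new_files;  // Files added by this backup.
};

// Defined elsewhere in this sketch: per-file chunk hashing and persistence.
void ChecksumAndStore(const FileEntry& file);

void OnBackupComplete(BackupSet* backup) {
  // Block new incremental backups while checksums are computed;
  // restores can still be served from the completed backup.
  backup->state = BackupState::kChecksumming;
  for (const FileEntry& file : backup->new_files) {
    ChecksumAndStore(file);
  }
  // Checksumming is done; incremental backups may proceed again.
  backup->state = BackupState::kReady;
}
```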