fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

Fast backup strategy but slow testing #139

Open mirogeorg opened 5 days ago

mirogeorg commented 5 days ago

Hello Franco,

I have the following issue and I'm wondering what the best way to solve it might be. I create large archives, with a new part for each day—arch0001.zpaq, arch0002.zpaq, etc. After completing an archive, I run a test to ensure everything is okay. However, over time, I can't be sure that nothing will get damaged during transport and storage. For this purpose, I could generate CRC/MD5 checksums and transport them along with the files, but this requires reading them again just for the checksums.

After transport, the archives can be easily tested, even on a Synology NAS.

As the volumes grow, the server that tests them may not be able to finish everything within 24 hours, overlapping with the next backup cycle.

Does ZPAQFRANZ have an appropriate option that I can use, for example, to display the CRC for each tested part at the end?

There's a type of ransomware that corrupts files in such a way that it's not obvious at first glance. It damages small parts. You make backups, everything seems calm, but at some point, you find out that everything is corrupted.

Or perhaps you could share your experience, insights, or ideas. How do you test your archives, how often, and how many parallel ZPAQFRANZ processes do you run to reduce the total testing time?

I back up 120 VMs to a network server. I shut down all the VMs, create a VSS snapshot, and then turn them back on. For each VM, I run zpaqfranzhw a \\server\share\archive????.zpaq -m2. When archiving, I run 120 parallel copies of ZPAQFRANZ with the lowest process and I/O priority, which I set with an external script. Everything gets done pretty fast. However, then I test, and I do that locally on the server where the archives are stored. I can't run as many parallel processes there, and the testing becomes the bottleneck, which stretches the backup window over time, and that worries me.
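Roughly like this, one launch per VM (just a simplified sketch: the share and VM names are made up, and start /low only lowers the CPU priority, so the I/O priority still has to come from the external tool):

start "" /low /b zpaqfranzhw a \\server\share\vm001\archive????.zpaq D:\VSS\vm001 -m2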

fcorbelli commented 5 days ago

Of course there already is 😄 -checktxt, -fasttxt, and the backup command

fcorbelli commented 5 days ago

Let's start by saying that you have a strange system; usually, in this case, I use a ZFS filesystem (to store virtual disks) to take ZFS snapshots and perform backups from there (using the zfsbackup command). Anyway, to get back to the topic, it essentially depends on whether you maintain a single file (i.e., not a multipart) or not.

In the case of a single file, the -fasttxt switch generates a text file (_crc32.txt) that contains the CRC-32 of the written file (-checktxt creates an MD5). There is a fundamental difference: in the second case (MD5), the file is READ again after being created; in the first case, it is not. The second is intended for multivolume files (then we will see the backup and testbackup commands), while the first is for single files.

Let's take an example. Suppose we have a 100GB .zpaq file. Let's say we update it, making it 101GB. With -checktxt, zpaqfranz will read all 101GB again to create the corresponding MD5 code. This is slow. -fasttxt (which, by the way, also stores data inside ADS :) on the other hand... reads nothing. The update time is practically zero (you only pay for the computation time for the new 1GB). Magic? Well, almost; it took me quite a bit of effort.

To recap: using -fasttxt will write (multipart archive) a _crc32.txt file (for the new part). For a single archive, that is, a single-file zpaq, it will UPDATE the text file in virtually no time. Using -checktxt will create an _md5.txt file, reading the files FROM SCRATCH. Using -checktxt -backupxxh3 will also write an _md5.txt file (for backward compatibility reasons) but will use XXH3.
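A tiny sketch of the difference, with placeholder paths:

rem daily update of a single-file archive: _crc32.txt is updated WITHOUT re-reading the archive
zpaqfranz a z:\arch.zpaq d:\data -fasttxt
rem same update, but an _md5.txt is written: the whole archive is re-read after the add
zpaqfranz a z:\arch.zpaq d:\data -checktxt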

fcorbelli commented 5 days ago

You can automate this procedure using the backup command and, conversely, testbackup, potentially with -backupxxh3. This is the method I use for multipart backups. Using "heavy" hashers does not work like -fasttxt, meaning zpaqfranz READS the entire file again to calculate the hash (* perhaps I could explain a potential FRANZHASH here, which is a quasi-hash, but never mind).

At that point, the control over the parts of the backup can be done in various ways, depending on the availability of the password (obviously in cases where they are encrypted). The first method is simply to use testbackup filename.zpaq. This essentially checks for the presence of all the pieces (which is not a given) and uses the QUICK HASH, which is a fake hash computed from the first 64KB, the middle 64KB, and the last 64KB. The purpose is to catch rsync misalignment. In short, it's a quick-and-dirty test. It obviously does not require the password.

W:\backup\orecchia\copie>zpaqfranz testbackup nas_orecchia_og
zpaqfranz v60.8k-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2024-10-19)
franz:testbackup                                _ - command
franz:-hw
fixed filename ./nas_orecchia_og
02721$ *** WARNING: original backup path <</share/cloud/copia/>> != current index path <<./>>
02722$ *** I highly recommend using -paranoid! ***
02723$ *** I highly recommend using -paranoid! ***
02724$ *** I highly recommend using -paranoid! ***
====================================================================================================
Multipart backup looks good
Loading backupfile... ./nas_orecchia_og_00000000_backup.txt
Rows in backup 00000039 from 00000001 to 00000039
Enabling XXH3 (in reading) hasher

Quick hash check

GLOBAL SHA256: BA86C3325251440BBDE51F23E601614C30E1AE5CA447381555DF623305EB21DA
Chunks checked OK: 39 (153.525.024.923 @ 2.398.828.514.421/s)
Last date 2024-10-17 20:16:31

Here you get some useful info:

1) the last backup was done on 17-10-2024
2) it is made of 39 chunks
3) the hash of the hashes is BA86...

fcorbelli commented 5 days ago

If you want to make a "real" re-hashing

zpaqfranz testbackup nas_orecchia_og -verify

adding -ssd (multithreaded) if the archives are on SSDs

If you want to increase the level of testing, you must have the archive password (if it exists). This test essentially compares the index of the .zpaq file with the index recomputed from the various chunks in it (aka: you can't replace a .zpaq chunk even by editing the .txt file by hand)

W:\backup\orecchia\copie>zpaqfranz testbackup nas_orecchia_og -paranoid -quick -ssd
zpaqfranz v60.8k-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2024-10-19)
franz:testbackup                                _ - command
franz:-quick -hw -paranoid -ssd
fixed filename ./nas_orecchia_og
====================================================================================================
Multipart backup looks good
Loading backupfile... ./nas_orecchia_og_00000000_backup.txt
Rows in backup 00000040 from 00000001 to 00000040
Enabling XXH3 (in reading) hasher

Quick hash check
Files             40 -> 016 threads ->             40 to get

GLOBAL SHA256: C856EFF767F3153EECA151583A0267029315DD0D86B88C57694250DB2DEA47A0
Chunks checked OK: 40 (153.526.129.098 @ 1.943.368.722.759/s)
Last date 2024-10-18 20:16:34
----------------------------------------------------------------------------------------------------
00205! Archive seems encrypted (or corrupted)
Enter password :**************

<<./nas_orecchia_og_00000000_backup.index>>:
40 versions, 369.361 files, 99.399.395 bytes (94.79 MB)
02993$ Long filenames (>255)      4.489 *** WARNING *** (suggest -longpath or -fix255 or -flat)

----------------------------------------------------------------------------------------------------

<<./nas_orecchia_og_????????.zpaq>>:
40 versions, 369.361 files, 153.526.129.098 bytes (142.98 GB)
02993$ Long filenames (>255)      4.489 *** WARNING *** (suggest -longpath or -fix255 or -flat)

dtsize                   368.796 [i]               368.796 [c]: OK
usize            293.600.393.007 [i]       293.600.393.007 [c]: OK
allsize          293.600.393.007 [i]       293.600.393.007 [c]: OK
compressed       153.526.129.098 [i]       153.526.129.098 [c]: OK
6.860s (00:00:06,heap 0|array 114.30MB|dt 794.47MB=>908.78MB) (all OK)

In this example (you can mix -paranoid and -verify) you'll see (in the last lines) a match between the index and the chunks

I keep them encrypted, incidentally, just to make sure that no one can change them, not even ransomware, because it would have to know the password with which they are encrypted. Yes, I understand that's pretty paranoid, but that's how it is

fcorbelli commented 5 days ago

If you store backups on *nix machines that you can access via SSH, then usually a script similar to this would work:

Just a snippet (but a real-world one)

@echo off
SETLOCAL
SET CWRSYNCHOME=w:\nz\cloud
mkdir %CWRSYNCHOME%\logs >nul 2>&1
del %CWRSYNCHOME%\logs\*.txt >nul 2>&1

REM the key is here (arrrgghhhhh!!!!)
SET PATH=%CWRSYNCHOME%\BIN;%PATH%

%CWRSYNCHOME%\zpaqfranz work datetime -terse >%CWRSYNCHOME%\logs\risultato.txt

%CWRSYNCHOME%\zpaqfranz work big "r interna" -terse >>%CWRSYNCHOME%\logs\risultato.txt
%CWRSYNCHOME%\zpaqfranz sum w:\backup\rambo\rambo_interna\interna\replicata.zpaq -quick -terse >%CWRSYNCHOME%\logs\rambo1_locale.txt
%CWRSYNCHOME%\bin\ssh.exe -p theport -i %CWRSYNCHOME%\thekey root@wherever "/usr/local/bin/zpaqfranz sum /ssd/zroot/interna/replicata.zpaq -quick -terse -noeta" >%CWRSYNCHOME%\logs\rambo1_remoto.txt
%CWRSYNCHOME%\zpaqfranz comparehex %CWRSYNCHOME%\logs\rambo1_remoto.txt %CWRSYNCHOME%\logs\rambo1_locale.txt "QUICK:" 16 -big  >>%CWRSYNCHOME%\logs\risultato.txt
 (...)

%CWRSYNCHOME%\zpaqfranz work big "runmeifyoucan" -terse >>%CWRSYNCHOME%\logs\risultato.txt
%CWRSYNCHOME%\zpaqfranz dir w:\backup\corsa\proxmox\apezzi /on -terse -noeta -n 5 >>%CWRSYNCHOME%\logs\risultato.txt
%CWRSYNCHOME%\bin\ssh.exe -p theport -i %CWRSYNCHOME%\the_key root@wherever "/usr/local/bin/zpaqfranz testbackup /cloud/corsa/copie/apezzi/corsa.zpaq" >%CWRSYNCHOME%\logs\corsa_remoto.txt
%CWRSYNCHOME%\zpaqfranz testbackup w:\backup\corsa\proxmox\apezzi\corsa.zpaq -noeta -big >%CWRSYNCHOME%\logs\corsa_locale.txt
%CWRSYNCHOME%\bin\ssh.exe -p theport -i %CWRSYNCHOME%\thekey root@wherever "ls -l /cloud/corsa/copie/apezzi/corsa** |tail -n 5" >>%CWRSYNCHOME%\logs\risultato.txt
%CWRSYNCHOME%\zpaqfranz comparehex %CWRSYNCHOME%\logs\corsa_remoto.txt %CWRSYNCHOME%\logs\corsa_locale.txt "GLOBAL SHA256:" 64 -big  >>%CWRSYNCHOME%\logs\risultato.txt

(...)
%CWRSYNCHOME%\zpaqfranz work printbar "=" -terse >>%CWRSYNCHOME%\logs\risultato.txt
%CWRSYNCHOME%\zpaqfranz work datetime -terse >>%CWRSYNCHOME%\logs\risultato.txt
%CWRSYNCHOME%\zpaqfranz work big "conta gli ok" -terse >>%CWRSYNCHOME%\logs\risultato.txt

%CWRSYNCHOME%\zpaqfranz count %CWRSYNCHOME%\logs\risultato.txt 8 -big >>%CWRSYNCHOME%\logs\risultato.txt

if not errorlevel 1 goto va
if errorlevel 1 goto nonva

:nonva
%CWRSYNCHOME%\mailsend -t whereiwantto -f whoknows -starttls -port 587 -auth -smtp smtp.something -sub "*** ERRORE *** CASA KAPUTT" -user myuser -pass pippo -mime-type "text/plain" -disposition "inline" -attach "%CWRSYNCHOME%\logs\risultato.txt"
goto fine

:va
%CWRSYNCHOME%\mailsend -t whereiwantto -f whoknows -starttls -port 587 -auth -smtp smtp.something -sub "CASA OK" -user myuser -pass pippo -mime-type "text/plain" -disposition "inline" -attach "%CWRSYNCHOME%\logs\risultato.txt"
:fine
fcorbelli commented 5 days ago

OK, this seems a bit complex 😄

Steps:
1) download the remote files to local (usually rsync)
2) list the last 5 local files (looking for holes)
3) run, via ssh.exe, zpaqfranz testbackup on the remote server; get the data (aka: the global hash)
4) do a local zpaqfranz testbackup; get the data (aka: the global hash)
5) comparehex (of 3 and 4)
6) (repeat)
7) redirect everything to .txt files
8) zpaqfranz count the expected OKs
9) mail the logs to a different e-mail address depending on the outcome: a storage one (if everything is OK) or a "real" one (in case of errors)

fcorbelli commented 5 days ago

Then there is the version for when you can't run zpaqfranz remotely (e.g., a Hetzner storagebox), the whole-file version (not multipart), and so on. Some time ago, maybe I mentioned it, I had posted the whole thing on the FreeBSD forum (worth thousands, so to speak). Then I argued with a user and deleted everything

Incidentally, if you have different folders within a master folder, there is already the -home switch, to split the added files into separate archives.

/data/debian
/data/freebsd
/data/ubuntu
zpaqfranz a /cloud/thebackup /data -home
fcorbelli commented 5 days ago

(screenshot)

fcorbelli commented 5 days ago

Storagebox (upload via Linux): (screenshot)

fcorbelli commented 5 days ago

Keep in mind that the test should eventually include a full recoverability test. You can perform a relatively quick one using the t (test) command. A truly serious test involves using -paranoid, which means a complete extraction. This—besides taking time, which you can plan in stages—puts significant pressure on where you write the data. In this case, I use expendable SSDs (i.e., the cheapest, standard Samsung EVO ones) that I set up in striping (ZFS RAID-0), using 2 or 4 at a time. Then I enable deduplication and keep the base volumes.

If you extract 100GB from a virtual machine, this will write 100GB to the disks. Do this day in and day out, and the wear becomes significant. So, I keep 100GB of virtual machine data in a ZFS volume with deduplication. When I extract the 100GB (from the zpaq archive) and write it, ZFS will only actually write the blocks that differ from the original 100GB, significantly reducing the amount of data written. Then, every now and then, I swap the base folders.
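A rough sketch of such a scratch setup (FreeBSD/ZFS; the device names, pool name and paths are placeholders):

# two expendable SSDs in striping (ZFS "RAID-0"), deduplication on
zpool create scratch /dev/ada1 /dev/ada2
zfs set dedup=on scratch
# the paranoid test extracts into the deduplicated pool: only blocks that differ from the base copy are really written
zpaqfranz t /backup/vm_????????.zpaq -paranoid -to /scratch/restore -ssd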

Instead, when we talk about backing up very large virtual machines on systems without ZFS (i.e., ESXi, to be clear), I use a FreeBSD virtual machine (TrueNAS) with deduplication, which exposes the data folders via NFS. I perform the backups (from the hypervisor) inside the virtual machine, which will deduplicate the data on its own (rather quickly). We're still talking about SSDs, of course.

Then, "overnight," I run a zpaqfranz that archives the VMDKs from the TrueNAS machine at its own pace. I use a desktop machine with a very fast AMD CPU and a 10Gb network card that then writes the archived data to a NAS. So the TrueNAS VM will have something like /mnt/backup/copia_01, /mnt/backup/copia_02, /mnt/backup/copia_03 for the various days, while the "CPU cruncher" gradually compresses them at its own pace.

This also helps to minimize the restore time. In the TrueNAS machine (which physically lives on the NFS share of a NAS equipped with 7TB SSDs), the virtual disks are stored "in clear," meaning inside plain folders. In case of an emergency, I can use another ESXi machine to mount the TrueNAS NFS (e.g., /mnt/backup/copia_02) and boot directly from there, essentially in no time. Extracting hundreds of GB from zpaq can indeed take a long time. On ZFS systems, I obviously use other measures (like incremental backups).

fcorbelli commented 5 days ago

Enough, or do you want a further "spiegone" (big explanation)?

fcorbelli commented 5 days ago

Let’s add a few more details. If you are working with unencrypted multivolume files that were NOT created with -chunk, THEN the individual pieces are perfectly independent. This means you can test their contents independently of the previous parts and also extract them (in addition to listing them). This obviously applies to files, not to virtual disks (which are too large).

If you have a file server and archive the files in different pieces (backup001.zpaq, backup002.zpaq, backup003.zpaq...), you can perform a paranoid test on the LAST piece (for example, backup003.zpaq), extracting all its contents (I use a RAMDISK). Instead of checking every single file continuously (you'll usually have a large stock of files and then small daily additions), you can do it for the added data.

Paranoid test of the last piece

W:\nz\cloud>zpaqfranz t z:\okane_0002.zpaq -paranoid -to z:\unga2 -ssd
zpaqfranz v60.7p-JIT-GUI-L,HW SHA1/2,SFX64 v55.1,(2024-10-02)
franz:-to                   <<z:/unga2>>
franz:-hw -paranoid -ssd

<<z:/okane_0002.zpaq>>:
1 versions, 7.064 files, 1.429.352.920 bytes (1.33 GB)
Extract 1.452.110.206 bytes (1.35 GB) in 7.064 files (0 folders) / 16 T
        68.00% 00:00:00  ( 941.72 MB)=>(   1.35 GB)  470.86 MB/s

FULL-extract hashing check (aka:paranoid)

Total bytes                  1.452.110.206 (should be 1.452.110.206)
Bytes checked                1.452.110.206 (should be 1.452.110.206)
Files to be checked                  7.063
Files ==                             7.063 (should be 7.063)
Files !=                                 0 (should be zero)
Files deleted                        7.063 (should be 7.063)
RAM: heap 507.13 MB +array 641.92 KB +files 7.52 MB = 515.27 MB
4.719 seconds (00:00:04) (all OK)

Instead of (paranoid test of everything, just an example)

W:\nz\cloud>zpaqfranz t z:\okane_????.zpaq -paranoid -to z:\tutto -ssd
zpaqfranz v60.7p-JIT-GUI-L,HW SHA1/2,SFX64 v55.1,(2024-10-02)
franz:-to                   <<z:/tutto>>
franz:-hw -paranoid -ssd

<<z:/okane_????.zpaq>>:
2 versions, 10.565 files, 2.350.071.345 bytes (2.19 GB)
Extract 3.316.910.637 bytes (3.09 GB) in 10.565 files (0 folders) / 16 T
        77.39% 00:00:00  (   2.39 GB)=>(   3.09 GB)  815.98 MB/s

FULL-extract hashing check (aka:paranoid)

Total bytes                  3.316.910.637 (should be 3.316.910.637)
Bytes checked                3.316.910.637 (should be 3.316.910.637)
Files to be checked                 10.210
Files ==                            10.210 (should be 10.210)
Files !=                                 0 (should be zero)
Files deleted                       10.210 (should be 10.210)
RAM: heap 505.16 MB +array 0.00  B +files 11.24 MB = 516.40 MB
7.969 seconds (00:00:07) (all OK)
fcorbelli commented 4 days ago

I am writing two different hashes: ZETA and ZETAENC (!).

Essentially in the next version of the backup command it will be possible to store the ZETA hashes (which are basically xxhash64 minus 104 bytes) and the CRC-32 of the individual chunks WITHOUT having to re-read them from disk.

Essentially, once compression is finished, the data is written and that's that (admittedly with a little slowdown due to internal computations).

I don't think you can do anything faster than that.

Unfortunately, the 104 bytes cannot be included in the hash, but they are included in the CRC-32.

Basically you can use any program to compare the CRC-32 against the .zpaq chunks you create, whereas to check the hash you have to use zpaqfranz.

At present I use xxhash64, but I am undecided whether to go for something really "heavy" (e.g., SHA-256). I am also thinking of including the 104 bytes here as well, albeit out of order, to permanently prevent any kind of modification. However, this would be a problem for NASes with anemic CPUs

Is that enough for you?

fcorbelli commented 3 days ago

You can try this one (not finished, does not work with encryption etc) 60_8l.zip

Do something like this:

zpaqfranz backup z:\uno.zpaq c:\nz -backupzeta
zpaqfranz backup z:\uno.zpaq c:\ut -backupzeta
zpaqfranz backup z:\uno.zpaq c:\1200 -backupzeta

Then...

zpaqfranz testbackup z:\uno.zpaq -verify
mirogeorg commented 3 days ago

Tested. Looks good. Then tested with zpaqfranz testbackup but without -verify. Is this expected behavior? (screenshot)

mirogeorg commented 3 days ago

If I understand correctly, the simplified backup scheme should be as follows: ZPAQFRANZ backup archive -backupzeta

After the process is complete, I can test the created archive with something like: ZPAQFRANZ t archive_????????

Once tested, if I'm not adding new parts, in the future I can test it only with: ZPAQFRANZ testbackup archive -verify to simply check if the hashes match, without running the slower test.

Did I understand correctly?

fcorbelli commented 3 days ago

Spiegone is coming 😄 First of all, update to the latest version (60.8n)

The backup command will create a regular multipart archive (with the ???????? pattern), a separate index file, and a .txt file (with the hashes inside). By default it stores QUICK hashes (xxhash of the first 64KB + middle 64KB + last 64KB) and MD5 hashes (=> a full re-read of the created archive just after it is done)

With -backupxxh3 => XXH3 hashes (=> full re-read)
With -backupzeta => ZETA hashes (NO further read from the filesystem)

You can do whatever you want; just use the 8-? pattern, or hope for the heuristic (for example, zpaqfranz t archive will work). List, extract, merge, test, whatever. It is a plain zpaq archive.
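For example (same archive as above), the usual zpaq-style commands work directly on the whole set:

zpaqfranz l z:\uno_????????.zpaq
zpaqfranz t z:\uno_????????.zpaq
zpaqfranz x z:\uno_????????.zpaq -to z:\restored

(z:\restored is just a placeholder destination.)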

Then

zpaqfranz testbackup z:\okane

will run QUICK hash checks. AKA: no holes (missing part). Size matches. QUICK hashes are OK

If you add -verify, a full re-read will be enforced, with the stored hash (MD5, XXH3 or ZETA). A -paranoid will compare the index with the rebuilt dtmap

You even get the hash of the hashes (GLOBAL SHA256). If two zpaqfranz backups get the very same GLOBAL SHA256 (after testbackup), they are identical
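Put together, the three levels on the example archive would be something like this (add -ssd on solid-state storage):

zpaqfranz testbackup z:\uno.zpaq
zpaqfranz testbackup z:\uno.zpaq -verify -ssd
zpaqfranz testbackup z:\uno.zpaq -paranoid

The first checks presence, sizes and QUICK hashes; the second re-reads everything with the stored hash (MD5, XXH3 or ZETA); the third also compares the index with the rebuilt dtmap.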

Do you want more "spiegone"??

fcorbelli commented 3 days ago

Recap. The quick hash (which you can calculate with the sum command and -quick), instead of reading the ENTIRE file (it does if it's shorter than 64KB), performs a sort of "sample test" on the initial, middle, and final parts. This serves to quickly detect rsync errors with --append, that is, when the file length is correct but the final part is corrupted. It is not, obviously, a reliable hash. On the other hand, it's very fast even on large files and serves its purpose. Without parameters, this one is used (i.e., quick hashes).

With -verify (which also works multithreaded with -ssd), a full hash rebuild is launched: MD5 (for backward compatibility) or XXH3 (-backupxxh3). The latter is five times faster and is normally preferred on modern machines.

The new -backupzeta switch uses the new ZETA hash (in the future also ZETAENC). What is it? It's an XXHASH64 that skips the first 104 bytes and takes the rest (in the case of ZETAENC, also the first 32, which is the encryption salt), plus the FULL CRC-32 of the file. These data are NOT re-read from the filesystem; after writing the file okane_0000034.zpaq, it is NOT re-read, but calculated during the creation of the new archive (obviously, it slows down a bit, around 1 second per gigabyte roughly). When you run testbackup -verify (with the ZETA hash), it will then check that the entire file (except for the first 104 bytes) is identical in hash to the expected one (soon it will also verify that the CRC-32 of the file is correct). I could also add those 104 bytes, but it's complicated, so I've left it out. If you're curious about the why and how, it depends on the storage format of .zpaq files, which is the jidacheader. Currently, it doesn't work with encrypted archives (though I'm optimistic about making it work, it's just a matter of work).

You can see the real CRC-32 (i.e., of the entire file) inside the .txt file, in the part that begins with four w's:

$zpaqfranz backupfile|2|ZETA|2024-10-21 13:44:44|z:/1_????????.zpaq
zzzz8fb183c1125eed6dwwww17fae82e |[                9.742] <985455BE95A41D4B> $2024-10-21 13:44:44$ z:/1_00000001.zpaq

In this example, the file 1_00000001.zpaq gets CRC-32 17fae82e, QUICK hash 985455BE95A41D4B, and XXHASH64 (excluding the first 104 bytes) 8fb183c1125eed6d
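Since that CRC-32 covers the whole chunk, any external tool can cross-check it, as said. For example, assuming 7-Zip is available:

7z h z:\1_00000001.zpaq

should report the same CRC32 (17fae82e) for the file.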

fcorbelli commented 3 days ago

As mentioned, I might change the function to also include the 104 bytes, just not in the order of XXHASH64. It depends on how much I feel like doing it. So, keep in mind that the .txt files generated (today) might not be compatible with future ones.

mirogeorg commented 2 days ago

OK it's clearer now. I'll switch to backup -backupzeta and testbackup -verify