The dedup algorithm is very fast and very well implemented by Mr. Mahoney. Using a faster hash (for example BLAKE3 with hardware acceleration, XXH3, or even SHA-1 with hardware acceleration on AMD Ryzen) gives only a limited benefit (~10%), not worth breaking compatibility.
The deduplication stage actually takes little of the overall time. For big files the major problem is bandwidth: reading back 10 TB will take a long time (even 400 GB will). zpaqfranz can use multithreaded hashing with the sum() command (the hasher) and the -all switch, which is much faster on solid-state drives (on my PC up to 17 GB/s).
But for a single file this cannot be done. Or rather, I have to think about it.
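Just to make the idea concrete (this is not zpaqfranz's actual code): a minimal C++ sketch of per-file parallel hashing in the spirit of the -all switch, launching one hashing task per file with std::async. FNV-1a is only a stand-in for the real hashes (SHA-1, XXH3, ...), and everything else here is illustrative.

```cpp
// Sketch of per-file parallel hashing (one task per file, like -all).
// FNV-1a stands in for the real hash; not zpaqfranz code.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <future>
#include <string>
#include <vector>

static uint64_t hash_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<char> buf(1 << 20);               // 1 MB read buffer
    uint64_t h = 14695981039346656037ull;         // FNV-1a offset basis
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            h = (h ^ (unsigned char)buf[i]) * 1099511628211ull;   // FNV-1a prime
    }
    return h;
}

int main(int argc, char** argv) {
    std::vector<std::future<uint64_t>> jobs;
    for (int i = 1; i < argc; ++i)                // one asynchronous task per file
        jobs.emplace_back(std::async(std::launch::async, hash_file, std::string(argv[i])));
    for (int i = 1; i < argc; ++i)                // collect results in input order
        std::printf("%016llx  %s\n", (unsigned long long)jobs[i - 1].get(), argv[i]);
    return 0;
}
```

This helps when there are many files to spread across cores; a single huge file still has to go through one sequential read-and-hash pass.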
Check your (max) SHA-1 speed with the b (benchmark) command and -sha1:
zpaqfranz b -sha1
On my PC it is more than 900 MB/s, faster than SSD bandwidth (though not NVMe).
My setup is an AMD 5950X, 128 GB RAM, 980 PRO NVMe. I copied one 221 MB file 160 times; the sum of all files is ~33 GB.
All files are cached in RAM, so there is no read access during add. The CPU is constantly busy at 5% at 4.60-4.70 GHz, HDD at 0%.
Here is some output:
D:\@@@>zpaqfranz a \bb -threads 32
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Integrity check type: XXHASH64+CRC-32 + CRC-32 by fragments
/bb.zpaq: 1 versions, 8 files, 2.924 fragments, 19.012.625 bytes (18.13 MB)
Updating /bb.zpaq at offset 19.012.625 + 0
Adding 33.912.015.680 (31.58 GB) in 160 files at 2021-09-21 13:44:20
6.25% 00:03:05 ( 1.97 GB) of ( 31.58 GB) 168.44 MB/sec

D:\@@@>zpaqfranz a \bb -threads 32 -all
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Integrity check type: XXHASH64+CRC-32 + CRC-32 by fragments
/bb.zpaq: 1 versions, 8 files, 2.924 fragments, 19.012.625 bytes (18.13 MB)
Updating /bb.zpaq at offset 19.012.625 + 0
Adding 35.607.616.464 (33.16 GB) in 168 files at 2021-09-21 13:44:37
100.00% 00:00:00 ( 33.16 GB) of ( 33.16 GB) 156.49 MB/sec
176 +added, 9 -removed.
19.012.625 + (35.607.616.464 -> 0 -> 1.022.929) = 20.035.554
222.156 seconds (000:03:42) (all OK)

D:\@@@>zpaqfranz a \bb * -threads 32 -force
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Integrity check type: XXHASH64+CRC-32 + CRC-32 by fragments
/bb.zpaq: 2 versions, 184 files, 2.924 fragments, 20.035.554 bytes (19.11 MB)
Updating /bb.zpaq at offset 20.035.554 + 0
Adding 35.607.616.464 (33.16 GB) in 168 files at 2021-09-21 13:51:54
100.00% 00:00:00 ( 33.16 GB) of ( 33.16 GB) 156.49 MB/sec
176 +added, 186 -removed.
20.035.554 + (35.607.616.464 -> 0 -> 1.035.636) = 21.071.190
221.937 seconds (000:03:41) (all OK)

D:\@@@>zpaqfranz b
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Benchmarks: XXHASH64 XXH3 SHA-1 SHA-256 BLAKE3 CRC-32 CRC-32C WYHASH WHIRLPOOL MD5 SHA-3
Time limit 5 s (-n X)
Chunks of 390.62 KB (-minsize Y)
00000005 s XXHASH64: speed ( 5.93 GB/s)
00000005 s XXH3: speed ( 6.69 GB/s)
00000005 s SHA-1: speed ( 898.74 MB/s)
00000005 s SHA-256: speed ( 223.31 MB/s)
CPU feature 001F
00000005 s BLAKE3: speed ( 3.47 GB/s)
00000005 s CRC-32: speed ( 8.87 GB/s)
00000005 s CRC-32C: speed ( 7.05 GB/s)
00000005 s WYHASH: speed ( 8.44 GB/s)
00000005 s WHIRLPOOL: speed ( 182.65 MB/s)
00000005 s MD5: speed ( 822.45 MB/s)
00000005 s SHA-3: speed ( 435.18 MB/s)
Results
WHIRLPOOL: 182.65 MB/s (done 913.24 MB)
SHA-256: 223.31 MB/s (done 1.09 GB)
SHA-3: 435.18 MB/s (done 2.12 GB)
MD5: 822.45 MB/s (done 4.02 GB)
SHA-1: 898.74 MB/s (done 4.39 GB)
BLAKE3: 3.47 GB/s (done 17.38 GB)
XXHASH64: 5.93 GB/s (done 29.57 GB)
XXH3: 6.69 GB/s (done 33.35 GB)
CRC-32C: 7.05 GB/s (done 35.13 GB)
WYHASH: 8.44 GB/s (done 42.08 GB)
CRC-32: 8.87 GB/s (done 44.20 GB)
55.031 seconds (000:00:55) (all OK)

D:\@@@>zpaqfranz x \aa -all -force
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
/aa.zpaq: 2 versions, 81.308 files, 291.801 fragments, 2.254.200.954 bytes (2.10 GB)
Non-latin (UTF-8) 81
Extracting 49.202.038.464 bytes (45.82 GB) in 81.294 files -threads 32
98.41% 00:00:00 ( 45.09 GB) of ( 45.82 GB) 756.97 MB/sec
74.422 seconds (000:01:14) (all OK)
I have just about the same configuration, and as you can see SHA-1 runs at about 900 MB/s. zpaq deduplicates in a single thread (I am working on making it multithreaded, but it is not so easy): it reads the whole file, 4 KB at a time, then calculates SHA-1 and does a lot of other things. The problem is that you need to read the entire file from the media. If the file is huge (e.g. a vmdk) you will get a maximum speed of about 900 MB/s (the SHA-1 limit) in the deduplication stage. If you use a spinning drive (or a SATA SSD) you will have no bottleneck in this case (about 150 MB/s from disk, 500 MB/s from SSD).
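To make that ceiling concrete, here is a minimal sketch (again, not zpaqfranz's code; FNV-1a is only a stand-in for SHA-1, and the program name is made up) of a single-threaded loop that reads a file 4 KB at a time, hashes every byte and reports MB/s. Whatever one hashing thread can sustain is the cap for deduplicating one big file.

```cpp
// Sketch of the single-threaded bottleneck: read one big file 4 KB at a time,
// hash every chunk, report throughput. FNV-1a stands in for SHA-1.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { std::puts("usage: chunkspeed <file>"); return 1; }
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<char> buf(4096);                        // 4 KB reads, as described above
    uint64_t h = 14695981039346656037ull, total = 0;    // FNV-1a offset basis
    auto t0 = std::chrono::steady_clock::now();
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i)   // per-byte work is the ceiling
            h = (h ^ (unsigned char)buf[i]) * 1099511628211ull;
        total += (uint64_t)in.gcount();
    }
    double s = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    std::printf("%llu bytes, %.1f MB/s (hash %016llx)\n",
                (unsigned long long)total, total / s / 1e6, (unsigned long long)h);
    return 0;
}
```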
If the files are small (say thousands of .DOC files) then multithreading can help. You can check for yourself; try
zpaqfranz sum d:\something -sha1 -summary
and
zpaqfranz sum d:\something -sha1 -summary -all
So yes, a multithreaded deduplicator would be better with small files, a fast NVMe and many CPU cores, but in fact not much better in total time.
To speed things up a lot of work is needed, not only a faster deduplicator.
About t (test): there are two stages. In the first (as in 7.15) the check is done against the stored SHA-1s. In the second (zpaqfranz only) a CRC-32 (much faster) runs to detect SHA-1 collisions. If you test against a directory you will get at most the speed of SHA-1 (in fact you are re-reading the data in variable-sized chunks from disk, calculating SHA-1 and comparing with the stored values). This is in fact fast, very fast, for slow media.
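As a rough illustration of why the per-fragment CRC-32 stage is cheap (a sketch of the arithmetic only, using zlib for convenience; zpaqfranz's own storage and CRC code are not shown here): the CRC-32 of a whole file equals the combination of the CRC-32s of its fragments, so per-fragment CRCs computed at add time can be merged and compared against a whole-file CRC without rehashing anything with SHA-1.

```cpp
// Fragment-CRC idea: combine per-fragment CRC-32s into a whole-buffer CRC-32.
// Build with: g++ crcdemo.cpp -lz
#include <cstdio>
#include <zlib.h>

int main() {
    const unsigned char data[] = "fragment-one|fragment-two|fragment-three";
    const size_t len = sizeof(data) - 1, cut1 = 13, cut2 = 26;   // arbitrary fragment borders

    // CRC of the whole "file" in one pass.
    uLong whole = crc32(0L, data, (uInt)len);

    // CRC of each fragment, as if hashed independently during add.
    uLong c1 = crc32(0L, data, (uInt)cut1);
    uLong c2 = crc32(0L, data + cut1, (uInt)(cut2 - cut1));
    uLong c3 = crc32(0L, data + cut2, (uInt)(len - cut2));

    // Merge the fragment CRCs back into a whole-file CRC.
    uLong merged = crc32_combine(crc32_combine(c1, c2, (z_off_t)(cut2 - cut1)),
                                 c3, (z_off_t)(len - cut2));

    std::printf("whole  %08lx\nmerged %08lx -> %s\n", whole, merged,
                whole == merged ? "match" : "MISMATCH (corruption or collision)");
    return 0;
}
```

A mismatch at this stage points to corruption or a SHA-1 collision, which is exactly what the second stage is there to catch.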
With a much faster hash (e.g. XXHASH64 or XXH3) the v (verify) command runs much faster, but it is a check against the filesystem and not an archive-integrity check (you need something, the original files, online).
zpaqfranz a z:\1.zpaq c:\dropbox\dropbox
will create z:\1.zpaq with XXHASH64 hashes (the default in zpaqfranz).
running
zpaqfranz v z:\1.zpaq
will run a single-threaded XXHASH64 verify against the filesystem.
Note: if you are paranoid you can do
zpaqfranz a z:\1.zpaq c:\dropbox\dropbox -sha3
or -sha2, or blake3, or whatever
OK, then
zpaqfranz t z:\1.zpaq
will do a file integrity check (as said, two stages: the first as in 7.15 and the second for collisions), but
zpaqfranz t z:\1.zpaq c:\dropbox\dropbox
will invoke the SHA-1 chunked verify against the filesystem (something similar to 7.15).
If you are really paranoid
zpaqfranz p z:\1.zpaq
and more
zpaqfranz p z:\1.zpaq -verify
Thank you. Now I understand the problem better. The lack of multithreaded reading is of course limiting too, but most other utilities have the same limitation, so it's not so obvious. It's not a ZPAQ-specific problem...
Slow adding seems the most limiting for me recently. Your idea of multithreaded reading from disk is cool too, but it seems complex to implement.
Please check (from Task Manager) whether the "slow adding" is caused by... reading. If zpaq/zpaqfranz, during the add of something big (a .vmdk etc.), reads constantly at (for example) 400 MB/s, then it is mainly a media-bandwidth limitation (no cache involved here). If it reads at 20 MB/s (just an example), something weird is going on.
Adding requires re-reading everything from the filesystem, hashing, then "doing the rest". With vmdks, for example, it is rather normal to get 1 hour of almost nothing (read... read... read...) and maybe 5 minutes of writing to the archive.
I'll add a timer for the dedup stage, with something like "starting dedup stage"... "ended dedup in 2000 s, let's do something..."
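For what it's worth, a trivial sketch of that kind of stage timer with std::chrono (not the actual patch that will land in zpaqfranz):

```cpp
// Trivial sketch of a "dedup stage" timer: print a message when the stage
// starts and the elapsed seconds when it ends.
#include <chrono>
#include <cstdio>

struct StageTimer {
    const char* name;
    std::chrono::steady_clock::time_point start;
    explicit StageTimer(const char* n) : name(n), start(std::chrono::steady_clock::now()) {
        std::printf("starting %s stage\n", name);
    }
    ~StageTimer() {
        double s = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
        std::printf("ended %s in %.3f s, let's do something...\n", name, s);
    }
};

int main() {
    {
        StageTimer t("dedup");
        // ... read, fragment and hash the input here ...
    }   // timer prints on scope exit
    return 0;
}
```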
I verified this; the problem is not slow reading and/or writing. The 10 TB of source data is stored on 3x 870 QVO 8 TB (RAID 0). During add/reread they sit mostly idle; their cumulative read speed is well above 1 GB/s, up to 1.4 GB/s. The destination media are HDDs that can absorb writes at about 350 MB/s; they are busy up to 20% during the initial add and mostly idle afterwards. In my case, on a Core i7-6700K, the add rate is about 105-110 MB/s. To add 10 TB in 24 h the cumulative speed needs to be at least 120 MB/s.
The initial add takes about 35 hours and the CPU is busy. On the 2nd, 3rd and later adds the CPU sits mostly idle during deduplication - only 1 core is used - because most of the data is already in the archive.
Using decent Xeon machines with about 1 GB/s of real bandwidth I get ~500 GB/hour for .vmdk updates, so ~10 TB per day, or ~110 MB/s sustained.
To get more speed I run more than one process in parallel (NVMe drives on zfs, so no latency problem with concurrent access), one per virtual machine, capped with -t2 (no more than 2 threads each), so I can run 2 to 8 updates (depending on the hardware).
As stated, I will embed some (optional) profiling to see where the software spends its time.
Ok, I will test this workaround - more than one process in parallel for different virtual machines, capped by -t2
Running a multithreaded deduplicator would degrade the dedup ratio, so for now I am closing this.
A major limitation of the original ZPAQ is its slow dedup speed. If I remember correctly it is single-threaded. For example, the max dedup speed on a Core i7-6700 is about 120 MiB/s. This becomes a major bottleneck in the era of SSD/NVMe. I'm trying to run a 10 TB daily backup, but due to the slow dedup speed the backup is never able to complete within 24 h.
Is there an easy way to make the dedup algorithm faster... or any other workarounds?