fcorbelli / zpaqfranz

Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
MIT License

Slow dedup speed #8

Closed mirogeorg closed 3 years ago

mirogeorg commented 3 years ago

A major limitation of the original ZPAQ is slow dedup speed. If I remember correctly, it's single-threaded. For example, the maximum dedup speed on a Core i7-6700 is about 120MiB/s. That has become a major bottleneck now, in the era of SSD/NVMe. I'm trying to run a 10TB daily backup, but due to the slow dedup speed the backup is never able to complete within 24h.

Is there an easy way to make the dedup algorithm faster, or any other workaround?

fcorbelli commented 3 years ago

The dedup algorithm is very fast and very well implemented by Mr. Mahoney. Using a faster hash (for example BLAKE3 with hardware acceleration, XXH3, or even SHA-1 with hardware acceleration on AMD Ryzen) gives only limited benefits (~10%), not worth the broken compatibility.

The deduplication stage actually takes only a small share of the overall time. For big files the main problem is bandwidth: reading back 10TB will take a long time (even 400GB does). zpaqfranz can use multithreaded hashing with the sum command (the hasher) and the -all switch, which is way faster on solid-state drives (on my PC up to 17GB/s).

But that cannot be done when reading a single file. Or, better, I'll have to think about it.
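To make the mechanism concrete, here is a rough, self-contained sketch of how content-defined deduplication of this kind works in principle. This is NOT zpaq's actual code: the rolling hash, the fragment size limits and the use of OpenSSL's SHA-1 are all simplifying assumptions.

    // Sketch of zpaq-style content-defined dedup: fragment a file on a rolling hash,
    // SHA-1 each fragment, and store only fragments whose hash was never seen before.
    // Assumptions: OpenSSL for SHA-1; rolling hash and size limits are simplified.
    // Build (Linux/Unix): g++ -O2 dedup_sketch.cpp -lcrypto
    #include <openssl/sha.h>
    #include <array>
    #include <cstdint>
    #include <cstdio>
    #include <set>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        std::FILE* f = std::fopen(argv[1], "rb");
        if (!f) { std::perror("open"); return 1; }

        std::set<std::array<unsigned char, SHA_DIGEST_LENGTH>> seen;  // hashes of known fragments
        std::vector<unsigned char> fragment;
        uint32_t rolling = 0;                 // toy rolling hash (the real one is a predictor)
        uint64_t total = 0, unique = 0;

        auto flush = [&]() {
            if (fragment.empty()) return;
            std::array<unsigned char, SHA_DIGEST_LENGTH> digest{};
            SHA1(fragment.data(), fragment.size(), digest.data());
            total += fragment.size();
            if (seen.insert(digest).second) unique += fragment.size();  // only new fragments stored
            fragment.clear();
        };

        int c;
        while ((c = std::fgetc(f)) != EOF) {  // real code reads in blocks, not byte by byte
            fragment.push_back((unsigned char)c);
            rolling = rolling * 31 + (unsigned char)c;
            // cut when the rolling hash hits a pattern, or the fragment grows too large
            if (((rolling & 0xFFFF) == 0 && fragment.size() >= 4 * 1024) ||
                fragment.size() >= 64 * 1024) { flush(); rolling = 0; }
        }
        flush();
        std::fclose(f);
        std::printf("read %llu bytes, %llu bytes in unique fragments\n",
                    (unsigned long long)total, (unsigned long long)unique);
        return 0;
    }

Even in this toy version every byte must be read and hashed once, so the wall-clock time is dominated by read bandwidth and single-threaded hash speed, not by the dedup lookup itself.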

Check your (maximum) SHA-1 speed with the b (benchmark) command and -sha1:

zpaqfranz b -sha1

On my PC it is more than 900MB/s, faster than SSD bandwidth (though not NVMe).
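For reference, what such a benchmark measures is essentially a timed hashing loop over a buffer that is already in RAM; a minimal sketch under the assumption that OpenSSL is available (again, not the actual zpaqfranz code):

    // Sketch of what a hash benchmark measures: hash a RAM-resident buffer in a timed loop.
    // Assumptions: OpenSSL for SHA-1; chunk size and time limit mimic the "b" command defaults.
    // Build: g++ -O2 sha1_bench.cpp -lcrypto
    #include <openssl/sha.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t chunk = 400 * 1024;              // ~390 KB chunks
        std::vector<unsigned char> buf(chunk, 0xAB);  // dummy data, already in RAM
        unsigned char digest[SHA_DIGEST_LENGTH];

        auto start = std::chrono::steady_clock::now();
        double seconds = 0;
        unsigned long long done = 0;
        while (seconds < 5.0) {                       // 5-second time limit
            SHA1(buf.data(), buf.size(), digest);
            done += buf.size();
            seconds = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - start).count();
        }
        std::printf("SHA-1: %.2f MB/s (done %.2f GB)\n", done / seconds / 1e6, done / 1e9);
        return 0;
    }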

mirogeorg commented 3 years ago

My setup is an AMD 5950X, 128GB RAM, 980 PRO NVMe. I copied one 221MB file 160 times; the sum of all files is ~33GB.

All files are cached in RAM, so there is no read access during add. The CPU is constantly busy at 5% at 4.60-4.70GHz, HDD at 0%.

  1. Created archive - 168.44 MB/sec
  2. Added again, with -force - 156.49 MB/sec
  3. Ran benchmarks - output is at the bottom
  4. Ran zpaqfranz test - ~3.2GB/s
  5. Ran zpaqfranz x -all -force - 756MB/s, SSD up to 50%, CPU up to 70%

Here is some output:

D:\@@@>zpaqfranz a \bb -threads 32
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Integrity check type: XXHASH64+CRC-32 + CRC-32 by fragments
/bb.zpaq: 1 versions, 8 files, 2.924 fragments, 19.012.625 bytes (18.13 MB)
Updating /bb.zpaq at offset 19.012.625 + 0
Adding 33.912.015.680 (31.58 GB) in 160 files at 2021-09-21 13:44:20
6.25% 00:03:05 ( 1.97 GB) of ( 31.58 GB) 168.44 MB/sec

D:\@@@>zpaqfranz a \bb -threads 32 -all
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Integrity check type: XXHASH64+CRC-32 + CRC-32 by fragments
/bb.zpaq: 1 versions, 8 files, 2.924 fragments, 19.012.625 bytes (18.13 MB)
Updating /bb.zpaq at offset 19.012.625 + 0
Adding 35.607.616.464 (33.16 GB) in 168 files at 2021-09-21 13:44:37
100.00% 00:00:00 ( 33.16 GB) of ( 33.16 GB) 156.49 MB/sec
176 +added, 9 -removed.

19.012.625 + (35.607.616.464 -> 0 -> 1.022.929) = 20.035.554

222.156 seconds (000:03:42) (all OK)


D:\@@@>zpaqfranz a \bb * -threads 32 -force
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Integrity check type: XXHASH64+CRC-32 + CRC-32 by fragments
/bb.zpaq: 2 versions, 184 files, 2.924 fragments, 20.035.554 bytes (19.11 MB)
Updating /bb.zpaq at offset 20.035.554 + 0
Adding 35.607.616.464 (33.16 GB) in 168 files at 2021-09-21 13:51:54
100.00% 00:00:00 ( 33.16 GB) of ( 33.16 GB) 156.49 MB/sec
176 +added, 186 -removed.

20.035.554 + (35.607.616.464 -> 0 -> 1.035.636) = 21.071.190

221.937 seconds (000:03:41) (all OK)


D:\@@@>zpaqfranz b
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
Benchmarks: XXHASH64 XXH3 SHA-1 SHA-256 BLAKE3 CRC-32 CRC-32C WYHASH WHIRLPOOL MD5 SHA-3
Time limit 5 s (-n X)
Chunks of 390.62 KB (-minsize Y)

00000005 s XXHASH64: speed ( 5.93 GB/s)
00000005 s XXH3: speed ( 6.69 GB/s)
00000005 s SHA-1: speed ( 898.74 MB/s)
00000005 s SHA-256: speed ( 223.31 MB/s)
CPU feature 001F
00000005 s BLAKE3: speed ( 3.47 GB/s)
00000005 s CRC-32: speed ( 8.87 GB/s)
00000005 s CRC-32C: speed ( 7.05 GB/s)
00000005 s WYHASH: speed ( 8.44 GB/s)
00000005 s WHIRLPOOL: speed ( 182.65 MB/s)
00000005 s MD5: speed ( 822.45 MB/s)
00000005 s SHA-3: speed ( 435.18 MB/s)
Results

WHIRLPOOL: 182.65 MB/s (done 913.24 MB)
SHA-256: 223.31 MB/s (done 1.09 GB)
SHA-3: 435.18 MB/s (done 2.12 GB)
MD5: 822.45 MB/s (done 4.02 GB)
SHA-1: 898.74 MB/s (done 4.39 GB)
BLAKE3: 3.47 GB/s (done 17.38 GB)
XXHASH64: 5.93 GB/s (done 29.57 GB)
XXH3: 6.69 GB/s (done 33.35 GB)
CRC-32C: 7.05 GB/s (done 35.13 GB)
WYHASH: 8.44 GB/s (done 42.08 GB)
CRC-32: 8.87 GB/s (done 44.20 GB)

55.031 seconds (000:00:55) (all OK)


D:\@@@>zpaqfranz x \aa -all -force
zpaqfranz v54.6-experimental (HW BLAKE3), SFX64 v52.15, compiled Sep 18 2021
/aa.zpaq: 2 versions, 81.308 files, 291.801 fragments, 2.254.200.954 bytes (2.10 GB)
Non-latin (UTF-8) 81
Extracting 49.202.038.464 bytes (45.82 GB) in 81.294 files -threads 32
98.41% 00:00:00 ( 45.09 GB) of ( 45.82 GB) 756.97 MB/sec
74.422 seconds (000:01:14) (all OK)

fcorbelli commented 3 years ago

I have pretty much the same configuration, and as you can see SHA-1 runs at about 900MB/s. zpaq deduplicates in a single thread (I am working on making it multithreaded, but it is not so easy): it reads the whole file, 4K at a time, then calculates SHA-1 and does a lot of other things. The problem is that you need to read the entire file from the media. If the file is huge (e.g. a vmdk) you will get a maximum speed of about 900MB/s (the SHA-1 limit) in the deduplication stage. If you use a spinning drive (or SSD) there is no bottleneck in this case (about 150MB/s from disk, 500MB/s from SSD).

If the files are small (say thousands of .DOC files) then multithreading can help. You can check yourself, try

zpaqfranz sum d:\something -sha1 -summary

and

zpaqfranz sum d:\something -sha1 -summary -all

So yes, a multithreaded deduplicator will do better on small files with a fast NVMe and many CPU cores, but, in fact, not much better in total time.
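The small-files case from the two sum commands above is easy to parallelize because every file is an independent unit of work; a sketch of that pattern, assuming OpenSSL and one task per file (hypothetical code, not zpaqfranz's actual hasher):

    // Sketch of multithreaded per-file hashing (the small-files case): a shared atomic index
    // hands one file at a time to each worker thread. Assumptions: OpenSSL for SHA-1,
    // files passed on the command line.
    // Build: g++ -O2 -pthread parallel_sum.cpp -lcrypto
    #include <openssl/sha.h>
    #include <algorithm>
    #include <atomic>
    #include <cstdio>
    #include <string>
    #include <thread>
    #include <vector>

    static void sha1_file(const std::string& path) {
        std::FILE* f = std::fopen(path.c_str(), "rb");
        if (!f) { std::perror(path.c_str()); return; }
        SHA_CTX ctx;
        SHA1_Init(&ctx);
        unsigned char buf[1 << 16];                   // 64K read buffer
        size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
            SHA1_Update(&ctx, buf, n);
        std::fclose(f);
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1_Final(digest, &ctx);
        char hex[2 * SHA_DIGEST_LENGTH + 1] = {0};
        for (int i = 0; i < SHA_DIGEST_LENGTH; ++i)
            std::snprintf(hex + 2 * i, 3, "%02x", digest[i]);
        std::printf("%s  %s\n", hex, path.c_str());
    }

    int main(int argc, char** argv) {
        std::vector<std::string> files(argv + 1, argv + argc);   // files to hash
        std::atomic<size_t> next{0};
        unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&] {                   // each worker grabs the next unhashed file
                for (size_t i; (i = next.fetch_add(1)) < files.size();)
                    sha1_file(files[i]);
            });
        for (auto& th : pool) th.join();
        return 0;
    }

With many small files this keeps every core busy, but for one huge file there is only one stream to read and hash, which is exactly the single-thread limit described above.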

Speeding things up requires a lot of work, not only a faster deduplicator.

fcorbelli commented 3 years ago

About t (test): there are two stages. In the first (as in 7.15) the check is done against the stored SHA-1. In the second (zpaqfranz-specific) a CRC-32 (much faster) runs to detect SHA-1 collisions. If you test against a directory you will get the maximum speed of SHA-1 (you are in fact re-reading variable-sized chunks from disk, calculating SHA-1 and comparing with the stored value). This is in fact fast, very fast, for slow media.
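Conceptually the two stages combine a strong per-fragment check with a cheap, independent whole-file checksum; a simplified sketch assuming OpenSSL and zlib, with a hypothetical stored layout (not the real zpaqfranz data structures):

    // Sketch of the two-stage idea: stage 1 re-hashes every fragment with SHA-1 (as 7.15 does),
    // stage 2 checks an independent CRC-32 of the reconstructed stream to catch SHA-1 collisions.
    // Build: g++ -O2 two_stage_check.cpp -lcrypto -lz
    #include <openssl/sha.h>
    #include <zlib.h>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    struct Fragment {
        std::vector<unsigned char> data;          // fragment bytes
        unsigned char sha1[SHA_DIGEST_LENGTH];    // hash stored at add time
    };

    bool two_stage_check(const std::vector<Fragment>& frags, uLong stored_crc) {
        uLong crc = crc32(0L, Z_NULL, 0);
        for (const Fragment& fr : frags) {
            unsigned char digest[SHA_DIGEST_LENGTH];
            SHA1(fr.data.data(), fr.data.size(), digest);
            if (std::memcmp(digest, fr.sha1, SHA_DIGEST_LENGTH) != 0)
                return false;                                   // stage 1: fragment hash mismatch
            crc = crc32(crc, fr.data.data(), fr.data.size());   // accumulate whole-file CRC-32
        }
        return crc == stored_crc;                               // stage 2: collision/corruption catch
    }

    int main() {
        Fragment fr;
        fr.data = {'h', 'e', 'l', 'l', 'o'};
        SHA1(fr.data.data(), fr.data.size(), fr.sha1);          // pretend these were stored at add time
        uLong stored = crc32(crc32(0L, Z_NULL, 0), fr.data.data(), fr.data.size());
        std::printf("check: %s\n", two_stage_check({fr}, stored) ? "ok" : "FAILED");
        return 0;
    }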

For a much faster hash (e.g. xxhash64 or XXH3) the v (verify) command runs much faster, but it is a check against the filesystem and not an archive integrity check (you need the original files online).

zpaqfranz a z:\1.zpaq c:\dropbox\dropbox

will create z:\1.zpaq with xxhash64 (the default in zpaqfranz)

running

zpaqfranz v z:\1.zpaq

will run a single-threaded xxhash64 verify against the filesystem

Note: if you are paranoid you can do

zpaqfranz a z:\1.zpaq c:\dropbox\dropbox -sha3

or -sha2, or blake3, or whatever

OK, then

zpaqfranz t z:\1.zpaq

will do an archive integrity check (as said, two stages: the first as in 7.15, the second for collisions), but

zpaqfranz t z:\1.zpaq c:\dropbox\dropbox

will invoke the SHA-1 chunked verify against the filesystem (something similar to 7.15).

If you are really paranoid

zpaqfranz p z:\1.zpaq

and more

zpaqfranz p z:\1.zpaq -verify

mirogeorg commented 3 years ago

Thank you. Now I understand the problem better. The lack of multithreaded reading is of course limiting too, but most other utilities have the same limitation, so it's not so obvious. It's not a ZPAQ-specific problem...

Slow adding seems the most limiting for me recently. Your idea of multithreaded reading from disk is cool too, but it seems complex to implement.

fcorbelli commented 3 years ago

Please check (from Task Manager) whether the "slow adding" is caused by reading. If zpaq/zpaqfranz, during the add of something big (a .vmdk etc.), reads constantly at (for example) 400MB/s, then it is mainly a media-bandwidth limitation (no cache here). If it reads at 20MB/s (just an example), something weird is going on.

Adding requires re-reading everything from the filesystem, hashing, and then "doing the rest". With vmdks, for example, it is rather normal to get 1 hour of about nothing (read... read... read...) and maybe 5 minutes of writing to the archive.

I'll add a timer for the dedup stage, with something like "starting dedup stage"... "ended dedup in 2000 s, let's do something..."
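Such a per-stage timer is trivial to add; a minimal sketch of the idea (the stage function is a hypothetical stand-in, not the actual zpaqfranz code):

    // Sketch: a wall-clock timer around a single stage, printed like a progress message.
    #include <chrono>
    #include <cstdio>

    static void dedup_stage() { /* read, fragment, hash, match against the index... */ }

    int main() {
        std::puts("starting dedup stage...");
        auto t0 = std::chrono::steady_clock::now();
        dedup_stage();
        double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        std::printf("ended dedup in %.0f s, let's do something...\n", secs);
        return 0;
    }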

mirogeorg commented 3 years ago

I verified this; the problem is not slow reading and/or writing. The 10TB of source data is stored on 3x 870 QVO 8TB (RAID0). During add/reread they sit mostly idle; their cumulative read speed is well above 1GB/s, up to 1.4GB/s. The destination media are HDDs that can absorb writes at about 350MB/s; they are busy up to 20% during the initial add and mostly idle afterwards. In my case, on a Core i7-6700K, the add rate is about 105-110MB/s. To add 10TB in 24h the cumulative speed needs to be at least ~120MB/s (10TB / 86,400s ≈ 116MB/s).

The initial add takes about 35 hours and the CPU is busy. On the 2nd add, 3rd add and so on the CPU sits mostly idle during deduplication; only 1 core is used, because most of the data is already in the archive.

fcorbelli commented 3 years ago

Using decent Xeon machines with about 1GB/s of real bandwidth I get ~500GB/hour for .vmdk updating, so ~10TB per day, or ~110MB/s sustained.

To get more speed I run more than one process in parallel (NVMe drive on zfs, so no latency problems with concurrent access), for different virtual machines, each capped with -t2 (no more than 2 threads each), so I can run 2 to 8 updates (on different hardware).
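For example, something along these lines (the paths are purely illustrative), with each line launched in its own console/session at the same time:

zpaqfranz a h:\backup\vm1.zpaq d:\vm\vm1 -t2

zpaqfranz a h:\backup\vm2.zpaq d:\vm\vm2 -t2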

As stated, I will embed some (optional) profiling to see where the software spends its time.

mirogeorg commented 3 years ago

OK, I will test this workaround: more than one process in parallel for different virtual machines, each capped by -t2.

fcorbelli commented 3 years ago

Running a multithreaded deduplicator would degrade the dedup ratio, so for now I'm closing this.