giovariot opened this issue 2 years ago
As a stopgap, you could start using `--long=31`. This will considerably increase the match finder's range (and the memory requirement) and should narrow the gap with `lrzip`.
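In case it helps, a minimal invocation could look like this (a sketch only; `disk.img` is a placeholder file name, and actual memory use grows with the window size):

```sh
# Compress with long-distance matching and a 2 GiB window (windowLog = 31).
zstd -T0 -19 --long=31 disk.img -o disk.img.zst

# Frames with a window larger than 128 MiB need the same window hint
# (or --memory=2048MB) at decompression time, otherwise zstd refuses.
zstd -d --long=31 disk.img.zst -o disk.img.out
```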
Beyond that, we know that the current implementation is limited (though not the format; it's purely an implementation issue). Going beyond that limitation is substantial work though. In a world of infinite resources, this could be taken care of. But that's not the world we live in; we have to balance this effort against other ongoing efforts. So far, it hasn't reached the top of the list, so it's a matter of priorities. But we are aware of the situation; this item is part of a list of long-term goals, and one of these days, it will be its turn.
@giovariot Have you tried the `--long=31` option? What are the results compared to default `--long` and `lrzip`?
The compression is a bit better than simply using `--long`, but not as good as lrzip's.
I can try timing the whole thing again, but I don't have much free time lately, so I'm not sure how long it will be until I have some news.
I'm seeing similar results: `lrzip` is overall a bit “better”, with a higher compression ratio and faster compression. `--long=31` helps a little, but it's not enough. Here's a real-world example (a <30 MB archive that holds >1 GB worth of data): https://whateverable.6lang.org/0dc6efd6314bcc27c9b8351453d2ec8ca10196df
And some rough measurements for that data:

| Command | Compression time | File size | Decompression time |
|---|---|---|---|
| `lrzip -q -L 9` | ≈26s | ≈8M | ≈5.3s |
| `zstd -19 --long=31` | ≈97s | ≈12M | ≈0.8s |
| `zstd --long=31` | ≈3.7s | ≈16M | ≈0.8s |
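For anyone who wants comparable numbers on their own data, the measurements above boil down to commands along these lines (a rough sketch; `data.tar` stands in for the archive linked above):

```sh
# Compression timings.
time lrzip -q -L 9 data.tar                           # -> data.tar.lrz
time zstd -19 --long=31 -o data-19.tar.zst data.tar
time zstd --long=31 -o data-default.tar.zst data.tar

# Decompression timings; --long=31 frames need the window hint again.
time lrzip -d -f data.tar.lrz
time zstd -d -f --long=31 data-19.tar.zst -o /dev/null
```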
The problem is that these days, with huge hard disks being so cheap and so common, it is not unusual to forget about files you have already saved and copy them again into another folder, or to try to reorganize a lifetime of files in different ways.

As an example, take a 1 TB SSD that I backed up for a friend of mine who had inadvertently formatted it. It had two partitions: one where they kept files well organized in clearly named folders and files, and a second one used as "messy" storage, full of old backups of largely the same data, which they used as a source while organizing the first partition.
I've tried ckolivas/lrzip, which implements this kind of unlimited-range redundancy removal through its `-U` option: there's a noticeable difference between `zstd` and `lrzip` when compressing big disk-image backups. Of course `zstd` is faster, but removing redundancies with ckolivas/lrzip usually produces a smaller file in the end. For example, on a 127 GB disk image with 3 different partitions containing 3 different Windows versions, with free space zeroed out:

- `lrzip -U -l -p1` resulted in a 28.94 GB file
- `zstd -T0 --long -19` resulted in a 33.66 GB file

I then tried its `-U -n` option (redundancy removal only, compression disabled) on the 1000.2 GB disk image I mentioned above as an example: it resulted in a 732.4 GB file. Just adding lrzip's simple LZO compression with `-U -l` resulted in a 695 GB file. Running zstd on the original image took many hours (~40) and only got it down to an 867.6 GB file.

I know I could simply use `lrzip -U -n` to remove the redundancies and then compress the output file with `zstd`, but I think this could be a very useful feature for zstd itself: I'm pretty sure the combination of fast speed and high compression ratio makes it a very common choice for compressing disk images.

So I'd find it very useful to implement some sort of long-range redundancy removal process.
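For completeness, the two-step workaround described above (lrzip's rzip stage for deduplication, then zstd) would look roughly like this. It is only a sketch; `disk.img` and the output names are placeholders:

```sh
# 1. Long-range deduplication only: -U = unlimited window, -n = no backend compression.
lrzip -U -n -o disk.img.lrz disk.img

# 2. Compress the deduplicated stream with zstd.
zstd -T0 -19 --long -o disk.img.lrz.zst disk.img.lrz

# To restore: undo zstd first, then let lrzip reverse the rzip stage.
zstd -d -f -o disk.img.lrz disk.img.lrz.zst
lrzip -d -f -o disk.img disk.img.lrz
```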
I don't really know whether any of this is actually feasible inside the current zstd framework, but I really think some sort of long-range redundancy removal should be part of it, considering its possible uses in commercial environments as well (virtual-machine image compression, for example, is an area where this feature would surely be beneficial).
Thanks in advance, thanks for the great software, and sorry for my pretty bad English; it's not my mother tongue.