ckolivas / lrzip

Long Range Zip
http://lrzip.kolivas.org
GNU General Public License v2.0

Manually set "incompressible data" threshold #207

Closed lr4d closed 2 years ago

lr4d commented 3 years ago

When working with tarballs of media files and PDFs, lrzip sometimes gives me a fast 15% compression by compressing just 1 of 9 blocks, e.g. this truncated output of lrzip -i -vv ...:

Block   Comp    Percent Size
1   none    100.0%  10485760 / 10485760 Offset: 4589643704  Head: 10485799
2   none    100.0%  10485760 / 10485760 Offset: 4600129477  Head: 20971572
3   lzma    77.7%   8152171 / 10485760  Offset: 4610615250  Head: 29123756
4   none    100.0%  10485760 / 10485760 Offset: 4618767434  Head: 39609529
5   none    100.0%  10485760 / 10485760 Offset: 4629253207  Head: 50095302
6   none    100.0%  10485760 / 10485760 Offset: 4639738980  Head: 60581075
7   none    100.0%  10485760 / 10485760 Offset: 4650224753  Head: 71066848
8   none    100.0%  10485760 / 10485760 Offset: 4660710526  Head: 81567392
9   none    100.0%  4860321 / 4860321   Offset: 4671211070  Head: 0

Other times, it takes a lot longer and struggles a lot more to compress:

Block   Comp    Percent Size
1   none    100.0%  49603243 / 49603243 Offset: 5173302056  Head: 49603282
2   lzma    99.2%   49182901 / 49603243 Offset: 5222905312  Head: 98786196
3   lzma    99.3%   49236783 / 49603243 Offset: 5272088226  Head: 148022992
4   lzma    99.5%   49339373 / 49603243 Offset: 5321325022  Head: 197362378
5   lzma    99.0%   49131243 / 49603243 Offset: 5370664408  Head: 246493634
6   lzma    99.3%   49263853 / 49603243 Offset: 5419795664  Head: 295757500
7   lzma    98.5%   48863058 / 49603243 Offset: 5469059530  Head: 344620571
8   lzma    98.7%   48981972 / 49603243 Offset: 5517922601  Head: 393602556
9   lzma    99.4%   49286839 / 49603243 Offset: 5566904586  Head: 442889408
10  lzma    99.4%   49310169 / 49603243 Offset: 5616191438  Head: 492199590
11  lzma    99.4%   49295202 / 49603243 Offset: 5665501620  Head: 541494805
12  lzma    99.2%   49216341 / 49603243 Offset: 5714796835  Head: 590711159
13  lzma    99.4%   49310508 / 49603243 Offset: 5764013189  Head: 640021680
14  lzma    99.4%   49310012 / 49603243 Offset: 5813323710  Head: 689331705
15  lzma    99.3%   49260783 / 49603243 Offset: 5862633735  Head: 738592501
16  lzma    99.4%   49304015 / 49603243 Offset: 5911894531  Head: 787896529
17  lzma    99.3%   49272993 / 49603243 Offset: 5961198559  Head: 837169535
18  lzma    99.4%   49321970 / 49603243 Offset: 6010471565  Head: 886491518
19  lzma    99.4%   49315822 / 49603243 Offset: 6059793548  Head: 935807353
20  lzma    99.4%   49288914 / 49603243 Offset: 6109109383  Head: 985273666
21  lzma    90.2%   18692941 / 20716811 Offset: 6158575696  Head: 0

In the latter case, I'd much rather lrzip not compress any blocks when the lz4 test predicts a block would still be around 95% or more of its original size, so that it runs faster.

Using lrzip --level=1 doesn't seem to make any difference in this regard.

Is it feasible to make the threshold that lrzip uses to decide when data is incompressible a CLI parameter, so I can set it manually?

pete4abw commented 3 years ago

The -T | --threshold option was designed to take an optional argument to limit threshold testing to N%. Somehow that feature did not make it into lrzip. -T alone (an argument is not tested for) disables threshold testing entirely; it does not limit it. The feature is implemented in lrzip-next. -T95, for example, would test the lz4 compression against 95% and, if the result is above that, would not compress that block. It is good practice to use it in general: the time saved is worth more than the compression benefit. See this wiki article on it.
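The threshold test described above can be sketched roughly as follows. This is not lrzip-next's code: `should_compress` and `threshold_pct` are hypothetical names, and zlib at level 1 stands in for lz4 only because it ships with Python.

```python
import os
import zlib


def should_compress(block: bytes, threshold_pct: float = 95.0) -> bool:
    """Return True if a fast trial compression lands at or below the threshold.

    Hypothetical helper: zlib level 1 is a stand-in for the lz4 test.
    """
    if not block:
        return False
    trial = zlib.compress(block, 1)
    # Compressed size as a percentage of the original, like lrzip's "Percent" column.
    ratio_pct = 100.0 * len(trial) / len(block)
    # Above the threshold, the block is treated as incompressible and stored as-is.
    return ratio_pct <= threshold_pct


if __name__ == "__main__":
    text_like = b"long range zip " * 4096
    random_like = os.urandom(64 * 1024)
    print("repetitive block:", should_compress(text_like))
    print("random block:    ", should_compress(random_like))
```

With a 95% threshold, blocks like the 99.x% ones in the second listing would be skipped under this scheme, assuming the fast test predicts the final result reasonably well.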

ckolivas commented 3 years ago

I removed the optional percentage a while ago, and you're the first person to request that it be implemented. The way it works now, however, it aborts far too early to have any idea what the percentage will be by the end of the block. Its point is to avoid compressing incompressible blocks entirely, and your request is a fairly unusual use case. It could be extended to do what you ask, but I'm not currently implementing new features.
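A small sketch of why an estimate taken early in a block may say little about the final percentage. It is an illustration only and does not reflect how lrzip actually samples data: `ratio_pct`, the 4 KiB prefix size, and zlib level 1 (standing in for lz4) are all assumptions.

```python
import os
import zlib


def ratio_pct(data: bytes) -> float:
    """Compressed size as a percentage of the original size (zlib level 1)."""
    return 100.0 * len(zlib.compress(data, 1)) / len(data)


# A block that starts with incompressible bytes and continues with repetitive ones.
prefix = os.urandom(4096)
block = prefix + b"A" * (1024 * 1024)

early_estimate = ratio_pct(block[:4096])  # looks only at the first 4 KiB
whole_block = ratio_pct(block)            # looks at the entire block

print(f"early estimate: {early_estimate:.1f}%")
print(f"whole block:    {whole_block:.1f}%")
```

The early estimate sees only random bytes and reports the block as incompressible, while the whole block compresses very well. A percentage threshold is only as reliable as the portion of the block it has examined.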

ckolivas commented 2 years ago

I've decided this isn't worth implementing, apologies.