facebook / zstd

Zstandard - Fast real-time compression algorithm
http://www.zstd.net

Documentation: better explanations on compressor behaviour, compression levels and parameters are welcome #3698

Closed zougloub closed 7 months ago

zougloub commented 1 year ago

When my kids need to compress stuff, I will of course tell them to use zstd, and I will probably tell them to RTFM. So I wanted to double-check the zstd man page to make sure the documentation is straightforward, and I have some comments. Below I sometimes use the word "should", but keep in mind these are all suggestions.

In the zstd executable man page, the introductory DESCRIPTION is missing some high-level information:

Later the man page first mentions compression levels:

   Operation Modifiers
       ○   -#: selects # compression level [1-19] (default: 3)

One may find this a bit terse. I mean, there's not even a terminating period on this line. And this bullet point in the manual, I think, should be augmented with a concise sentence or paragraph, mentioning:

Then, later, the manual has ADVANCED COMPRESSION OPTIONS, which currently says:

ADVANCED COMPRESSION OPTIONS
       ###  -B#: Specify the size of each compression job. This parameter is only available when multi-threading is enabled. Each compression job is run in parallel, so this value indirectly impacts the nb of active threads. Default job size varies depending on compression level (generally 4 * windowSize). -B# makes it possible to manually select a custom size. Note that job size must respect a minimum value which is enforced transparently. This minimum is either 512 KB, or overlapSize, whichever is largest. Different job sizes will lead to non-identical compressed frames.

There must be a typo here, and I think that:

level  compressed size  compression time
    0           329820             0.073
    1          2880396             0.032
    2           727942             0.058
    3           329820             0.074
    4           273090             0.076
    5           339937             0.063
    6           307160             0.147
    7           317692             0.180
    8           523833             0.248
    9           476452             0.242
   10           444441             0.373
   11           460774             0.541
   12           496534             0.608
   13           463175             0.336
   14           463175             0.378
   15           513365             0.414
   16           507389             1.133
   17           428186             1.921
   18           339902             2.762
   19           442419             3.857
   20           442419             4.141
   21           442419             4.026
   22           442409             5.560

We can see that, as the compression level is raised, the compression time may decrease and/or the compression ratio may decrease.
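The non-monotonic behaviour can be checked directly from the table above. The following is a minimal Python sketch, not part of the original report: the helper names `size_regressions` and `best_level` are made up for illustration, and level 0 is left out because its compressed size in the table is identical to level 3.

```python
# (level, compressed size, compression time) rows copied from the table above.
# Level 0 is omitted: its size in the table matches level 3.
ROWS = [
    (1, 2880396, 0.032), (2, 727942, 0.058), (3, 329820, 0.074),
    (4, 273090, 0.076), (5, 339937, 0.063), (6, 307160, 0.147),
    (7, 317692, 0.180), (8, 523833, 0.248), (9, 476452, 0.242),
    (10, 444441, 0.373), (11, 460774, 0.541), (12, 496534, 0.608),
    (13, 463175, 0.336), (14, 463175, 0.378), (15, 513365, 0.414),
    (16, 507389, 1.133), (17, 428186, 1.921), (18, 339902, 2.762),
    (19, 442419, 3.857), (20, 442419, 4.141), (21, 442419, 4.026),
    (22, 442409, 5.560),
]


def size_regressions(rows):
    """Return the levels whose output is strictly larger than the previous level's."""
    worse = []
    for (_, prev_size, _), (level, size, _) in zip(rows, rows[1:]):
        if size > prev_size:
            worse.append(level)
    return worse


def best_level(rows):
    """Return the level with the smallest compressed size (lowest level wins ties)."""
    return min(rows, key=lambda row: (row[1], row[0]))[0]


if __name__ == "__main__":
    print("levels that compress worse than the level below:", size_regressions(ROWS))
    print("level with the smallest output:", best_level(ROWS))
```

On a well-behaved input one would expect the first list to be empty or nearly so; for this sample it is not.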

The DICTIONARY BUILDER and BENCHMARK sections should be moved after the compression options.

The BENCHMARK section should feature an introductory statement, such as "the zstd CLI provides a benchmarking mode that can be used to easily find suitable compression parameters, or alternatively to benchmark a computer's performance". It could also be useful to state that benchmarking to find compression options should be performed on a representative data set.
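As an illustration of what such an introduction could point to, here is a hedged Python sketch that builds a benchmark invocation over a range of levels. It relies on the `-b#` and `-e#` options of the zstd CLI, and adds `--ultra` for levels above 19; the function names and the file name `sample.txt` are hypothetical, and the sketch only runs the command if a `zstd` binary is found on the PATH.

```python
import shutil
import subprocess


def build_benchmark_command(path, first_level=1, last_level=19):
    """Build a zstd CLI benchmark command covering first_level..last_level."""
    if not 1 <= first_level <= last_level <= 22:
        raise ValueError("levels must satisfy 1 <= first <= last <= 22")
    cmd = ["zstd", f"-b{first_level}", f"-e{last_level}"]
    if last_level > 19:
        # Levels 20 and above are gated behind --ultra in the CLI.
        cmd.append("--ultra")
    cmd.append(path)
    return cmd


def run_benchmark(path, first_level=1, last_level=19):
    """Run the benchmark and return its text output, or None if zstd is not installed."""
    if shutil.which("zstd") is None:
        return None
    result = subprocess.run(
        build_benchmark_command(path, first_level, last_level),
        capture_output=True,
        text=True,
        check=True,
    )
    # Collect both streams so the report is captured whichever one is used.
    return result.stdout + result.stderr


if __name__ == "__main__":
    # "sample.txt" stands in for a file that is representative of the real data.
    print(" ".join(build_benchmark_command("sample.txt", 1, 22)))
```

The point about representative data applies here: the file passed in should resemble what will actually be compressed, otherwise the level chosen from the results may not carry over.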

SEE ALSO should point to the zstd manual, which should be installed with zstd, and maybe to the website, since later some other websites are mentioned.

Chaython commented 1 year ago

In your test it seems level 6 is most compressed. I wonder how often that is reproducible? How many passes did you run? Just one? I wonder if it varies more based on the content of the archive.

Cyan4973 commented 7 months ago

For reference, here is the current performance on the suggested sample:

 1#issue3698.txt     :   7277497 ->   2880400 (x2.527),  
 2#issue3698.txt     :   7277497 ->    718164 (x10.13),  
 3#issue3698.txt     :   7277497 ->    329820 (x22.07),  
 4#issue3698.txt     :   7277497 ->    273090 (x26.65),  
 5#issue3698.txt     :   7277497 ->    339937 (x21.41),  
 6#issue3698.txt     :   7277497 ->    307160 (x23.69),  
 7#issue3698.txt     :   7277497 ->    317692 (x22.91),   
 8#issue3698.txt     :   7277497 ->    523833 (x13.89),   
 9#issue3698.txt     :   7277497 ->    476452 (x15.27),   
10#issue3698.txt     :   7277497 ->    444441 (x16.37),   
11#issue3698.txt     :   7277497 ->    460774 (x15.79),   
12#issue3698.txt     :   7277497 ->    496534 (x14.66),  
13#issue3698.txt     :   7277497 ->    463175 (x15.71),   
14#issue3698.txt     :   7277497 ->    463175 (x15.71),   
15#issue3698.txt     :   7277497 ->    513365 (x14.18),   
16#issue3698.txt     :   7277497 ->    507389 (x14.34),   
17#issue3698.txt     :   7277497 ->    450909 (x16.14),   
18#issue3698.txt     :   7277497 ->    212426 (x34.26),  
19#issue3698.txt     :   7277497 ->    219162 (x33.21),   
20#issue3698.txt     :   7277497 ->    219162 (x33.21),   
21#issue3698.txt     :   7277497 ->    219162 (x33.21),   
22#issue3698.txt     :   7277497 ->    210478 (x34.58),   

Compression performance is still all over the place across most of the range, with notably fast level 4 offering incredibly good performance, but at least higher compression levels (18+) now perform best, instead of worse.
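The ratios in parentheses are the original size divided by the compressed size. The small Python sketch below uses only figures quoted in this thread to compare the high levels before and after; the helper names are hypothetical, and it assumes the earlier table was measured on the same sample.

```python
ORIGINAL_SIZE = 7277497  # bytes, from the benchmark output above

# level -> (compressed size in the original report, compressed size in the output above)
HIGH_LEVELS = {
    17: (428186, 450909),
    18: (339902, 212426),
    19: (442419, 219162),
    20: (442419, 219162),
    21: (442419, 219162),
    22: (442409, 210478),
}


def ratio(original, compressed):
    """Compression ratio as shown in the benchmark output (original / compressed)."""
    return original / compressed


def size_change_percent(before, after):
    """Relative change in compressed size; negative means the output got smaller."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    for level, (before, after) in HIGH_LEVELS.items():
        print(
            f"{level:2d}: x{ratio(ORIGINAL_SIZE, after):.2f}, "
            f"size change {size_change_percent(before, after):+.1f}%"
        )
```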

Cyan4973 commented 7 months ago

These are great recommendations, @zougloub!

They have been employed to refactor the documentation at https://github.com/facebook/zstd/pull/3958 .

Cyan4973 commented 7 months ago

documentation updated