PackOrganization / Pack

Pack
https://pack.ac
Apache License 2.0

Compare with .tar.zstd #4

Open DavidBuchanan314 opened 6 months ago

DavidBuchanan314 commented 6 months ago

The comparison on the Notes page of the website does not mention .tar.zstd, which I imagine would be a close competitor.

Similarly, zip optionally supports zstd, although support for it isn't particularly widespread yet.

It would also be good to know which test files were used, so the benchmark can be independently replicated.

OttoCoddo commented 6 months ago

Here is the test subject: https://github.com/torvalds/linux. Download it (without the .git folder). At the time of writing it had more than 81K files, 5K folders, and was nearly 1.25 GB in size.

As requested, here are some numbers on tar.zst of Linux source code (the test subject in the note)

Slightly smaller size, and more than 4X faster. (Again, that is on my machine; you need to try it for yourself.) Honestly, ZSTD is great. tar is slowing it down (because of its old, single-threaded design), and the work is done in two steps: first creating the tar, then compressing it. Pack does all the steps (read, check, compress, and write) together, and this weaving is what achieves both the speed and the random access.
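For comparison, the two-step tar flow can at least be collapsed into a pipeline, though archiving and compression still run as separate sequential processes rather than being interleaved the way Pack does it. A minimal sketch (using gzip as a stand-in so it runs anywhere; swap in `zstd -T0` for the setup discussed here; the `demo/` tree is hypothetical):

```shell
# Create a small stand-in tree for the kernel sources.
mkdir -p demo/src && echo 'int main(void){return 0;}' > demo/src/main.c

# Two explicit steps: write an uncompressed tar, then compress it.
tar cf demo.tar demo/
gzip -kf demo.tar                     # replace with: zstd -T0 demo.tar

# One pipeline: tar streams into the compressor, no intermediate file.
tar cf - demo/ | gzip > demo-piped.tar.gz
```

Either way, the compressor only sees one sequential byte stream, which is why a single compressed tar cannot offer random access.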

Tokariew commented 6 months ago

You can create a compressed tar in one command with tar caf directory.tar.zst directory; the a switch makes tar detect the compression algorithm based on the extension.

OttoCoddo commented 6 months ago

I will be happy to have numbers for tar.gz, tar.zst and Pack on your particular machine and file system.

I do not expect an order of magnitude difference between tar.zst and Pack; after all, Pack uses Zstandard. What makes Pack fundamentally different from tar.zst is random access, plus other important factors like user experience. Here is how you can use it with Pack:

pack -i ./test.pack --include=/a/file.txt

or a couple of files and folders at once:

pack -i ./test.pack --include=/a/file.txt --include=/a/folder/

Use --list to get a list of all files:

pack -i ./test.pack --list

Such random access using --include is very fast. As an example, if I want to extract just one .c file from the whole Linux codebase, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And the gap only widens once you count encryption. For now, Pack encryption is not public, but when it is, you will be able to access a file in a locked Pack file in a matter of milliseconds rather than seconds. Again, you are encouraged to try them for yourself.

And note that by adding Encryption and Locking to Pack, Random Access will be even more beneficial.

Tokariew commented 6 months ago

Compression: for tar I set the default compression level for zstd; on my system I normally have it set higher.

tokariew in ~/tmp  v3.12.2 
➜ time pack --pack -i linux-master/ -o ~/Media/tmp/linux.pack
100%    1.4 GB/1.4 GB (984.9 MB/s)                                               

________________________________________________________
Executed in    1.35 secs    fish           external
   usr time    4.02 secs    0.53 millis    4.02 secs
   sys time    1.35 secs    1.02 millis    1.35 secs

➜ time ZSTD_CLEVEL=3 tar caf ~/Media/tmp/linux.tar.zst linux-master/

________________________________________________________
Executed in    1.54 secs    fish           external
   usr time    5.10 secs    0.00 millis    5.10 secs
   sys time    2.31 secs    1.80 millis    2.31 secs

tokariew in ~/Media/tmp  v3.12.2 
➜ eza --group-directories-first --icons --hyperlink --no-quotes -la linux.*
.rw-r--r--@ 211M tokariew  2 Apr 10:05  linux.pack
.rw-r--r--@ 215M tokariew  2 Apr 10:10  linux.tar.zst

Extracting single file

tokariew in ~/Media/tmp  v3.12.2 
➜ time pack -i linux.pack --include=/linux-master/drivers/net/phy/ax88796b_rust.rs
100%    4.0 kB/4.0 kB (0.0 B/s)                                             

________________________________________________________
Executed in  122.50 millis    fish           external
   usr time  193.94 millis  110.99 millis   82.95 millis
   sys time   42.61 millis   11.67 millis   30.95 millis

tokariew in ~/Media/tmp  v3.12.2 
➜ time tar xf linux.tar.zst linux-master/drivers/net/phy/ax88796b_rust.rs
Warning : decompression does not support multi-threading

________________________________________________________
Executed in    1.11 secs    fish           external
   usr time    1.23 secs    0.09 millis    1.23 secs
   sys time    1.22 secs    1.93 millis    1.21 secs

Tokariew commented 6 months ago

Oh, even decompression of the whole archive seems to be faster.

tokariew in ~/Media/tmp  v3.12.2 
➜ time pack -i linux.pack
100%    1.4 GB/1.4 GB (898.7 MB/s)                                               

________________________________________________________
Executed in    2.29 secs    fish           external
   usr time    2.18 secs    0.00 millis    2.18 secs
   sys time    6.94 secs    1.49 millis    6.94 secs

➜ time tar xf linux.tar.zst
Warning : decompression does not support multi-threading

________________________________________________________
Executed in    5.43 secs    fish           external
   usr time    2.16 secs    0.00 millis    2.16 secs
   sys time    5.79 secs    1.51 millis    5.79 secs

Can pack uncompress using multiple threads?

OttoCoddo commented 6 months ago

Thank you for the numbers. It is still surprising that Pack can surpass tar.zst in both speed and size, while both are using the same compression algorithm.

When unpacking, the difference is expectedly huge: tar needs to scan the whole archive, forcing zstd to decompress all of it, while Pack skips straight to the exact file or files you need.

Can pack uncompress using multiple threads?

Sure. Try giving it a bigger file or folder (more than 16 MB, or even larger). If the content to unpack is small, Pack will not use multiple threads, to save resources.

Tokariew commented 6 months ago

Thank you for the numbers. It is still surprising that Pack can even surpass tar.zst, speed and size while both are using the same compression algorithm.

tokariew in ~/tmp  v3.12.2
➜ du --apparent-size -sk linux-master* | sort
1387888 linux-master
1454050 linux-master.tar

Probably the main reason the tar.zst is bigger than the pack file is that tar stores the information about file attributes and the file tree less efficiently. Nearly 44 MB of extra space comes from file/directory headers in the tested Linux archive. I am not sure how much space each file's attributes take in the pack format, but for tar it is 512 B per entry. The extra 20 MB is probably file padding inside the tar.
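The header arithmetic above can be sanity-checked: each tar entry gets a 512-byte header, and file data is padded up to a 512-byte boundary, which wastes roughly 256 bytes per file on average. Using the ~81K files and ~5K folders quoted earlier in the thread:

```shell
# One 512-byte header per entry (files + directories).
entries=$((81000 + 5000))
headers=$((entries * 512))
# File data is padded to 512-byte blocks: ~256 bytes lost per file on average.
padding=$((81000 * 256))
echo "headers: $((headers / 1000000)) MB, padding: $((padding / 1000000)) MB"
# prints: headers: 44 MB, padding: 20 MB
```

That lines up well with the ~44 MB and ~20 MB gaps observed above.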

Are there plans for supporting higher compression levels in pack?

OttoCoddo commented 6 months ago

I think most of the difference is the file paths, not attributes. Pack could store attributes more efficiently, but as it is a problematic field for privacy (the receiver gets private information about the sender's machine, etc.) and implementation (differences between Windows and Linux standards, etc.), it was chosen to be ignored for now. Maybe it will come as an optional parameter later on if there is proper demand for it.

--press=hard is the only option there is. There may be more in the future, but with Pack you do not need to choose a level (like 1..9 with ZIP). Just let Pack do its thing, and you will be happy. Hard press is there for people who want to pack once and unpack many times (like publishing), where it is worth spending the extra time. Even then, Pack goes the sane way and does not eat your computer just for a kilobyte or two.
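For reference, hard press is just one extra flag on the normal packing command (the paths here are hypothetical):

```shell
# Default: Pack picks its own compression settings.
pack -i ./project/ -o ./project.pack

# Hard press: slower packing, smaller output, for pack-once/unpack-many use.
pack -i ./project/ -o ./project-hard.pack --press=hard
```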

setop commented 5 months ago

source: linux kernel sources

machine Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz

OS: Linux 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian

| tool | size MB | RAM MB | time on disk | time in RAM |
|------|---------|--------|--------------|-------------|
| tar | 1500 | 3 | 26.1 | 0.4 |
| tar + zstd | 218 | 42 | 13.7 | 6.1 |
| pack | 215 | 120 | 5.0 | 4.2 |
| 7z | 147 | 563 | 204.3 | (*) |

*: not done, would probably save 10s.

OttoCoddo commented 5 months ago

@setop Thank you very much for sharing the results. They look good. What was the file system? Is it ever useful and practical to create a pack file in RAM, in your opinion?

setop commented 5 months ago

What was the file system?

It is ext4 FS on an INTEL SSDSCKKF512H6 SATA 512GB

Is it ever useful and practical to create a pack file in RAM, in your opinion?

It was only for benchmark purposes.

But I would have liked the command line not to ignore the --output argument and always put the pack file alongside the input folder 😅

OttoCoddo commented 5 months ago

But I would have liked the command line not to ignore the --output argument and always put the pack file along the input folder

If you do not set --output, the output will be saved alongside the input, if possible.