DavidBuchanan314 opened 6 months ago
Here is the test subject: https://github.com/torvalds/linux. Download it (without the .git folder). At the time of writing it had more than 81K files and 5K folders, and was nearly 1.25 GB in size.
As requested, here are some numbers on tar.zst of Linux source code (the test subject in the note)
Slightly smaller size, and more than 4X faster. (Again, that is on my machine; you need to try it for yourself.) Honestly, Zstandard is great. tar is slowing it down, both because of its old design and single-threaded operation, and because the work is done in two steps: first creating the tar, then compressing it. Pack does all the steps (read, check, compress, and write) together, and this interleaving is what achieves this speed and enables random access.
You can create a compressed tar in one command with `tar caf directory.tar.zst directory`; the `a` switch makes tar detect the compression algorithm based on the output file's extension.
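As a quick sketch of the `a` switch (using gzip here, since zstd may not be installed everywhere; with zstd installed, a `.tar.zst` suffix works the same way):

```shell
# Create a small directory to archive.
mkdir -p demo_dir
echo "hello" > demo_dir/file.txt

# 'c' = create, 'a' = choose the compressor from the output extension,
# 'f' = archive file name. The .tar.gz suffix selects gzip here; a
# .tar.zst suffix would select zstd in the same way.
tar caf demo_dir.tar.gz demo_dir

# Verify the result is valid gzip data.
gzip -t demo_dir.tar.gz && echo "gzip archive OK"
```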
I would be happy to see numbers for tar.gz, tar.zst, and Pack on your particular machine and file system.
I do not expect an order of magnitude difference between tar.zst and Pack; after all, Pack is using Zstandard. What makes Pack fundamentally different from tar.zst is Random Access and other important factors like user experience.
Here is how you can use it:
You can do that with Pack:
pack -i ./test.pack --include=/a/file.txt
or a couple of files and folders at once:
pack -i ./test.pack --include=/a/file.txt --include=/a/folder/
Use `--list` to get a list of all files:
pack -i ./test.pack --list
Such random access using `--include` is very fast. As an example, extracting just one .c file from the whole Linux codebase takes (on my machine) about 30 ms, compared to roughly 500 ms for WinRAR or 2500 ms for tar.gz. And the gap only widens once you count encryption. For now, Pack encryption is not public, but when it is, you will be able to access a file in a locked Pack file in a matter of milliseconds rather than seconds.
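For comparison, the tar-side equivalent of `--include` is naming the member on the extraction command line; a minimal sketch with a gzip-compressed tar (the same syntax applies to .tar.zst). tar still has to stream through the archive to find the member, which is why it slows down as archives grow:

```shell
# Build a small archive with two files.
mkdir -p src/a src/b
echo "alpha" > src/a/one.txt
echo "beta"  > src/b/two.txt
tar caf src.tar.gz src

# Extract only one member; everything before it in the stream is
# still decompressed and skipped, unlike a seek-based format.
tar xf src.tar.gz src/b/two.txt
cat src/b/two.txt   # prints "beta"
```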
Again, you are encouraged to try them for yourself.
And note that by adding Encryption and Locking to Pack, Random Access will be even more beneficial.
For compression, I used the default zstd compression level with tar; on my system I normally have it set higher.
tokariew in ~/tmp v3.12.2
➜ time pack --pack -i linux-master/ -o ~/Media/tmp/linux.pack
100% 1.4 GB/1.4 GB (984.9 MB/s)
________________________________________________________
Executed in 1.35 secs fish external
usr time 4.02 secs 0.53 millis 4.02 secs
sys time 1.35 secs 1.02 millis 1.35 secs
➜ time ZSTD_CLEVEL=3 tar caf ~/Media/tmp/linux.tar.zst linux-master/
________________________________________________________
Executed in 1.54 secs fish external
usr time 5.10 secs 0.00 millis 5.10 secs
sys time 2.31 secs 1.80 millis 2.31 secs
tokariew in ~/Media/tmp v3.12.2
➜ eza --group-directories-first --icons --hyperlink --no-quotes -la linux.*
.rw-r--r--@ 211M tokariew 2 Apr 10:05 linux.pack
.rw-r--r--@ 215M tokariew 2 Apr 10:10 linux.tar.zst
Extracting a single file
tokariew in ~/Media/tmp v3.12.2
➜ time pack -i linux.pack --include=/linux-master/drivers/net/phy/ax88796b_rust.rs
100% 4.0 kB/4.0 kB (0.0 B/s)
________________________________________________________
Executed in 122.50 millis fish external
usr time 193.94 millis 110.99 millis 82.95 millis
sys time 42.61 millis 11.67 millis 30.95 millis
tokariew in ~/Media/tmp v3.12.2
➜ time tar xf linux.tar.zst linux-master/drivers/net/phy/ax88796b_rust.rs
Warning : decompression does not support multi-threading
________________________________________________________
Executed in 1.11 secs fish external
usr time 1.23 secs 0.09 millis 1.23 secs
sys time 1.22 secs 1.93 millis 1.21 secs
Oh, even decompression of the whole archive seems to be faster.
tokariew in ~/Media/tmp v3.12.2
➜ time pack -i linux.pack
100% 1.4 GB/1.4 GB (898.7 MB/s)
________________________________________________________
Executed in 2.29 secs fish external
usr time 2.18 secs 0.00 millis 2.18 secs
sys time 6.94 secs 1.49 millis 6.94 secs
➜ time tar xf linux.tar.zst
Warning : decompression does not support multi-threading
________________________________________________________
Executed in 5.43 secs fish external
usr time 2.16 secs 0.00 millis 2.16 secs
sys time 5.79 secs 1.51 millis 5.79 secs
Can pack uncompress using multiple threads?
Thank you for the numbers. It is still surprising that Pack can even surpass tar.zst, speed and size while both are using the same compression algorithm.
In unpacking or decompression, the difference is expectedly huge, as tar needs to scan the whole archive and forces zstd to decompress all of it; Pack skips straight to the exact file or files you need.
Can pack uncompress using multiple threads?
Sure. Try giving it a bigger file or folder (more than 16 MB, or even more). If the content to unpack is small, Pack will not use multiple threads, to save resources.
Thank you for the numbers. It is still surprising that Pack can even surpass tar.zst, speed and size while both are using the same compression algorithm.
tokariew in ~/tmp v3.12.2
➜ du --apparent-size -sk linux-master* | sort
1387888 linux-master
1454050 linux-master.tar
The main reason the tar.zst file is bigger than the Pack file is probably that tar stores file attributes and the file tree less efficiently. Nearly 44 MB extra comes from file/directory headers for the tested Linux archive. I am not sure how much space each file's attributes take in the Pack format, but for tar it is 512 B per file. The extra 20 MB is probably from file padding inside the tar.
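The 512 B figure can be checked directly; a sketch using tar with a blocking factor of 1 (`-b 1`) so record padding does not mask the per-member cost:

```shell
# Two archives differing by one tiny file, blocking factor 1 (-b 1).
printf x > f1
printf x > f2
tar -b 1 -cf one.tar f1
tar -b 1 -cf two.tar f1 f2

# Each member costs a 512 B header plus data rounded up to 512 B,
# so adding one 1-byte file grows the archive by 1024 bytes.
echo $(( $(wc -c < two.tar) - $(wc -c < one.tar) ))
```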
Are there plans to support higher compression levels in pack?
I think most of the difference is the file paths, not the attributes. Pack could store attributes more efficiently, but since attributes are a problematic field for privacy (the receiver gets private information about the sender's machine, etc.) and for implementation (the differences between Windows and Linux standards, etc.), they are ignored for now. They may become an optional parameter later if there is real demand.
`--press=hard` is the only option there is.
There may be more in the future, but with Pack you do not need to choose a level (like 1..9 with ZIP). Just let Pack do its thing, and you will be happy. Hard press is there for people who pack once and unpack many times (like publishing), where it is worth spending extra time. Even then, Pack goes the sane way and does not eat your computer just for a kilobyte or two.
source: linux kernel sources
machine: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
OS: Linux 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian
tool | size (MB) | RAM (MB) | time on disk (s) | time in RAM (s) |
---|---|---|---|---|
tar | 1500 | 3 | 26.1 | 0.4 |
tar + zstd | 218 | 42 | 13.7 | 6.1 |
pack | 215 | 120 | 5.0 | 4.2 |
7z | 147 | 563 | 204.3 | (*) |
*: not done, would probably save 10s.
@setop Thank you very much for sharing the results. They look good. What was the file system? Is it ever useful and practical to create a pack file in RAM, in your opinion?
What was the file system?
It is ext4 FS on an INTEL SSDSCKKF512H6 SATA 512GB
Is it ever useful and practical to create a pack file in RAM, in your opinion?
It was only for benchmark purposes.
But I would have liked the command line not to ignore the `--output` argument and always put the pack file next to the input folder 😅
But I would have liked the command line not to ignore the --output argument and always put the pack file along the input folder
If you do not set `--output`, the output will be saved alongside the input, if possible.
The comparison on the Notes page of the website does not mention .tar.zst, which I imagine would be a close competitor.
Similarly, zip optionally supports zstd, although support for it isn't particularly widespread yet.
It would also be good to know which test files were used, so the benchmark can be independently replicated.
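For anyone who wants to replicate, here is a rough sketch of such a benchmark (the `pack` invocation is taken from earlier in this thread; point it at any source tree, and substitute .tar.zst for .tar.gz if zstd is installed):

```shell
# Benchmark sketch: compare one-step tar compression against Pack.
# Point SRC at a real source tree (e.g. the Linux sources); a tiny
# throwaway tree is created here so the sketch runs as-is.
SRC=sample-src
mkdir -p "$SRC"
echo "placeholder" > "$SRC/file.txt"

# tar + compressor chosen by extension (.tar.zst needs zstd installed);
# wrap the command in `time` to measure it.
tar caf "$SRC.tar.gz" "$SRC"
wc -c < "$SRC.tar.gz"   # archive size in bytes

# Pack equivalent, if installed (invocation as used in this thread):
# pack --pack -i "$SRC" -o "$SRC.pack"
```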