Closed ThomasWaldmann closed 8 years ago
From the IRC conversation earlier and so the information persists:
lz4
On the already-compressed front, for lz4 see https://github.com/Cyan4973/lz4/blob/master/lz4_Frame_format.md, section "Data Blocks": a block can be stored uncompressed when compressing it does not help. The legacy frame did not do this, according to the next section, but that has been deprecated. A test on my old machine:
time lz4 postgresql-9.4.4.tar.bz2
Compressed filename will be : postgresql-9.4.4.tar.bz2.lz4
Compressed 17616272 bytes into 17388193 bytes ==> 98.71%
real 0m0.052s user 0m0.021s sys 0m0.030s
l postgresql-9.4.4.tar.bz2*
-rw-r--r-- 1 aklaver users 17616272 Nov 17 15:33 postgresql-9.4.4.tar.bz2
-rw-r--r-- 1 aklaver users 17388193 Nov 17 15:34 postgresql-9.4.4.tar.bz2.lz4
lzma
Hmm, for lzma I have not found anything conclusive yet, though the numbers seem to indicate it still tries to compress.
time lzma postgresql-9.4.4.tar.bz2
real 0m6.894s user 0m6.777s sys 0m0.109s
l postgresql-9.4.4.tar.bz2*
-rw-r--r-- 1 aklaver users 17388193 Nov 17 15:34 postgresql-9.4.4.tar.bz2.lz4
-rw-r--r-- 1 aklaver users 17384834 Nov 17 15:33 postgresql-9.4.4.tar.bz2.lzma
zlib
http://zlib.net/zlib_tech.html "zlib's compression method, an LZ77 variant called deflation, emits compressed data as a sequence of blocks. Various block types are allowed, one of which is stored blocks—these are simply composed of the raw input data plus a few header bytes. In the worst possible case, where the other block types would expand the data, deflation falls back to stored (uncompressed) blocks. Thus for the default settings used by deflateInit(), compress(), and compress2(), the only expansion is an overhead of five bytes per 16 KB block (about 0.03%), plus a one-time overhead of six bytes for the entire stream. Even if the last or only block is smaller than 16 KB, the overhead is still five bytes. In the absolute worst case of a single-byte input stream, the overhead therefore amounts to 1100% (eleven bytes of overhead, one byte of actual data). For larger stream sizes, the overhead approaches the limiting value of 0.03%. "
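To make the quoted numbers concrete, here is a rough Python sketch. It computes the overhead described in the quote (five bytes per 16 KB block plus six bytes per stream) and compares it with what Python's `zlib.compress()` produces on data that cannot be compressed. The helper names `incompressible_bytes` and `quoted_worst_case_overhead` are made up for this illustration, and the SHA-256 counter stream is only a stand-in for already compressed input. The measured figure depends on the zlib build, so treat it as indicative only.

```python
import hashlib
import zlib


def incompressible_bytes(n: int) -> bytes:
    """Deterministic pseudo-random bytes (SHA-256 of a counter).

    Hypothetical helper: stands in for already compressed data.
    """
    blocks = (hashlib.sha256(i.to_bytes(8, "little")).digest()
              for i in range(n // 32 + 1))
    return b"".join(blocks)[:n]


def quoted_worst_case_overhead(n: int) -> int:
    """Overhead in bytes as described in the quoted zlib text:
    5 bytes per (started) 16 KB block plus 6 bytes per stream."""
    blocks = max(1, -(-n // 16384))  # ceiling division, at least one block
    return 5 * blocks + 6


data = incompressible_bytes(1024 * 1024)
compressed = zlib.compress(data)
measured = len(compressed) - len(data)

print("quoted overhead for 1 byte:", quoted_worst_case_overhead(1), "bytes")
print("quoted overhead for 1 MiB :", quoted_worst_case_overhead(len(data)), "bytes")
print("measured overhead, 1 MiB  :", measured, "bytes",
      f"({100 * measured / len(data):.3f}%)")
```

The one-byte case reproduces the 1100% figure from the quote (eleven bytes of overhead for one byte of data), and 5 / 16384 gives the limiting value of about 0.03%.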
I do not have a standalone copy of zlib, so I used gzip, which uses the same deflate compression method:
aklaver@panda:~> time gzip postgresql-9.4.4.tar.bz2
real 0m0.690s user 0m0.667s sys 0m0.023s
aklaver@panda:~> l postgresql-9.4.4.tar.bz2.*
-rw-r--r-- 1 aklaver users 17286547 Nov 18 06:35 postgresql-9.4.4.tar.bz2.gz
-rw-r--r-- 1 aklaver users 17388193 Nov 17 15:34 postgresql-9.4.4.tar.bz2.lz4
-rw-r--r-- 1 aklaver users 17384834 Nov 17 15:33 postgresql-9.4.4.tar.bz2.lzma
Tried on a different machine (VM) that has minizip (a minimal zip tool built on zlib):
aklaver@arkansas:~$ time minizip -o postgresql-9.4.4.zlib postgresql-9.4.4.tar.bz2
MiniZip 1.1, demo of zLib + MiniZip64 package, written by Gilles Vollant
more info on MiniZip at http://www.winimage.com/zLibDll/minizip.html
creating postgresql-9.4.4.zlib
File : postgresql-9.4.4.tar.bz2 is 17616272 bytes
real 0m0.748s user 0m0.717s sys 0m0.030s
aklaver@arkansas:~$ l postgresql-9.4.4.*
-rw-rw-r-- 1 aklaver aklaver 17616272 Nov 18 06:55 postgresql-9.4.4.tar.bz2
-rw-rw-r-- 1 aklaver aklaver 17289462 Nov 18 06:56 postgresql-9.4.4.zlib
So it seems that we don't need to worry about compressed data being larger than the original data.
Close?
I would say so. I will do some more experimenting for my own curiosity and if I find anything I can always open a new issue.
The only format I had some questions about was lzma. Looking at the Borg source code I see lzma.compress() is used. By default the format it uses is xz. xz is, from what I can find, a refinement of lzma. In particular it uses lzma2, which deals better with already compressed data. Some rough testing showed that to be true. It is definitely slower than lz4 or zlib, but it behaves well: the file size does not increase noticeably.
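The rough test described above could look something like the following sketch. It compresses the same incompressible input with Python's `lzma.compress()` twice: once with the default format (`lzma.FORMAT_XZ`, which uses LZMA2) and once with the legacy `lzma.FORMAT_ALONE` container (LZMA1). The helper `incompressible_bytes` is made up for this illustration and its SHA-256 counter stream only stands in for already compressed data. Exact sizes depend on the liblzma version, so none are claimed here.

```python
import hashlib
import lzma


def incompressible_bytes(n: int) -> bytes:
    """Deterministic pseudo-random bytes (SHA-256 of a counter).

    Hypothetical helper: stands in for already compressed data.
    """
    blocks = (hashlib.sha256(i.to_bytes(8, "little")).digest()
              for i in range(n // 32 + 1))
    return b"".join(blocks)[:n]


data = incompressible_bytes(256 * 1024)

# Default container: .xz with the LZMA2 filter.
xz_out = lzma.compress(data)
# Legacy container: .lzma with the LZMA1 filter.
alone_out = lzma.compress(data, format=lzma.FORMAT_ALONE)

for name, out in (("xz (lzma2)", xz_out), ("alone (lzma1)", alone_out)):
    growth = len(out) - len(data)
    print(f"{name:14s} {len(out):8d} bytes, "
          f"growth {growth:+d} bytes ({100 * growth / len(data):+.3f}%)")
```

Comparing the two printed growth figures is a quick way to see how each container handles input it cannot shrink.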
ok, thanks for the investigation. closing this as non-issue for now.
Unclear:
If not, we could easily add a check to borg code and store chunks uncompressed if compression was found to be ineffective.
See also #383.
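For reference, the fallback check mentioned above (store a chunk uncompressed if compression turned out to be ineffective) could be sketched roughly like this. This is not Borg's actual code: the one-byte tags and the names `pack_chunk` / `unpack_chunk` are invented for the illustration, and zlib is used only because it ships with Python.

```python
import hashlib
import zlib

# Hypothetical one-byte tags, not Borg's real on-disk format.
STORED = b"\x00"      # payload is the raw chunk
COMPRESSED = b"\x01"  # payload is zlib-compressed


def pack_chunk(data: bytes) -> bytes:
    """Compress a chunk, but keep the raw bytes if compression does not shrink it."""
    candidate = zlib.compress(data)
    if len(candidate) < len(data):
        return COMPRESSED + candidate
    return STORED + data


def unpack_chunk(blob: bytes) -> bytes:
    """Reverse pack_chunk() by looking at the leading tag byte."""
    tag, payload = blob[:1], blob[1:]
    if tag == COMPRESSED:
        return zlib.decompress(payload)
    if tag == STORED:
        return payload
    raise ValueError(f"unknown chunk tag: {tag!r}")


text_chunk = b"borg " * 1000
# SHA-256 digests stand in for already compressed (incompressible) data.
random_chunk = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

for name, chunk in (("text", text_chunk), ("incompressible", random_chunk)):
    packed = pack_chunk(chunk)
    kind = "compressed" if packed[:1] == COMPRESSED else "stored"
    print(f"{name:15s} {len(chunk):5d} -> {len(packed):5d} bytes ({kind})")
```

With this scheme the worst case costs one extra tag byte per chunk, regardless of which compressor is plugged in.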