Also, I think that making a post on encode.su would be cool, since that's the world's most highly regarded forum dedicated to data compression algorithms (the creator of ffmpeg and qemu is a regular there).
Maybe you could compare with this tool? https://bellard.org/ts_zip/ (by the creator of ffmpeg and qemu)
AFAIK they disabled account creation there, at least last time I checked, @dillfrescott
Oh
Considering LLMs are the best known algorithm for compressing all of humanity's data into several hundred gigabytes, I expect your comparison table to show some jaw-dropping results next to conventional methods. ))))
As good a method as this is, I'm still getting better ratios using paq8px, specifically with the -12la flags; IIRC it uses byte-level entropy coding, among other things, and trains an LSTM model on the fly.
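To illustrate the on-the-fly modeling idea in rough terms, here is a toy sketch of my own (an adaptive order-0 byte model, which is only an assumption; paq8px's real machinery mixes many context models plus the LSTM):

```python
import math

def ideal_compressed_bits(data: bytes) -> float:
    """Ideal code length of `data` under an adaptive order-0 byte model.
    An entropy coder driven by these probabilities would emit roughly
    this many bits; the principle (predict, code, then update the model)
    is the same even when the models are far richer."""
    counts = [1] * 256            # Laplace smoothing: each byte starts at count 1
    total = 256
    bits = 0.0
    for b in data:
        p = counts[b] / total     # model's probability for this byte
        bits += -math.log2(p)     # ideal code length under that probability
        counts[b] += 1            # update the model on the fly
        total += 1
    return bits

if __name__ == "__main__":
    sample = b"the quick brown fox jumps over the lazy dog " * 100
    print(f"{len(sample)} bytes -> ~{ideal_compressed_bits(sample) / 8:.0f} bytes (ideal)")
```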
For example, I compressed a 478 KB file down to 39.4 KB using Meta-Llama-3-8B-Instruct.Q8_0.gguf at a 40% overlap, while paq8pxd took 1/10th the time (I used a 4090 for llama-zip) and compressed the same file down to 35 KB.
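(For reference, that's a ratio of about 478 / 39.4 ≈ 12.1 for llama-zip versus 478 / 35 ≈ 13.7 for paq8pxd.)

Regarding the 40% overlap: windowed processing with fractional overlap can be sketched as follows (a rough illustration under my own assumptions, not llama-zip's actual implementation):

```python
def windows(tokens: list, window_size: int, overlap_frac: float) -> list:
    """Split `tokens` into fixed-size windows, each re-including the
    trailing `overlap_frac` of the previous window so the model keeps
    some context across window boundaries."""
    step = max(1, int(window_size * (1 - overlap_frac)))
    out = []
    for start in range(0, len(tokens), step):
        out.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
    return out

print(windows(list(range(10)), window_size=4, overlap_frac=0.4))
# step = int(4 * 0.6) = 2 -> [[0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9]]
```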
This is still amazing progress in data compression, though. I don't mean for this to take away from its compression abilities at all.
Also, I messed around with nncp for hours tuning its hyperparameters (it's the highest-scoring data compressor on Matt Mahoney's Large Text Compression Benchmark), and paq8px still always gave much better ratios than even nncp.
Thank you all for your advice and interest! As an update, I've now added some compression ratio results for `llama-zip` to the README. It looks like `ts_zip` and `nncp` are closed source and don't come with macOS binaries, though, so I won't be able to run those myself. If anyone is particularly interested in getting them added to the results, I'd be happy to accept a PR. As for `paq8pxd`, after some effort I managed to get this variant to compile, but it looks like the flag options are different from the version you used, @dillfrescott. Namely:
```
paq8pxd107 archiver (C) 2021, Matt Mahoney et al.

To compress:
  paq8pxd107 -slevel file             (compresses to file.paq8pxd107)
  paq8pxd107 -slevel archive files... (creates archive.paq8pxd107)
  paq8pxd107 file                     (level -8 pause when done)
level: -s0          store
       -s1...-s3    (uses 393, 398, 409 MB)
       -s4...-s9    (uses 1.2 1.3 1.5 1.9 2.7 4.9 GB)
       -s10...-s15  (uses 7.0 9.0 11.1 27.0 x.x x.x GB)
You may also compress directories.

To extract or compare:
  paq8pxd107 -d dir1/archive.paq8pxd107      (extract to dir1)
  paq8pxd107 -d dir1/archive.paq8pxd107 dir2 (extract to dir2)
  paq8pxd107 archive.paq8pxd107              (extract, pause when done)

To view contents: paq8pxd107 -l archive.paq8pxd107
```
I went ahead and ran it with `-s12` on `book1` (a 769 KB file) from the Calgary corpus, and it achieved a compression ratio of 4.203, which is not quite as good as `llama-zip` but still competitive. Also, `-s13` is only marginally better (4.205). @dillfrescott, would you happen to know if this is comparable with the variant you used?
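(At that ratio, book1's 769 KB compresses to roughly 769 / 4.203 ≈ 183 KB.)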
`paq8pxd` is a good compressor, but not nearly as good as `paq8px` in my testing. It all depends on the text being compressed and its entropy level, of course.
`paq8px` is now added to the table! Going to close this now, as I think a good number of utilities are now included in the comparison.
Excellent!
@AlexBuz You mentioned that the code for nncp is closed source. While I'm not sure about the license, the actual nncp code does appear to be available as C++ files when you download the Linux version. The only part that is closed source is libnc, a separate library that enables use of the GPU.
Also, Fabrice mentions that even the source code of libnc can be shared with you, given a good enough reason to need it. That doesn't make it open source, but it's something at least.
Ah, I see, so it's only `ts_zip` whose code is unavailable. I will try to compile `nncp` then and get it added to the table.
When you create a new compression algorithm, it's customary to post the compression ratios achieved in a little table vs. similar tools ;) Since they're likely so good here, I'd recommend it ;) A summary of known issues and limitations would be good too.