AlexBuz / llama-zip

LLM-powered lossless compression tool
Other
252 stars 9 forks

Post compression ratios #3

Closed lee-b closed 5 months ago

lee-b commented 5 months ago

When you create a new compression algorithm, it's customary to post the compression ratios achieved in a little table vs. similar tools ;) Since they're likely so good here, I'd recommend it ;) A summary of known issues and limitations would be good too.

dillfrescott commented 5 months ago

Also, I think that making a post on encode.su would be cool, since that's the world's most highly regarded forum dedicated to data compression algorithms (the creator of ffmpeg and qemu is a regular there).

asukaminato0721 commented 5 months ago

Maybe could compare with this tool? https://bellard.org/ts_zip/ (the creator of ffmpeg and qemu)

secemp9 commented 5 months ago

AFAIK they disabled creating an account there, at least last time I checked @dillfrescott

dillfrescott commented 5 months ago

Oh

bunnyfu commented 5 months ago

Considering LLMs are the best known algorithm for compressing all of humanity's data into several hundred gigabytes, I expect your comparison table to show some jaw-dropping results next to conventional methods. ))))

dillfrescott commented 5 months ago

As good as this method is, I'm still getting better ratios using paq8px, specifically with the -12la flags. It uses byte-level entropy coding, IIRC, among other things, and trains an LSTM model on the fly.

For example, I compressed a 478 KB file down to 39.4 KB using Meta-Llama-3-8B-Instruct.Q8_0.gguf at a 40% overlap, while paq8pxd took 1/10th the time (I used a 4090 for llama-zip) and compressed the same file down to 35 KB.
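A quick back-of-the-envelope check of what those reported sizes work out to as ratios (this just does the arithmetic on the numbers above; it doesn't rerun either tool):

```python
# Compression ratios from the sizes reported above (original / compressed).
original_kb = 478
llama_zip_kb = 39.4   # Meta-Llama-3-8B-Instruct.Q8_0.gguf, 40% overlap
paq8_kb = 35          # paq8pxd result reported above

llama_zip_ratio = original_kb / llama_zip_kb   # ~12.13x
paq8_ratio = original_kb / paq8_kb             # ~13.66x
print(f"llama-zip: {llama_zip_ratio:.2f}x, paq8pxd: {paq8_ratio:.2f}x")
```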

This is still amazing progress in data compression, though; I don't mean for this to take away from llama-zip's compression abilities at all.
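For readers unfamiliar with how these compressors work, the "entropy coding driven by an on-the-fly model" idea can be sketched with a toy adaptive order-0 byte model. This is purely an illustration I'm adding (paq8px's real predictor uses context mixing plus an online-trained LSTM and is far stronger); it just shows the principle that an entropy coder spends about -log2(p) bits per symbol, where p is the model's probability for the symbol before it is seen:

```python
import math

def ideal_code_length_bits(data: bytes) -> float:
    """Ideal entropy-coded size (in bits) under a simple adaptive
    order-0 byte model with Laplace smoothing.

    The model is updated after each symbol is "coded", mimicking the
    way on-the-fly compressors train their predictor as they go.
    """
    counts = [1] * 256   # Laplace-smoothed symbol counts
    total = 256
    bits = 0.0
    for b in data:
        bits += -math.log2(counts[b] / total)  # cost of coding this byte
        counts[b] += 1                         # update model afterwards
        total += 1
    return bits
```

Highly repetitive data quickly becomes cheap to code under this model, while varied data stays close to 8 bits per byte; stronger predictors (LSTMs, LLMs) push the per-symbol cost down further by modeling longer-range context.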

dillfrescott commented 5 months ago

Also, I messed around with nncp for hours tuning the hyperparameters (it's the highest-scoring data compressor on Matt Mahoney's Large Text Compression Benchmark), and paq8px still always gave much better ratios than even nncp.

AlexBuz commented 5 months ago

Thank you all for your advice and interest! As an update, I've now added some compression ratio results for llama-zip to the README. It looks like ts_zip and nncp are closed source and do not come with macOS binaries, though, so I won't be able to run those myself. If anyone is particularly interested in getting them added to the results, I'd be happy to accept a PR. As for paq8pxd, after some effort I managed to get this variant to compile, but it looks like its flag options differ from those of the version you used, @dillfrescott. Namely:

paq8pxd107 archiver (C) 2021, Matt Mahoney et al.

To compress:
  paq8pxd107 -slevel file               (compresses to file.paq8pxd107)
  paq8pxd107 -slevel archive files...   (creates archive.paq8pxd107)
  paq8pxd107 file                       (level -8 pause when done)
level: -s0          store
  -s1...-s3         (uses 393, 398, 409 MB)
  -s4...-s9         (uses 1.2  1.3  1.5  1.9 2.7 4.9 GB)
  -s10...-s15       (uses 7.0  9.0 11.1 27.0   x.x x.x GB)
You may also compress directories.

To extract or compare:
  paq8pxd107 -d dir1/archive.paq8pxd107      (extract to dir1)
  paq8pxd107 -d dir1/archive.paq8pxd107 dir2 (extract to dir2)
  paq8pxd107 archive.paq8pxd107              (extract, pause when done)

To view contents: paq8pxd107 -l archive.paq8pxd107

I went ahead and ran it with -s12 on book1 (a 769 KB file) from the Calgary corpus, and it achieved a compression ratio of 4.203, which is not quite as good as llama-zip but still competitive. Also, -s13 is only marginally better (4.205). @dillfrescott, do you happen to know how this compares with the variant you used?
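For concreteness, assuming the file is the canonical Calgary-corpus book1 (768,771 bytes), those ratios translate to roughly 183 KB compressed, and the gap between -s12 and -s13 is under a hundred bytes:

```python
# Back-calculating compressed sizes from the ratios reported above,
# assuming book1's canonical Calgary-corpus size of 768,771 bytes.
BOOK1_BYTES = 768_771

size_s12 = BOOK1_BYTES / 4.203   # -s12 ratio
size_s13 = BOOK1_BYTES / 4.205   # -s13 ratio
print(f"-s12: ~{size_s12:,.0f} B, -s13: ~{size_s13:,.0f} B, "
      f"difference: ~{size_s12 - size_s13:.0f} B")
```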

dillfrescott commented 5 months ago

paq8pxd is a good compressor, but not nearly as good as paq8px in my testing.

dillfrescott commented 5 months ago

But it all depends on the text compressed and its entropy level, of course.

AlexBuz commented 5 months ago

paq8px is now added to the table! Going to close this, as I think a good number of utilities are included in the comparison.

dillfrescott commented 5 months ago

Excellent!

dillfrescott commented 5 months ago

@AlexBuz You mentioned that the code for nncp is closed source. While I am not sure about the license, the actual nncp code does appear to be available as C++ files when you download the Linux version. The only closed-source part is libnc, a separate library that enables GPU use.

dillfrescott commented 5 months ago

Also, Fabrice mentions that even the source code of libnc can be shared with you, given you have a good enough reason to need it. That doesn't make it open source, but it's something at least.

AlexBuz commented 5 months ago

Ah, I see, so it's only ts_zip whose code is unavailable. I will try to compile nncp then and get it added to the table.