dmbarbour / awelon

Awelon project is a new UI model with a new language.
Other
101 stars 4 forks source link

Compression of ABC Resources #17

Open dmbarbour opened 10 years ago

dmbarbour commented 10 years ago

Finish designing/choosing compression and encryption algorithms for ABC resources.

Requirements and Desiderata for Compression:

The low worst-case expansion should include the existing expansion of 1:128 for large embedded binaries. Binaries - e.g. cipher text, secure hashes, compressed audio-video data - are the worst culprits for expansion anyway. Anyhow, this gives us up to another 1:64 acceptable expansion ratio for 2.34% expansion total.

dmbarbour commented 10 years ago

Candidate: variation on LZSS that outputs 8 bytes for every literal flag, and otherwise uses greedy matches. Use fixed width offset encoding (otherwise reasoning about optimality is difficult), but potentially a variable-width length encoding.

We might need to reduce match size just a little if the last elements to turn a subsequent literal into a minimum sized match, but only if we're already peaking on match size.

LZSS8 comes at an opportunity cost: some opportunities for compression are lost within and on the edges of those 8-byte literal sequences. This shouldn't be an issue for pure ABC, but we can regain some opportunities if the window size is configured properly to compress even byte pairs.

dmbarbour commented 10 years ago

Candidate: Apply LZSS normally (for optimal encoding), then apply a second compression pass that provides a bit flag indicating whether the next 8 elements were all literals. Choose a minimum match size such that the worst case for a single match of eight elements is better than 1:64 overhead.

Example: if minimum match size is three, and offset-length pair is 16 bits, then the worst case is slightly better than 1:64 even if there is any compression at all (56+16 bits to encode literals+match, plus another 8 bits for LZSS flags = 80 bits, plus one bit for the second pass compression, for 1:80 expansion with any match at all vs. 1:64 for all literals).

This has the advantage of being very simple and byte alignable, while ensuring acceptable worst-case performance. Embedded binaries that aren't compressible would expand 1:64 + 1:128 for a total 2.34%, which is under the 2.5% goal. A lot of ABC code should compress a lot better, but I've yet to see the actual ratios.

I'm thinking: window size 2k, match size 3..34 (0..31 + 3). ABC has other options for 'compressing' very repeated sequences, via ABCD and resources.