Open dmbarbour opened 10 years ago
Candidate: variation on LZSS that outputs 8 bytes for every literal flag, and otherwise uses greedy matches. Use fixed width offset encoding (otherwise reasoning about optimality is difficult), but potentially a variable-width length encoding.
We might need to reduce match size just a little if the last elements to turn a subsequent literal into a minimum sized match, but only if we're already peaking on match size.
LZSS8 comes at an opportunity cost: some opportunities for compression are lost within and on the edges of those 8-byte literal sequences. This shouldn't be an issue for pure ABC, but we can regain some opportunities if the window size is configured properly to compress even byte pairs.
Candidate: Apply LZSS normally (for optimal encoding), then apply a second compression pass that provides a bit flag indicating whether the next 8 elements were all literals. Choose a minimum match size such that the worst case for a single match of eight elements is better than 1:64 overhead.
Example: if minimum match size is three, and offset-length pair is 16 bits, then the worst case is slightly better than 1:64 even if there is any compression at all (56+16 bits to encode literals+match, plus another 8 bits for LZSS flags = 80 bits, plus one bit for the second pass compression, for 1:80 expansion with any match at all vs. 1:64 for all literals).
This has the advantage of being very simple and byte alignable, while ensuring acceptable worst-case performance. Embedded binaries that aren't compressible would expand 1:64 + 1:128 for a total 2.34%, which is under the 2.5% goal. A lot of ABC code should compress a lot better, but I've yet to see the actual ratios.
I'm thinking: window size 2k, match size 3..34 (0..31 + 3). ABC has other options for 'compressing' very repeated sequences, via ABCD and resources.
Finish designing/choosing compression and encryption algorithms for ABC resources.
Requirements and Desiderata for Compression:
The low worst-case expansion should include the existing expansion of 1:128 for large embedded binaries. Binaries - e.g. cipher text, secure hashes, compressed audio-video data - are the worst culprits for expansion anyway. Anyhow, this gives us up to another 1:64 acceptable expansion ratio for 2.34% expansion total.