ajbowen249 / dungeon-delver-engine

A Tandy Model 100 and ZX Spectrum implementation of OGL 5.1 in Assembly

[Not Merging] Investigate Huffman Coding #8

Closed ajbowen249 closed 5 months ago

ajbowen249 commented 5 months ago
| Scenario | Total bytes | Down from | Saved |
| --- | --- | --- | --- |
| Ideal (tightly packed, no table) | 3069 | 5267 | 2198 |
| Unpacked (byte-aligned strings) | 3520 | 5267 | 1747 |
| Unpacked + approximate minimal table impl | 3646 | 5267 | 1621 |

This branch will not be merged, but I'm using this PR to save this work for posterity and possible future reconsideration, and as a coherent log of the investigation.

The current compression technique is a quick-and-dirty dictionary coder that saves approximately 1547 bytes in the "flagship" project for DDE. That algorithm was chosen because, in a memory space this small, the implementation of any compression technique must itself be small enough not to eat too much into the saved bytes. To that end, Huffman coding is reasonably simple in theory, but I've been skeptical of my ability to keep the asm side small enough. I was also unsure whether the compression ratio itself would beat the dictionary coder at this scale, though a technique with a more formal definition and wider usage certainly seemed promising. Another major advantage would be allowing special characters (above 127) in compressed text.
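For context, here is the rough shape of such a dictionary coder in Python. This is a sketch of the general technique, not the actual DDE implementation; the `0x80` token range and greedy matching are my assumptions, though tokens living above 127 would explain why special characters are off-limits in compressed text today.

```python
def dictionary_compress(text: str, dictionary: list[str]) -> bytes:
    """Replace each occurrence of a dictionary entry with a single
    token byte at 0x80 + index (hypothetical scheme)."""
    out = bytearray()
    i = 0
    while i < len(text):
        for index, entry in enumerate(dictionary):
            if text.startswith(entry, i):
                out.append(0x80 + index)  # token byte, above the ASCII range
                i += len(entry)
                break
        else:
            out.append(ord(text[i]))  # plain ASCII passes through unchanged
            i += 1
    return bytes(out)
```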

To satisfy my curiosity, I put together a very quick and hacky Huffman table generator that could at least represent the tree in Python and allow some analysis of its characteristics against the flagship project's actual text. Specifically, it let me see the best possible compression performance, as well as speculate about where I would spend more bytes implementing it in asm.
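The real generator is quick and hacky, but the core of building the tree and reading off code depths is small. A minimal sketch of the same idea (the names and structure here are mine, not the actual script):

```python
import heapq
from collections import Counter

def huffman_depths(text: str) -> dict[str, int]:
    """Build a Huffman tree over character frequencies and return the
    code length (tree depth) each unique character ends up with."""
    counts = Counter(text)
    # Heap entries: (combined frequency, unique tiebreaker, {char: depth}).
    heap = [(freq, i, {ch: 0}) for i, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        freq_a, _, depths_a = heapq.heappop(heap)
        freq_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every character in them one level deeper.
        merged = {ch: d + 1 for ch, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (freq_a + freq_b, next_id, merged))
        next_id += 1
    return heap[0][2]
```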

Short answer: Doesn't currently seem worth it.

Long answer: The flagship project currently has 5267 total bytes of text with only 63 unique characters. The generated Huffman tree found `' '` (space) as the most-used character, with some parts of the tree reaching a depth of 12. That is, the most-used characters were shortened from 8 bits down to 3, while the least-used were extended from 8 up to 12, with a broad spectrum in between.

Ideally, given the specific depth of each character times its number of occurrences, the best possible encoding would shorten that text down to 3069 bytes, assuming everything were packed tightly together with no table of any sort. That's a savings of 2198 bytes, about 651 bytes better than the dictionary coder, before accounting for the implementation, which at least initially seemed promising.
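Concretely, that ideal figure is just each character's depth times its occurrence count, summed over the text and divided by 8. Reusing the hypothetical `huffman_depths` sketch above:

```python
import math
from collections import Counter

def ideal_bytes(text: str, depths: dict[str, int]) -> int:
    """Best case: every code packed end to end, no table, no alignment."""
    total_bits = sum(depths[ch] * count for ch, count in Counter(text).items())
    return math.ceil(total_bits / 8)
```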

From here, though, I began to speculate about the cost of the implementation. First off, we need some way to know how long each string is. Any implementation I can think of would require at least one byte per individual string. Realistically, I would want each string to start on a proper byte offset, so we'll incur up to 7 bits of padding per string no matter how we order things. Even if we kept it packed, the addressing system would need an extra three bits per string somewhere for a bit offset.

So, now, I totaled up the byte length of each individual string as its length in compressed bits divided by 8, rounded up, plus a byte for the length. That increased the compressed size of the text to 3520 bytes total, a mere 200-byte improvement over the dictionary compressor, still without accounting for the binary version of the Huffman table or any code needed to use it.
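In the same hypothetical terms as the sketches above, that estimate is:

```python
import math

def unpacked_bytes(strings: list[str], depths: dict[str, int]) -> int:
    """Round each string up to a whole byte, plus one length byte per string."""
    total = 0
    for s in strings:
        bits = sum(depths[ch] for ch in s)
        total += math.ceil(bits / 8) + 1  # +1 for the per-string length byte
    return total
```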

The final addition to the estimate is the least thought-out, but I added two more bytes per unique character to the total: one byte for the character itself, plus at least one more for any method of referencing it I can think of (e.g., as a constant to `ld a, <char>`, an offset index into a table, etc.). With 63 unique characters, that's another 126 bytes, bumping us up to at least 3646 bytes.
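Tying the sketches together, the full estimate under these assumptions would be:

```python
def estimated_total(text_blob: str, strings: list[str]) -> int:
    """Byte-aligned strings plus two table bytes per unique character."""
    depths = huffman_depths(text_blob)
    table_cost = 2 * len(depths)  # the character itself + one byte to reference it
    return unpacked_bytes(strings, depths) + table_cost
```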

At this point, I had a potential savings of just 74 bytes over the current compressor, even with very optimistic guesses at support code size. If used only as a drop-in replacement for existing compression, that would not be a significant enough savings to justify further exploration.

That said, it does still have the advantage of arbitrary characters, meaning fully "graphical" screens could be stored in text blocks. It would also allow for some compressibility in room background data, which currently occupies a total of 1120 bytes in the flagship project. Naively multiplying that by our ideal ratio of 3069:5267 (1120 × 3069 / 5267 ≈ 652 bytes) says it could maybe save us another 468 bytes, but that would require a hefty refactor.

For now, though, it just isn't worth it. ¯\_(ツ)_/¯