Format documentation typo in LZSA2.

remy-luisant commented 3 years ago

Hello, Emmanuel.

Given the simplicity of LZSA2, I am using it as a replacement for the Heatshrink library. The role of the compression library is to assist in OS booting on a 32-bit ARMv7-A processor with megabytes of RAM, running at GHz speeds.

I am using a custom decoder that allows me to feed the decompressor byte-by-byte, just as is the case for Heatshrink. This has allowed me to cut on boot times since I can do something else while I wait for more compressed data to stream into my RAM, even if the decompression is no longer very fast. The bottleneck is the flash memory, which I can access in parallel to the decompression, so this was a worthwhile trade.

I have been relying on the GitHub documentation in creating my decoder. While doing so I have encountered this puzzle:

If the encoded match length is 7 or more, the 'M' bits in the token form the value 7, and an extra nibble is read:
0-14: the value is added to the ***3*** stored in the token, and then the minmatch of 2 is added, to compose the final match length.

Shouldn't this be 7? Or 3 bits?

Thank you for the hard work and for the beautiful file format.

emmanuel-marty commented 3 years ago

Hey,

Thank you so much for reporting this!

First of all, as you probably noticed, there is a reference C decoder for LZSA2 here: https://github.com/emmanuel-marty/lzsa/blob/master/src/expand_block_v2.c

The code may be a little unclear because it is designed for fast decompression (in retrospect, being a reference C decoder that will not actually be used in production for the most part - the asm decoders are - I should have written it to look much simpler - and I may yet rework it for that purpose), but hopefully it can help you in writing your decoder as well.

You found an embarassing mistake in the LZSA2 spec, which I think comes from copy-pasting text describing the literals length (which does max out to 3 in the token). So indeed, this should read:

0-14: the value is added to the 7 stored in the token, and then the minmatch of 2 is added, to compose the final match length.

Let me know if that works for you, and if you need any more help with your decoder. Once you have a working implementation, I am also happy to help test your code with synthetic workloads and whatnot :)

remy-luisant commented 3 years ago

Thanks, Emmanuel.

I am indeed aware of the reference decoder and it has been absolutely invaluable in getting my own decoding correct. It really helps to have a working implementation to compare against. I have found a few things to be cryptic, most notably copying out 8 bytes, immediately followed by copying out two bytes, followed by moving the write pointer only by the amount that was intended to be copied out in the first place. This wasn't a barrier to understanding, however.

I am currently successfully decoding a single 64kB block of firmware, currently implementing the detection of block boundaries in the stream format. I wish there was support for raw blocks of arbitrary sizes, but I guess that is both out of the scope of 8-bit computer decompression and would make encoding much more difficult.

While my goal is to sacrifice a lot for clarity and for the ability to feed the program one byte at a time, I am still happy to make my decoder public when it is done and cleaned up. I was already planning on making it open source, even if I doubt that many people would care about the exact very rare use case that I have.

Your offer of workloads and testing is greatly appreciated and I will gladly avail myself of it in a couple of days. What is the best way to contact you for that? (Since I doubt this issue is.)

Also, I would like to take the time to thank you for your approach to this. This is going above and beyond for a typo report. While I don't expect to need much assistance, the supportive approach has certainly improved my day. Thank you.

emmanuel-marty commented 3 years ago

My email is my lastname.my firstname@gmail.com if that's easier for you to communicate this way, whatever works best for you :) I am happy, if you wish, to help test out your implementation even if it's not open source (the overall zlib license for LZSA means: do absolutely whatever you like with it :-). That's, if you need extra testing of course.

I will fix the LZSA2 typo now. If the issue is resolved for you and your decoding works, you can close this issue, and we can talk by email or otherwise, but I am always happy to help. Since LZSA is mostly used for retrocomputing, I fully expect niche and obscure uses, and that's totally fine; a general implementation on modern hardware using a mainstream OS would probably use something like zstd anyway. LZSA has picked up use in 8-bit demos and games, so seeing it used to decode firmware on ARMv7 is really kind of neat!

emmanuel-marty commented 3 years ago

Whoops, committing the fix closed the issue, not my intent - obviously I will wait to hear whether everything is OK on your end!

remy-luisant commented 3 years ago

Everything is good, thank you. I will be in touch via email in a day or two.

emmanuel-marty / lzsa

Format documentation typo in LZSA2. #54