emmanuel-marty / lzsa

Byte-aligned, efficient lossless packer that is optimized for fast decompression on 8-bit micros
Other
233 stars 30 forks source link

Faster 6502 decompressor & suggestions for possible format changes. #37

Closed jbrandwood closed 4 years ago

jbrandwood commented 4 years ago

Hi, I've been trying out LZSA and I am really impressed with the work that you've done!

I have written a 6502 LZSA2 decompressor that seems to be a bit more efficient than Peter Ferrie's code that you are currently distributing.

I've also identified a couple of potential changes that you might consider making to the LZSA2 format to make it decompress a bit better on the 6502.

Here are some test results (from running the 6502 code on a HuC6280 which has slightly different instruction timings to a standard 6502) ...

lzsa2: Peter Ferrie's 6502 FAST decompression (code 253 bytes long) of gpl-3.0.txt (35147 bytes decompressed)

2723148 cycles : 77 cycles per byte decompressed.

lzsa2: 6502 elmer-standard decompression (code 252 bytes long, 1 bytes shorter) of gpl-3.0.txt (35147 bytes decompressed)

1937239 cycles with LZSA_BEST_SIZE==1 and LZSA_SLOW_NIBL==1 (29% faster than Peter Ferrie's code).

lzse2: 6502 elmer-modified decompression (code 243 bytes long, 10 bytes shorter) of gpl-3.0.txt (35147 bytes decompressed)

1912587 cycles with LZSA_BEST_SIZE==1 and LZSA_SLOW_NIBL==1 (30% faster than Peter Ferrie's code).

24652 cycles saved by changing format.

lzsa2: 6502 elmer-standard decompression (code 267 bytes long, 14 bytes longer) of gpl-3.0.txt (35147 bytes decompressed)

1813175 cycles with LZSA_BEST_SIZE==0 and LZSA_SLOW_NIBL==1 (33% faster than Peter Ferrie's code).

lzse2: 6502 elmer-modified decompression (code 258 bytes long, 5 bytes longer) of gpl-3.0.txt (35147 bytes decompressed)

1788523 cycles with LZSA_BEST_SIZE==0 and LZSA_SLOW_NIBL==1 (35% faster than Peter Ferrie's code).

1788523 cycles : 50 cycles per byte decompressed.

24652 cycles saved by changing format.

For comparison ...

aplib 6502 elmer-standard decompression (code 256 bytes long) of gpl-3.0.txt (35147 bytes decompressed)

2668144 cycles : 75 cycles per byte decompressed.

I'm attaching the decompressor here, and I have checked it into my fork of lzsa, together with the potential changes to the compressor that I wish to bring to your attention.

Note: Of the two potential changes, one seems like it wouldn't cause any slowdown in decompressors on other targets, but the second change probably would.

I can't currently see any simple and elegant way to make your compressor do a platform-specific change to the output format at runtime, but I don't consider that the changes that I propose could justify calling them a separate format.

Anyway, thanks for your consideration!

lzsa2_6502.txt

emmanuel-marty commented 4 years ago

Thank you so much again for the kind words!

I quickly adapted your LZSA2 decompressor for the ACME assembler syntax and checked it in as decompress_faster_v2.asm, hopefully that's fine. A ~30% speed gain is HUGE and will be of interest to everyone, as it would let LZSA2 be used for tasks where only LZSA1 or LZ4 was usable before, such as real-time decompression of precalc data in demos.

I don't think the ~2% further gain is worth breaking compatibility, however, I have a couple of other changes (that provide a free, small ratio increase) that would also break compatibility, as well as more involved changes (not free in decompression time, but would provide a bigger ratio increase) that might all be all worth rolling into one incompatible format extension or new format altogether. In the meantime, I'm happy for you to use LZSA2 in HuC and you should definitely apply these changes locally, as you will control both ends. I don't have changes queued to shrink_block_v2.c and if I do, I'll be happy to be mindful of your local changes so that they merge cleanly.

I am however also happy to provide a window option to LZSA (once we worked that out with apultra, I'll know what you expect) as that would most likely result in a much larger speed boost and won't break anything for other CPUs.

Are you 'the' John Brandwood of Gryzor/Renegade/Ocean fame? If so, it's a big honor :-)

jbrandwood commented 4 years ago

Hi, I've continued to work on the LZSA2 decompressor, and have removed another 11 bytes of code at only a minimal cost in speed (it is still 30+% faster that Peter Ferrie's FAST code, but now it is only 3 bytes longer). The SMALL version is also smaller, now smaller than the current SMALL 6502 code, but it is still 25% faster than the FAST 6502 code.

I have also written an LZSA1 decompressor that helps to restore the difference between LZSA1 and LZSA2. ;-)

lzsa1 test with current SMALL code ... 172 bytes long, takes 2,909,491 cycles. lzsa1 test with current FAST code ... 307 bytes long, takes 2,728,523 cycles. lzsa1 test with new SMALL code ... 168 bytes long, takes 2,159,229 cycles. lzsa1 test with new FAST code ... 205 bytes long, takes 1,545,259 cycles.

lzsa2 test with new FAST code ... 256 bytes long, takes 2,248,796 cycles.

I'm attaching both of the latest decompressors to this message.

Happy New Year!

P.S. Yes, I am "that" John Brandwood ... it's really nice that so many people still have fond memories of the work that I did so very, very long ago now! :-)

lzsa-6502-2019-12-31.zip

emmanuel-marty commented 4 years ago

Wow, thank you, that's fantastic work! I checked in the updated, smaller LZSA2 depacker and your new LZSA1 decompressor as well, quickly converted to ACME syntax. They work great on C64 as well, this should hopefully be useful to many.

Happy New Year as well!