flamewing / mdcomp

Assorted compression formats for the Sega Mega Drive

Possible way to improve Kosinski+'s speed #8

Open Clownacy opened 5 years ago

Clownacy commented 5 years ago

So I've been working on my own set of graph-based LZSS compressors, and recently added support for Chameleon - the compression used by Kid Chameleon and Sonic 2 Nick Arcade (only for GHZ's chunks).

Chameleon has a quirk that made it impossible to support in your framework (at least as of a year or two ago, when I last tried): it stores its descriptors separately from the rest of the compressed data.

This gave me an idea: Comper uses 16-bit descriptors, and can read each one with a single instruction because everything else in the format is also 16 bits long. That isn't the case with Kosinski, which has to read each descriptor byte separately. However, bunching the descriptors together would guarantee that each one starts at an even address, allowing faster word-sized reads.
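Roughly, the difference would be this (a sketch, not the actual decoder; a0 is assumed to be the stream pointer, d6/d7 scratch registers):

; as it is now: a descriptor may sit at an odd address,
; so it has to be fetched one byte at a time
move.b (a0)+,d7 ; first byte of the descriptor
move.b (a0)+,d6 ; second byte of the descriptor
; with all descriptors bunched at the word-aligned start of the file:
move.w (a0)+,d7 ; the whole 16-bit descriptor in one instruction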

The downside is that the compressed file would need to store an offset to the end of the descriptor data (assuming the file starts with the descriptors, like Chameleon does), which in turn imposes a file-size limit.

flamewing commented 5 years ago

Re: Chameleon: It should be easy to adapt the framework for this case. I could add a pair of classes that are selected in the format adaptor ("using stream_kind = xxx") and control the behavior of LZSS(I|O)Stream:

LZSS(I|O)Stream already encapsulate a lot of this behavior: LZSSOStream, for example, has an internal buffer for non-descriptor data that is used until the current descriptor is filled; the descriptor is then written to the stream, and the buffer is flushed after it. This one could easily be modified to write the descriptors into a buffer instead, deferring the flushes to the stream until after all data has been written. LZSSIStream would require slightly more work.

But this work would be useful in any case: I am planning on adding some non-LZSS formats that would use LZSS(I|O)Stream (which have nothing exclusive to LZSS in them), and some of them could use a similar idea of splitting the streams. As a bonus, this could even be adapted for the S2 Special Stage mapping compression format (which could really use a name... and I don't want to put my name on it, despite being the one who figured it out).

Re: the file-size limit from the 2-byte offset: in the worst-case scenario, every descriptor bit corresponds to a literal, so each descriptor byte covers only 8 bytes of data. Even then, the 64 kB of descriptors that a 2-byte offset can address cover 512 kB of data, several times the Mega Drive's entire 64 kB of work RAM, and a file that compresses this badly actually grows when compressed. For any more typical case, the effective limit is far larger still, because each descriptor byte covers much more data.

I am skeptical that this would give a significant speed boost for Kosinski+, because descriptor reads are infrequent compared to other reads: the best possible gains come from the worst possible scenario in terms of compression (every descriptor bit corresponds to a literal byte). In this case, 1 descriptor byte is read for every 8 literal bytes, so the savings from reading 2 descriptor bytes in one instruction are 16 cycles (one less move.b (a0)+,d0, one less moveq #7,d2, and a dbra that branches instead of falling through) out of every 608 cycles (two full descriptor rounds of literal copies). That is about 2.6% in this best/worst case. Once dictionary matches enter the picture, the gains are diluted even further by many more memory copies. It does, however, pay for the additional moveq #0,d0, move.w (a0)+,d0, and lea (a0,d0.l),a2 at the start of the decoder for any file that needs more than 2 descriptor bytes.
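Spelled out, that prologue is just the following (with a0 as the compressed-stream pointer and a2 ending up as the data-stream pointer):

moveq #0,d0 ; clear d0 so the offset can be used as a long index
move.w (a0)+,d0 ; read the 2-byte offset to the end of the descriptors
lea (a0,d0.l),a2 ; a2 = start of the non-descriptor data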

Comper gets a benefit from 16-bit memory access because every memory operation it does is a word operation. Thus, the gains scale in direct proportion to the length of dictionary matches, instead of being diluted by it.
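For illustration (a sketch, not code from the repository; a1/a2 are assumed destination/source pointers), Comper's copy loop moves a word per iteration, so even alignment helps on every single access, while a byte-oriented copy gains nothing from it:

move.w (a2)+,(a1)+ ; Comper-style: every iteration is a word operation
move.b (a2)+,(a1)+ ; byte-oriented: alignment does not help here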

But there is one potential additional benefit that you did not consider: that of compression ratio. The tail end of the descriptor stream could potentially be equal to the start of the non-descriptor data stream. Thus, the starting offset could point to somewhere before the end of the descriptor stream, and save a few bytes. And it does not even have to point to a word, because the non-descriptor data is bytes anyway.

I will study to see if there is a benefit from this in terms of compression ratio for Kosinski+.

But as for extending the framework to handle separate streams for descriptors and data: it is a good idea and I will definitely do it, possibly at the same time that I optimize the whole thing for speed by getting rid of most of the C++ stream APIs in the internal compression classes.

flamewing commented 3 years ago

One more advantage of splitting the descriptor and data streams which I did not realize before: because there is an explicit end-of-stream code in the format, the compressor can treat the descriptor stream as being composed of bytes, and the 68k decompressor can just pretend that the descriptor stream is composed of words/longwords. This means that the descriptor stream will not have wasted bytes even if there is no useful overlap between the end of the descriptor stream and the start of the data stream.
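A sketch of what the 68k side could look like (assumed register conventions and flag polarity; not the actual decoder):

.newblock:
move.l (a0)+,d7 ; fetch 32 descriptor flags at once (even address);
; reading past the true end of the descriptor bytes is
; harmless, as the end-of-stream code stops decoding first
moveq #32-1,d6 ; number of flags remaining in d7
.mainloop:
add.l d7,d7 ; shift the next flag into the carry
bcc.s .match ; assumed convention: 0 = dictionary match (handler
; elided; it would jump back to the dbra below)
move.b (a2)+,(a1)+ ; literal: copy one byte from the data stream
dbra d6,.mainloop
bra.s .newblock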

flamewing commented 3 years ago

Another way to increase speed without a format change is @vladikcomper's RLE mode from ComperX (pull #21): if the displacement is -1, cache the value in a register (say, d0) and use move.b d0,(a1)+ instead of re-reading it.
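A minimal sketch of that idea (assuming a1 is the destination pointer and d1 holds the copy count minus 1):

move.b -1(a1),d0 ; cache the byte at displacement -1
.fill:
move.b d0,(a1)+ ; write the cached byte instead of re-reading it
dbra d1,.fill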

A tweak on that might be to use movep.l/movep.w for longer copies, saving the additional 4 cycles of instruction prefetch per write at the cost of prefetching the displacement. d0 could be filled in like this:

; assumes a1 points at the byte to repeat (the last byte written)
movep.w 0(a1),d0 ; byte to repeat goes into bits 15-8 of d0
move.b (a1)+,d0 ; same byte into bits 7-0; a1 now points at the destination
move.w d0,d1 ; stash the doubled byte,
swap d0 ; move it into the upper word of d0,
move.w d1,d0 ; and refill the lower word: all 4 bytes of d0 are now equal

And the unrolled loop could be something like:

movep.l d0,0(a1) ; fill even bytes 0/2/4/6
movep.l d0,1(a1) ; fill odd bytes 1/3/5/7
movep.l d0,8(a1) ; bytes 8/10/12/14
movep.l d0,9(a1) ; bytes 9/11/13/15
movep.l d0,16(a1) ; bytes 16/18/20/22
movep.l d0,17(a1) ; bytes 17/19/21/23
movep.l d0,24(a1) ; bytes 24/26/28/30
movep.l d0,25(a1) ; bytes 25/27/29/31
lea 32(a1),a1 ; advance the destination past the 32 bytes just written

with a suitable Duff's device before the loop body to handle counts that are not a multiple of 32.

flamewing commented 2 years ago

Blocked by #25

Also, I think I will make this into a new format and call it KosX. I will also be appropriating some ideas from ComperX, namely the recognition that copying from an offset of -1 can be done faster, and probably changing the way offset -1 is encoded for faster checking.