flamewing / mdcomp

Assorted compression formats for the Sega Mega Drive

Possible Kosinski+ improvements #5

Closed · Clownacy closed this issue 6 years ago

Clownacy commented 6 years ago

I think I have a few ideas for changing Kosinski+'s format. Most increase performance without affecting size, but one improves size at the cost of performance.

For the size improvement, I suggest implementing Saxman's zero-fill mode. Normally this only saves a few bytes per file, but LZ's chunks see an improvement of 0x300 bytes, so it does have its moments. I've also thought about extending it so that it can encode runs of bytes other than zero: the fill value would be derived from how far the match distance overshoots the start of the dictionary. For example, if the distance reaches past the start of the dictionary by 2 bytes, it would encode a run of 1's; if by 3 bytes, a run of 2's; and so on. However, I'm not sure how to implement this in the compressor, so I've been unable to test it.
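
Roughly, the decoder side of the extended fill could look like this (a minimal sketch; the function name is made up, and it treats the whole match as a fill run):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A match whose source lies k bytes before the start of the output
    // is decoded as a run of the byte value k-1, so an overshoot of 1
    // reproduces Saxman's zero fill.
    static void apply_match(std::vector<uint8_t> &out, size_t dist, size_t len) {
        auto src = static_cast<std::ptrdiff_t>(out.size()) -
                   static_cast<std::ptrdiff_t>(dist);
        if (src < 0) {
            // Overshot the start of the dictionary by -src bytes.
            out.insert(out.end(), len, static_cast<uint8_t>(-src - 1));
        } else {
            // Normal (possibly overlapping) dictionary copy.
            for (size_t ii = 0; ii < len; ii++) {
                out.push_back(out[static_cast<size_t>(src) + ii]);
            }
        }
    }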

The performance improvements all have to do with making things easier for the decompressor. Here are the ones I'm using.

Firstly, the high/low bytes in the long dictionary reference matches are swapped, allowing the decompressor to do this:

    move.b  (a0)+,d2                ; d2 = %HHHHHCCC.
    move.b  d2,d3                   ; d3 = %11111111 HHHHHCCC.
    lsl.w   #5,d3                   ; d3 = %111HHHHH CCC00000.
    move.b  (a0)+,d3                ; d3 = %111HHHHH LLLLLLLL.

...instead of this:

    move.b  (a0)+,d6                ; d6 = %LLLLLLLL.
    move.b  (a0)+,d4                ; d4 = %HHHHHCCC.
    move.b  d4,d5                   ; d5 = %11111111 HHHHHCCC.
    lsl.w   #5,d5                   ; d5 = %111HHHHH CCC00000.
    move.b  d6,d5                   ; d5 = %111HHHHH LLLLLLLL.
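
Modeled in C++, the combination works out like this (a quick sketch of what the snippets above compute; the example bytes are arbitrary):

    #include <cstdint>
    #include <cstdio>

    int main() {
        uint8_t high = 0xF8;                  // %HHHHHCCC: HHHHH = %11111, CCC = %000.
        uint8_t low  = 0x00;                  // %LLLLLLLL.
        uint16_t d3  = 0xFF00 | high;         // %11111111 HHHHHCCC.
        d3 = static_cast<uint16_t>(d3 << 5);  // %111HHHHH CCC00000.
        d3 = (d3 & 0xFF00) | low;             // %111HHHHH LLLLLLLL.
        // Read as a signed word, this is the negative dictionary displacement.
        std::printf("%d\n", static_cast<int16_t>(d3));  // Prints -256.
    }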

Next, the length bits in the long dictionary references are negated, i.e. the compressor writes them pre-inverted, which lets the decompressor skip its eor. This requires the jmp .mediumcopy(pc,d4.w) to be changed to jmp .mediumcopy-2(pc,d4.w), since 0 is still reserved for triggering extended mode.

The last change is more of a size improvement for the decompressor than anything else: if you write the distance byte before the length bits, you can add the displacement to a5 before branching on them. It also lets you neatly collapse the various .CopyXX blocks into one piece of code. Here's the current code:


    ; Code 00 (Dictionary ref. short).
    _Kos_RunBitStream
    _Kos_ReadBit
    bcs.s   .Copy45
    _Kos_RunBitStream
    _Kos_ReadBit
    bcs.s   .Copy3
    _Kos_RunBitStream
    move.b  (a0)+,d5                ; d5 = displacement.
    adda.w  d5,a5
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    bra.s   .FetchNewCode
; ---------------------------------------------------------------------------
.Copy3:
    _Kos_RunBitStream
    move.b  (a0)+,d5                ; d5 = displacement.
    adda.w  d5,a5
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    bra.w   .FetchNewCode
; ---------------------------------------------------------------------------
.Copy45:
    _Kos_RunBitStream
    _Kos_ReadBit
    bcs.s   .Copy5
    _Kos_RunBitStream
    move.b  (a0)+,d5                ; d5 = displacement.
    adda.w  d5,a5
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    bra.w   .FetchNewCode
; ---------------------------------------------------------------------------
.Copy5:
    _Kos_RunBitStream
    move.b  (a0)+,d5                ; d5 = displacement.
    adda.w  d5,a5
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    move.b  (a5)+,(a1)+
    bra.w   .FetchNewCode

(Zero-fill code trimmed for brevity.) And here's the collapsed version:

    ; Code 00 (Dictionary ref. short).
    move.b  (a0)+,d3                ; d3 = displacement.
    adda.w  d3,a5                   ; Advance source before branching.
    if blockNumber=6
    move.b  (a0)+,d0                ; Fetch a new description field.
    endif
    add.b   d0,d0                   ; Shift first length bit into carry.
    bcc.s   .Copy23
    move.b  (a5)+,(a1)+             ; Two extra bytes for lengths 4-5.
    move.b  (a5)+,(a1)+
.Copy23:
    if blockNumber=5
    move.b  (a0)+,d0                ; Fetch a new description field.
    endif
    add.b   d0,d0                   ; Shift second length bit into carry.
    bcc.s   .Copy2
    move.b  (a5)+,(a1)+             ; One extra byte for lengths 3 and 5.
.Copy2:
    move.b  (a5)+,(a1)+             ; The two bytes every length shares.
    move.b  (a5)+,(a1)+
    bra.w   fourBlocksAhead.FetchNewCode
flamewing commented 6 years ago

I find it doubtful that adding a Saxman-style fill will save that much space; the compressor should be able to compress such sequences by inlining the first byte of the sequence, then doing an overlapping match for the rest. The space cost would be only 9 bits (one description bit plus the inlined literal byte) compared to the Saxman fill.
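
For example, a run of ten zeroes can be encoded as the literal 0x00 followed by a match with distance 1 and length 9; the overlapping copy re-reads bytes it has only just written (an illustrative sketch, not code from the actual decompressor):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    int main() {
        std::vector<uint8_t> out;
        out.push_back(0x00);                 // Inlined literal: 9 bits in the stream.
        size_t src = out.size() - 1;         // Match with distance 1, length 9:
        for (size_t ii = 0; ii < 9; ii++) {  // each iteration copies the byte the
            out.push_back(out[src + ii]);    // previous one just wrote.
        }
        for (uint8_t byte : out) {
            std::printf("%02X ", byte);      // Prints ten zero bytes.
        }
        std::printf("\n");
    }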

The two format changes that make things easier for the decompressor are interesting, and I will be looking into them when I come back.

As for the decompressor size improvement, I fear it might not be possible; I want to say that there is no way to guarantee that the length bits are written after the distance byte, but I will have to think about it more carefully.

Remind me in a week or so when I am back from my vacation.

Clownacy commented 6 years ago

In my modified compressor, I just shuffled these around:

    // Inline dictionary match (code 00): the displacement byte is now
    // written before the two length bits, instead of after them.
    out.descbit(0);
    out.descbit(0);
    out.putbyte(-dist);
    len -= 2;
    out.descbit((len >> 1) & 1);
    out.descbit(len & 1);
    break;

I don't see how a byte/bit ordering issue would happen.

flamewing commented 6 years ago

After thinking about it quite a bit, I agree: what you describe will work. Working on it now.

flamewing commented 6 years ago

For reference, I implemented the Saxman-style zero fill (and your suggested general byte fill) for testing. As I suspected, the gains were less than stellar:

$ kosplus LZ.unc LZ.kospz
$ stat -c '%-12n %s' LZ.*
LZ.kos       9414
LZ.kosp      9414
LZ.kospz     9410
LZ.orig.kos  10224
LZ.unc       41984

That's a grand total of 4 bytes saved on LZ's chunks compared to the unmodified Kos+ encoder. I did get a much better result at first when using the Saxman code directly, but only because I had forgotten to set the proper edge weight.

Clownacy commented 6 years ago

Strange. I changed the edge weight handling too, like this:

    constexpr static void extra_matches(stream_t const *data,
                                        size_t basenode,
                                        size_t ubound, size_t lbound,
                                        LZSSGraph<KosPlusAdaptor>::MatchVector &matches) noexcept {
        ignore_unused_variable_warning(lbound);
        // Can't encode zero match after this point.
        if (basenode >= (SearchBufSize - 1)) {
            return;
        }
        // Try matching zeroes.
        size_t jj = 0;
        while (data[basenode + jj] == 0) {
            if (++jj >= ubound) {
                break;
            }
        }
        // Need at least 2 zeroes in sequence.
        if (jj >= 2) {
            // Got them, so add them to the list.
            matches[jj - 1] = AdjListNode(basenode + jj,
                                          basenode + 1,
                                          jj, dictionary_weight(basenode + 1, jj));
        }
    }

The file seemed to load fine afterwards.

flamewing commented 6 years ago

Just to confirm: you are testing with S1's map256/LZ.bin file when you say you got a 0x300-byte improvement, right? I tried with basically your code above, but adapted to the newer version of the code, and I got only a 4-byte improvement over stock Kos+.

Clownacy commented 6 years ago

...Dammit. I was using a .kosp file I compressed with KENSSharp, which doesn't use LZSS graphs. Sorry.

flamewing commented 6 years ago

No problem. I guess I will close this for now, and leave the byte fill for another, related format.