BitBlt rgbMul is broken for depth 1 and not correctly rounded above depth 4

nicolas-cellier-aka-nice commented 2 years ago

In depth 1, the resulting bits will always be 0.
It's not a big problem because rgbMul is just a bitAnd operation at this depth.
So a quick workaround would be to detect the case in BitBltSimulation

destDepth = 1 ifTrue: [^self bitAnd: sourceWord with: destinationWord].

That would also accelerate the Bit BLock Transfer operation, so it's a good hack.

But there is more. What we want is multiply ratios in interval [0,1].

dstRatio * srcRatio

Our implementation is scaled ratio (scaled by 1 << nBits - 1):

src := (srcRatio * scale) rounded.
dst := (dstRatio * scale) rounded.

So what we want is:

((dst/scale) * (src/scale) * scale) rounded

that is:

(dst*src / (1<<nBits-1)) rounded

Unfortunately, that's the other grief with the current implementation used for rounding:

(dst+1)*(src+1) - 1 >> nBits

It only equals correctly rounded operation for depths 2 and 4.

For rounding we might use:

(((dst/scale) * (src/scale) + 0.5) * scale) truncated.

that is expressed with truncated division:

dst*src + (scale+1//2) // scale

So here is a nicer formulation for doing the job at any depth (including 5bits rgb channels for 16 bits depth) with correctly rounded division:

aux := src * dst + (1 << (nBits - 1)). "add mid-scale for rounding"
result := aux << nBits + aux << nBits. "divide by scale"

This is because instead of dividing by scale, we can multiply by shifted inverse (sort of double precision), then shift right.

(2 to: 32) allSatisfy: [:nBits | (1 << (nBits * 2) / (1 << nBits - 1)) rounded = (1 << nBits + 1)].

Multiplying by this inverse is easy and cheap:

x * (1 << nBits + 1) = (x << nBits + x).

And then applying the right shift >> (2 * nBits) is equivalent to:

x >> nBits + x >> nBits.

We must first add 0.5 (scaled), that is src * dst + (1 << (nBits -1)) - our formulation of aux, and we're done.

We verify:

{
    (0 to: 1<<20-1) allSatisfy: [:i | (1<<9+i)>>10+ (1<<9+i)>>10 = (i/1023) rounded].
    (0 to: 1<<18-1) allSatisfy: [:i | (1<<8+i)>>9+ (1<<8+i)>>9 = (i/511) rounded].
    (0 to: 1<<16-1) allSatisfy: [:i | (1<<7+i)>>8+ (1<<7+i)>>8 = (i/255) rounded].
    (0 to: 1<<14-1) allSatisfy: [:i | (1<<6+i)>>7+ (1<<6+i)>>7 = (i/127) rounded].
    (0 to: 1<<12-1) allSatisfy: [:i | (1<<5+i)>>6+ (1<<5+i)>>6 = (i/63) rounded].
    (0 to: 1<<10-1) allSatisfy: [:i | (1<<4+i)>>5+ (1<<4+i)>>5 = (i/31) rounded].
    (0 to: 1<<8-1) allSatisfy: [:i |  (1<<3+i)>>4+ (1<<3+i)>>4 = (i/15) rounded].
    (0 to: 1<<6-1) allSatisfy: [:i |  (1<<2+i)>>3+ (1<<2+i)>>3 = (i/7) rounded].
    (0 to: 1<<4-1) allSatisfy: [:i |  (1<<1+i)>>2+ (1<<1+i)>>2 = (i/3) rounded].
} allSatisfy: #yourself.

The nice thing is that above down-scaling operation can be multiplexed.
Suppose that we have p groups of 2*nBits M holding square-scale multiplication of each channel concatenated in a double-Word-Mul.

doubleWordMul = Mp .... M5 M3 M1

Note we arrange to have odd channels in low word, and even channels in high word.

We first form a groupMask on a word with (p+1)/2 groups of nBits alternating all one i and all zero o, oioi...ioi.

channelMask := 1 << nBits - 1.
groupMask := 0.
0 to: wordBits // (2 * nBits) do: [:i |
    groupMask = groupMask << (2 * nBits) + channelMask].

Where wordBits is the number of bits in a word (usually we want to operate on 32 bits words in BitBlt).

We form the doubleGroupMask on a double-word with p groups of 2*nBits oi:

doubleGroupMask := groupMask.
doubleGroupMask := doubleGroupMask << wordBits + groupMask.

And we perform the division by scale:

doubleWordMul := (doubleWordMul >> nBits bitAnd: doubleGroupMask) + doubleWordMul >> nBits bitAnd: doubleGroupMask.

At this stage we obtain a double word containing scaled multiplicands interleaved with groups of nBits zeros:

o mp ... o m3 o m1

Now the final result can be obtained by shifting back:

doubleWordMul >> (wordBits - nBits) + (doubleWordMul bitAnd: groupMask)

The only problem remaining is how to obtain the squared-scale multiplicands. It would be easy to form the alternate even-odd channels for each src and dst operands:

doubleWordSrc := src >> nBits bitAnd: groupMask.
doubleWordSrc := doubleWordSrc << wordBits + (src bitAnd: groupMask).
doubleWordDst := dst >> nBits bitAnd: groupMask.
doubleWordDst := doubleWordDst << wordBits + (dst bitAnd: groupMask).

we now get o sp ... o s3 o s1 and 0 dp ... o d3 o d1, but we would now need a SIMD integer multiplication operating on groups of 2*nBits in parallel... We don't have that, at least in portable C code. So we still have to emulate it with a loop.

half := 1 << (nBits - 1).
shift := 0.
doubleWordMul  := 0
0 to: nChannels - 1 do: [:i |
    doubleWordMul := doubleWordMul + (((doubleWordSrc >> shift bitAnd: channelMask) * (doubleWordSrc >> shift bitAnd: channelMask) + half) << shift).
    shift := shift + nBits + nBits].

We know that each operation cannot overflow on upper neighbour group of 2*nBits, because the maximum value is:

(1<<nBits-1) squared + (1 << (nBits-1)) = 1 << (2*nBits) - (2*(1<<nBits)) + (1 << (nBits-1)) - 1
    < (1 << (2*nBits) - 1)

It remains the odd case of 16 bits depth, which has 3 groups of 5 bits and a leading zero. I believe that above algorithm works without splitting in two half-words... To be tested.

We have gathered the pieces for a correctly rounded almost-multiplexed rgbMul.
Somehow have our cake and eat it too.

nicolas-cellier-aka-nice commented 2 years ago

This is implemented here: https://source.squeak.org/VMMaker/VMMaker.oscog-nice.3251.diff

The algorithm is slightly modified to cope with 5-bits RGB channels in 16bits depth.
We still have to handle that case in two half-words and care for the shift of odd channels (green) in the upper half-word:
it must leave enough room (nBits) for double precision red1*red2, which will leak in upper half-word...

Exhaustive tests are also provided for each and every channel combination.

Note that many operations (forming the masks) are repeated at each word of the bitmap, so we could probably accelerate things with additional static variables...

nicolas-cellier-aka-nice commented 2 years ago

If we wanted to handle 16bits depth without split (thus 6 channels in parallel on a 32bits word), that would get a bit tricky due to the dead-bit. We would need two different masks, because shifting a mask does not produce the other one:

lowMask :=  2r00000011111000000111110000011111. "green2-red1-blue1"
highMask := 2r01111100000111110000001111100000. "red2-blue2-green1"
highWordShift := 27.
doubleGroupMask := 2r0000001111100000011111000001111100000011111000000111110000011111. "highMask << highWordShift + lowMask"
doubleWord1 := word1 bitAnd: highMask.
doubleWord2 := word2 bitAnd: highMask.
doubleWord1 := doubleWord1 << highWordShift + (word1 bitAnd: lowMask).
doubleWord2 := doubleWord2 << highWordShift + (word2 bitAnd: lowMask).

Then the shifts for accessing each component in double word would be tricky, either 10 or 11 in the loop (0 10 21 32 42 53).

extraShift := 2r10110.
shift := 0.
0 to: 5 do: [:i |
    doubleWordMul := doubleWordMul + (((doubleWordSrc >> shift bitAnd: channelMask) * (doubleWordSrc >> shift bitAnd: channelMask) + half) << shift).
    shift := shift + (2 * nBits) + (extraShift >> i bitAnd: 1)].

The rest should work unchanged.

marceltaeumel commented 1 year ago

See 561d57da08901d946bda86a5660be16e7c441841

OpenSmalltalk / opensmalltalk-vm

BitBlt rgbMul is broken for depth 1 and not correctly rounded above depth 4 #651