cpldcpu / BitNetMCU

Neural networks with low-bit weights on low-end 32-bit microcontrollers such as the CH32V003 RISC-V microcontroller and others

short multiplication #2

Open kimstik opened 6 months ago

kimstik commented 6 months ago
        for (uint32_t j = 0; j < 8; j++) {
            int32_t in=*activations_idx++;
            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            sum += tmpsum;                                  // sign*in*1
            if (weightChunk & 0x40000000) sum += tmpsum<<3; // sign*in*8
            if (weightChunk & 0x20000000) sum += tmpsum<<2; // sign*in*4
            if (weightChunk & 0x10000000) sum += tmpsum<<1; // sign*in*2
            weightChunk <<= 4;
        }

I'm having some concerns about the line `sum += tmpsum;`. Isn't it redundant? If the weight element is in SDDD format, the sum of three shifts should be sufficient. Here's the relevant code section:

            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            if (weightChunk & 0x10000000) sum += tmpsum;    // sign*in*1
            if (weightChunk & 0x20000000) sum += tmpsum<<1; // sign*in*2
            if (weightChunk & 0x40000000) sum += tmpsum<<2; // sign*in*4
            weightChunk <<= 4;

upd: It appears that I overlooked the fact that the value 0 is not used...

cpldcpu commented 6 months ago

Yes, this is to scale the numbers so that the distance between the first positive and first negative element is the same as the distance between the first and second.

e.g. the encoding for two bits is -3, -1, 1, 3 (or -1.5, -0.5, 0.5, 1.5) with a constant delta.

With two's complement the encoding would be asymmetric: -2, -1, 0, 1. With one's complement without the shift, you would get a double zero (-1, 0, 0, 1) or a double step around zero (-2, -1, 1, 2).

The encoding is more regular that way and there are no redundant codes. The weights are centered symmetrically around zero in most layers, so you don't want a huge gap there.

The impact of this encoding on network capacity and loss is minimal, though. QAT is able to redistribute quantization errors fascinatingly well.
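
To make the decoding concrete, here is a minimal sketch (function name is mine, not from the repo): with the unconditional baseline term, the 4-bit code S,D2,D1,D0 decodes to sign * (1 + 8*D2 + 4*D1 + 2*D0), i.e. the odd values ±1, ±3, ..., ±15 with a constant step of 2 and no gap around zero.

    #include <stdint.h>

    // Decode one SDDD nibble: w = ±(2*m + 1), m = 0..7.
    // The unconditional 'sum += tmpsum' in the loop above is the +1 term.
    static int32_t decode_sddd(uint32_t nibble) {
        int32_t w = 2 * (int32_t)(nibble & 7) + 1;  // odd magnitude 1..15
        return (nibble & 8) ? -w : w;               // sign bit
    }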

kimstik commented 6 months ago

Did you notice that Clang generates linear code without jumps (at the cost of a slight size increase)? They seem to rely on static branch prediction. It would be interesting to finally find out how expensive conditional branches on the QingKe RISC-V2A really are...
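
For comparison, one standard branchless formulation of the conditional negate (a sketch; not necessarily what Clang actually emits):

    int32_t mask = (int32_t)weightChunk >> 31;  // 0, or all-ones if the sign bit is set
    int32_t tmpsum = (in ^ mask) - mask;        // equals (sign set) ? -in : in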

cpldcpu commented 6 months ago

I didn't check Clang specifically, but there were quite a few differences in code generation between optimization options. To be honest, I thought the RV32EC code was a bit cumbersome. The only bit that can be checked without masking is the MSB.

In the end, the biggest impact on execution speed was moving the inference code to SRAM, though.
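
For reference, a minimal sketch of how a function can be placed in SRAM with GCC. The ".highcode" section name follows WCH's example projects and is an assumption here; it requires a linker script that locates the section in RAM and startup code that copies it from flash:

    #include <stdint.h>

    // Hypothetical function; the attribute is the important part.
    __attribute__((section(".highcode"), noinline))
    int32_t fc_layer_ram(const uint32_t *weights, const int8_t *act, uint32_t n) {
        int32_t sum = 0;
        // ... inference inner loop runs from SRAM, avoiding flash wait states ...
        (void)weights; (void)act; (void)n;
        return sum;
    }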

kimstik commented 6 months ago

Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.

kimstik commented 6 months ago

I achieved branchless linear code by processing two activations simultaneously in GCC:

#include <stdint.h>

// W packs two 4-bit weights into 8 bits by interleaving:
// (S1,D12,D11,D10),(S2,D22,D21,D20) -> (S1,S2,D10,D20,D11,D21,D12,D22)
static inline int vmac8x4(uint32_t *act, int32_t W) {
    // Conditional negate within each 16-bit half: XOR with a 12-bit all-ones
    // field plus add-one, both selected by the two sign bits.
    static const int M[4]   = { 0x00000000, 0x00000fff, 0x0fff0000, 0x0fff0fff, };
    static const int ONE[4] = { 0x00000000, 0x00000001, 0x00010000, 0x00010001, };  // FIXME: perhaps we can get rid of this part for the sign application
    int acc = 0;
    for (uint32_t j = 0; j < 2; j++) {
        uint32_t in = *act++;
        for (uint32_t k = 0; k < 2; k++) {
            int tmp = (M[(W>>(32-2))&0x3] ^ (in & 0x00ff00ff)) + ONE[(W>>(32-2))&0x3];  // sign-apply bytes 1/3
                        acc += tmp;                         // *1 baseline term
            tmp<<=1;    acc += tmp ^ M[(W>>(32-2))&0x3];    // *2 term
            tmp<<=1;    acc += tmp ^ M[(W>>(32-4))&0x3];    // *4 term
            tmp<<=1;    acc += tmp ^ M[(W>>(32-6))&0x3];    // *8 term
            W  <<=8;
            in >>=8;    // repeat for bytes 0/2
        }
    }
    // fold the two 16-bit half-sums together and truncate to 12 bits
    return (acc + (acc>>16))&0xfff;
}

Just a prototype. It needs to be tested.
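
One way to test it: compare against a straightforward scalar reference (a hypothetical harness; the MSB-first nibble-to-byte pairing below is an assumption, so checking vmac8x4 directly would also need the matching interleaved repacking):

    #include <stdint.h>

    // Scalar reference: 8 activation bytes times 8 SDDD weight nibbles,
    // each nibble decoding to ±(2*m + 1) as in the loop at the top.
    static int32_t vmac_ref(const uint8_t act[8], uint32_t W) {
        int32_t acc = 0;
        for (int i = 0; i < 8; i++) {
            uint32_t nib = (W >> (28 - 4 * i)) & 0xF;
            int32_t w = 2 * (int32_t)(nib & 7) + 1;
            acc += ((nib & 8) ? -w : w) * (int32_t)act[i];
        }
        return acc;
    }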

cpldcpu commented 6 months ago

> Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.

Indeed, see here. https://cpldcpu.wordpress.com/2024/05/01/decapsulating-the-ch32v203-reveals-a-separate-flash-die/

Quite surprised to see this even in a $0.30 MCU. But the CH32V003 is monolithic.

cpldcpu commented 6 months ago

> I achieved branchless linear code by processing two activations simultaneously in GCC:

Nice! I thought about this but never got around to actually doing it. I wonder how well RV32EC deals with the lookup tables?

Btw, I don't know whether you saw the FP1.3.0 encoding: https://github.com/cpldcpu/BitNetMCU/blob/e730e598b994379426927259ad8ccf22c7311ce7/BitNetMCU_inference.c#L129

Need to write up something about it. I almost managed to get it to the same accuracy as the 4 bit encoding. It's insane that it works so well.

This may work even better when parallelized.
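
For anyone following along, the FP1.3.0 inner step written out in scalar C (a sketch inferred from the linked code and the assembly below, reusing the variable names from the loop at the top; not the repo's verbatim code): each 4-bit weight is one sign bit plus a 3-bit exponent, so the multiply reduces to a shift.

    int32_t in = *activations++;
    uint32_t nib = ((uint32_t)weightChunk >> 28) & 0xF;  // S EEE: sign + exponent
    int32_t tmp = in << (nib & 7);                       // |w| = 2^E, E = 0..7
    sum += (nib & 8) ? -tmp : tmp;                       // w = ±2^E
    weightChunk <<= 4;                                   // next nibble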

kimstik commented 6 months ago

The nonlinear FP1.3.0 is nice, but running it in parallel doesn't look straightforward yet. It's strange that you didn't give it much attention, as it uses only 15 out of 16 states: there are shifts of +0/-0 present.

2-bit parallel multiplication should be very interesting.

cpldcpu commented 6 months ago

> It's strange that you didn't give it much attention, as it uses only 15 out of 16 states: there are shifts of +0/-0 present.

For an exponent of zero, the sign encodes +1 and -1, so all states are used. The entropy of the weights is still not optimal; after training, not all codes are used by the model in all layers.
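
Concretely, the 16 code points are ±2^E for E = 0..7, with no duplicate zero:

    +1  +2  +4  +8  +16  +32  +64  +128
    -1  -2  -4  -8  -16  -32  -64  -128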

kimstik commented 6 months ago

8 instructions per iteration of FP1.3.0 is so beautiful:

.L28:
    slli    t1,a4,8
    sll a2,s1,a2
    srai    a1,t1,28
    lb  s1,2(a3)
    add a2,a2,s0
    andi    a1,a1,7
    bge t1,zero,.L29
    neg s1,s1
.L29:
    slli    s0,a4,12
    sll a1,s1,a1
    srai    t1,s0,28
    add a1,a1,a2
    lb  s1,3(a3)
    andi    a2,t1,7
    bge s0,zero,.L30
    neg s1,s1
.L30:
    slli    s0,a4,16
    sll a2,s1,a2
    srai    t1,s0,28
    lb  s1,4(a3)
    add a2,a2,a1
    andi    t1,t1,7
    bge s0,zero,.L31
    neg s1,s1
.L31:

kimstik commented 3 months ago

I managed to get a 4-in-1 activation convolution working with 5-bit input quantization. However, it provides almost no performance benefit ;((( It still takes around 106 instructions for 8 inputs.

cpldcpu commented 3 months ago

It should be possible to parallelize 4-bit quants with mul? 2 or 4 at a time?
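
For the 2-at-a-time case, a minimal sketch of the classic packed-multiply trick (hypothetical code, assuming unsigned 4-bit values; signed weights would need an offset or sign handling on top): pack the two activations 16 bits apart, pack the two weights in swapped order, and one 32-bit multiply leaves a0*w0 + a1*w1 in the upper half.

    #include <stdint.h>

    // With a = a0 | (a1 << 16) and w = w1 | (w0 << 16), the product is
    //   a0*w1 + ((a0*w0 + a1*w1) << 16) + (a1*w0 << 32).
    // For 4-bit operands both 16-bit fields stay below 2^16 (max 225 and
    // 450), so no carry crosses into the upper half and bits 16..31 hold
    // the exact two-element dot product.
    static inline uint32_t mac2_u4(uint32_t a, uint32_t w) {
        return (a * w) >> 16;
    }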

kimstik commented 3 months ago

Yes, with mul it should be much simpler. Would you like to switch to a chip with mul?

cpldcpu commented 3 months ago

The successors of the CH32V003 (V002 etc) will all have mul.

kimstik commented 3 months ago

Ouch! I missed that! Is the V2C mul 1-cycle? Hopefully the next successor will add Zcmp.

cpldcpu commented 3 months ago

I think it is 1 cycle. They only added the mul instruction, nothing else.

cpldcpu commented 3 months ago

It should be possible to implement SIMD processing, for example 6-bit activations and 2-bit weights with 4 in parallel. The weights could be unpacked with an 8-bit table lookup into 32 bits.
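
A minimal sketch of that unpacking step (table name and layout are assumptions; the 2-bit codes map to the symmetric -3/-1/1/3 set discussed above):

    #include <stdint.h>

    // 256-entry LUT: one byte carries four 2-bit weight codes; each entry
    // expands them into four signed 8-bit lanes of a 32-bit word.
    static uint32_t w2_lut[256];

    static void init_w2_lut(void) {
        static const int8_t decode[4] = { -3, -1, 1, 3 };
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t word = 0;
            for (uint32_t lane = 0; lane < 4; lane++) {
                uint32_t code = (i >> (2 * lane)) & 3;
                word |= (uint32_t)(uint8_t)decode[code] << (8 * lane);
            }
            w2_lut[i] = word;
        }
    }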