Yes, the idea is to scale the numbers so that the distance between the first positive and first negative element is the same as the distance between the first and second.
E.g., the encoding for two bits is -3, -1, 1, 3 (or -1.5, -0.5, 0.5, 1.5) with a constant delta.
With two's complement the encoding would be asymmetric: -2, -1, 0, 1. With one's complement without the shift, you would get a double zero (-1, 0, 0, 1) or a double step around zero (-2, -1, 1, 2).
The encoding is more regular that way, and there are no redundant codes. The weights are centered symmetrically around zero for most layers, so you don't want to have a huge gap there.
The impact of this encoding on network capacity and loss is minimal, though. QAT is able to redistribute quantization errors fascinatingly well.
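To make the mapping concrete, a minimal sketch (my own illustration, not code from the repo) of the symmetric two-bit decode:

```c
// Symmetric 2-bit quantization: codes 0..3 map to the equidistant levels
// -3, -1, 1, 3 (value = 2*code - 3). No redundant code, no double zero.
static inline int decode2(unsigned code) {
    return 2 * (int)(code & 0x3) - 3;   // 0 -> -3, 1 -> -1, 2 -> 1, 3 -> 3
}
```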
Did you notice that Clang generates linear code without jumps (at the cost of a slight size increase)? They seem to mention static branch prediction. It would be interesting to finally find out how expensive conditional branches are on the QingKe RISC-V2A ...
I didn't check Clang specifically, but there were quite a few differences in code generation between optimization options. To be honest, I thought that the RV32EC code was a bit cumbersome. The only bit that can be checked without masking is the MSB.
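A small illustration of that point (my own sketch, assuming a standard GCC RV32 target): testing the sign bit needs no mask constant, since a signed compare against zero compiles to a single branch:

```c
#include <stdint.h>

// Returns nonzero if bit (31 - k) of w is set, without a mask constant:
// shift the bit into the sign position and test the sign. On RV32EC this
// compiles to slli plus a single branch-on-sign (bltz/bgez).
static inline int bit_from_top(uint32_t w, unsigned k) {
    return (int32_t)(w << k) < 0;
}
```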
In the end, the biggest impact on execution speed was moving the inference code to SRAM, though.
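For reference, a hedged sketch of how that placement is commonly done with GCC. The section name is an assumption: ".data" works with many embedded linker scripts because that section is copied to RAM at startup, but the right name depends on your toolchain/BSP:

```c
#include <stdint.h>

// Place a hot function in SRAM to avoid flash wait states. The section
// name ".data" is an assumption; check your linker script.
__attribute__((section(".data"), noinline))
int32_t inference_hot_loop(const int8_t *act, const uint32_t *w, int n);
```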
Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.
I achieved branchless linear code by processing two activations simultaneously in GCC:
```c
#include <stdint.h>

// W packs eight 4-bit weights (sign + three digit bits each), two per byte,
// bit-interleaved: (S1,D12,D11,D10),(S2,D22,D21,D20) -> (S1,S2,D10,D20,D11,D21,D12,D22)
static inline int vmac8x4(uint32_t *act, int32_t W) {
    // per-lane masks selected by a weight bit pair: none / low lane / high lane / both
    static const int M[4]   = { 0x00000000, 0x00000fff, 0x0fff0000, 0x0fff0fff };
    static const int ONE[4] = { 0x00000000, 0x00000001, 0x00010000, 0x00010001 }; // FIXME: perhaps we can get rid of this part of the sign application
    int acc = 0;
    for (uint32_t j = 0; j < 2; j++) {
        uint32_t in = *act++;
        for (uint32_t k = 0; k < 2; k++) {
            // apply the sign pair: XOR with the lane mask, then +1 (two's complement negation)
            int tmp = (M[(W >> (32 - 2)) & 0x3] ^ (in & 0x00ff00ff)) + ONE[(W >> (32 - 2)) & 0x3]; // process bytes 1/3
            acc += tmp; // implicit constant term: the value 0 is not encoded
            // doubled terms, conditionally complemented per digit-bit pair
            tmp <<= 1; acc += tmp ^ M[(W >> (32 - 4)) & 0x3];
            tmp <<= 1; acc += tmp ^ M[(W >> (32 - 6)) & 0x3];
            tmp <<= 1; acc += tmp ^ M[(W >> (32 - 8)) & 0x3];
            W <<= 8;
            in >>= 8; // repeat for bytes 0/2
        }
    }
    return (acc + (acc >> 16)) & 0xfff; // fold the two 16-bit lanes together
}
```
Just a prototype. It needs to be tested.
> Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.
Indeed, see here: https://cpldcpu.wordpress.com/2024/05/01/decapsulating-the-ch32v203-reveals-a-separate-flash-die/
Quite surprised to see this even in a $0.30 MCU. But the CH32V003 is monolithic.
> I achieved branchless linear code by processing two activations simultaneously in GCC:
Nice! I thought about this, but never got around to actually doing it. I wonder how well RV32EC deals with the lookup tables?
Btw, I don't know whether you saw the FP1.3.0 encoding: https://github.com/cpldcpu/BitNetMCU/blob/e730e598b994379426927259ad8ccf22c7311ce7/BitNetMCU_inference.c#L129
I need to write up something about it. I almost managed to get it to the same accuracy as the 4-bit encoding. It's insane that it works so well.
This may work even better when parallelized.
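For context, a minimal sketch of the FP1.3.0 idea, reconstructed from the linked inference code and the assembly further down (names and packing order are my assumptions, not the repo's exact code). Each 4-bit weight holds one sign bit and a 3-bit exponent, so the multiply reduces to a shift plus a conditional negation:

```c
#include <stdint.h>

// FP1.3.0: 4-bit weight = 1 sign bit + 3 exponent bits, value = +/-2^exp.
// The MAC needs no multiplier: shift the activation by the exponent and
// negate it when the sign bit is set.
int32_t dot_fp130(const int8_t *act, const uint32_t *weights, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++) {
        uint32_t w = (weights[i / 8] >> (28 - 4 * (i & 7))) & 0xf; // i-th nibble, MSB first
        int32_t a = act[i];
        if (w & 0x8)
            a = -a;                                   // sign bit: negate
        sum += (int32_t)((uint32_t)a << (w & 0x7));   // exponent: shift by 0..7
    }
    return sum;
}
```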
Nonlinear FP1.3.0 is nice, but running it in parallel doesn't look straightforward yet. It's strange that you didn't give it much attention: it uses only 15 out of 16 states, since there are shifts of +0/-0 present.
2-bit parallel multiplication should be very interesting.
> It's strange that you didn't give it much attention: it uses only 15 out of 16 states, since there are shifts of +0/-0 present.
For an exponent of zero, the sign encodes +1 and -1, so all states are used. The entropy of the weights is still not optimal, though: after training, not all codes are used by the model in all layers.
8 instructions per iteration of FP1.3.0 is so beautiful:
```asm
.L28:
    slli    t1,a4,8
    sll     a2,s1,a2
    srai    a1,t1,28
    lb      s1,2(a3)
    add     a2,a2,s0
    andi    a1,a1,7
    bge     t1,zero,.L29
    neg     s1,s1
.L29:
    slli    s0,a4,12
    sll     a1,s1,a1
    srai    t1,s0,28
    add     a1,a1,a2
    lb      s1,3(a3)
    andi    a2,t1,7
    bge     s0,zero,.L30
    neg     s1,s1
.L30:
    slli    s0,a4,16
    sll     a2,s1,a2
    srai    t1,s0,28
    lb      s1,4(a3)
    add     a2,a2,a1
    andi    t1,t1,7
    bge     s0,zero,.L31
    neg     s1,s1
.L31:
```
I managed to get a 4-in-1 activation convolution working with 5-bit input quantization. However, it provides almost no performance benefit ;((( It still takes around 106 instructions for 8 inputs.
It should be possible to parallelize 4-bit quants with mul? 2 or 4 at a time?
Yes, with mul it should be much simpler. Would you like to shift to a chip with mul?
The successors of the CH32V003 (V002 etc) will all have mul.
Oh! I missed that! Is the V2C mul 1-cycle? I hope the next successor will add Zcmp.
I think it is 1 cycle. They only added the mul instruction, nothing else.
It should be possible to implement SIMD processing, for example 6-bit activations and 2-bit weights, 4 in parallel. The weights could be unpacked with an 8-bit table lookup into 32 bits.
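A hedged sketch of how that could look (my own illustration under the stated assumptions, not the repo's code): two output channels per mul using 16-bit lanes; the same idea extends to four 8-bit lanes with tighter range limits. Activations are assumed non-negative (e.g. after ReLU), and the 2-bit weights are biased to keep the lanes non-negative:

```c
#include <stdint.h>

// SWAR MAC with one mul per two output channels. Weight codes 0..3 map to
// the symmetric levels {-3,-1,1,3}; storing them biased by +3 (i.e. 2*code,
// range 0..6) keeps both 16-bit lanes non-negative. The bias is corrected
// once at the end: sum(a*(w+3)) - 3*sum(a) = sum(a*w).
static void dot2_swar(const uint8_t *act, const uint8_t *w0, const uint8_t *w1,
                      int n, int32_t *out0, int32_t *out1) {
    uint32_t acc  = 0; // low lane: channel 0, high lane: channel 1
    uint32_t asum = 0; // running activation sum for the bias correction
    for (int i = 0; i < n; i++) {
        uint32_t a  = act[i];                                  // 6-bit, 0..63
        uint32_t b0 = 2u * ((w0[i / 4] >> (2 * (i & 3))) & 3); // biased weight, 0..6
        uint32_t b1 = 2u * ((w1[i / 4] >> (2 * (i & 3))) & 3);
        acc  += a * (b0 | (b1 << 16));                         // one mul, two MACs
        asum += a;
    }
    // the lanes stay below 2^16 for n up to ~170 with these value ranges
    *out0 = (int32_t)(acc & 0xffff) - 3 * (int32_t)asum;
    *out1 = (int32_t)(acc >> 16)    - 3 * (int32_t)asum;
}
```

The 8-bit table lookup mentioned above (256 x 32-bit entries) could unpack four weight codes at once instead of the shift-and-mask decode shown here.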
I have some concerns about the line of code 'sum += tmpsum;'. Is it perhaps redundant? If the weight element is in SDDD format, then the sum of three shifts should be sufficient. Here's the relevant code section:
upd: It appears that I overlooked the fact that the value 0 is not used...