Add row-wise bias addition programming example

Adds a row-vector bias to an input matrix.
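
For reference, the operation can be sketched as a plain scalar loop. This is only an illustrative sketch, not the kernel added in this PR: the function name row_wise_bias_add_ref, the row-major layout, and the pointer-plus-dimensions signature are all assumptions made for the example.

```cpp
#include <cstddef>

// Hypothetical scalar reference (not the vectorized AIE kernel from this PR).
// Assumes `in` and `out` are row-major rows x cols matrices and `bias` has
// `cols` entries; computes out[r][c] = in[r][c] + bias[c].
void row_wise_bias_add_ref(const float* in, const float* bias, float* out,
                           std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // The same bias element is reused for every row in column c.
            out[r * cols + c] = in[r * cols + c] + bias[c];
        }
    }
}
```

A reference like this is mainly useful for checking the output of an optimized kernel on small inputs.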

I need a kernel like this for the llm.c GPT-2 implementation, and the generated kernel code happened to look decent: if I'm reading the listing below correctly, the inner loop could theoretically reach up to 50% utilization of the vector instruction slot (a VADD.f in two of its four instructions, NOPV in the other two). It should at least be a good starting point for further optimization.

.label ZLS_Frow_wise_bias_add_f32_f32_176
.loop_nesting 1
.begin_of_loop
.nohwbrkpt
.noswbrkpt
       176    0x05 0x95 0x00 0x00 0x02 0x05 0x95 0xc0 0x00 0x00 0x76 0x0b 0x68 0x06 0x00 0x24  NOPA;                         VLDB wh2, [p0, #32];          VST wl5, [p2], #64;           NOPX;                         VMOV bmh1, x2;                          VADD.f bml1, bmh1, bml0, r0
.nohwbrkpt
.noswbrkpt
       192    0x04 0x91 0x00 0x00 0x02 0x06 0x8f 0xc0 0x00 0x01 0x41 0x43 0x68 0x00 0x03 0xc0  NOPA;                         VLDB wl2, [p0], #64;          VST wh3, [p2, #32];           NOPX;                         VMOV x5, bmh2;                          NOPV
.nohwbrkpt
.noswbrkpt
       208    0x05 0xa5 0x00 0x00 0x02 0x05 0x8d 0xc0 0x00 0x00 0x96 0x13 0x68 0x08 0x02 0x54  NOPA;                         VLDB wh4, [p0, #32];          VST wl3, [p2], #64;           NOPX;                         VMOV bml2, x4;                          VADD.f bmh2, bml2, bmh0, r0
.label ZLE_Frow_wise_bias_add_f32_f32_224
.end_of_loop
.nohwbrkpt
.noswbrkpt
       224    0x04 0xa1 0x00 0x00 0x02 0x06 0x97 0xc0 0x00 0x00 0xc0 0x83 0x68 0x00 0x03 0xc0  NOPA;                         VLDB wl4, [p0], #64;          VST wh5, [p2, #32];           NOPX;                         VMOV x3, bml1;                          NOPV

Xilinx / mlir-aie

Add row-wise bias addition programming example #1596