I need a kernel like this for the llm.c GPT-2 implementation, and the kernel code itself ended up looking decent, somewhat by accident (if I'm reading what's below correctly, it could theoretically reach up to 50% efficiency of the vector instructions?). At the very least it should be a good starting point for further optimization.
Adds a row-vector bias to an input matrix.
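For reference, the operation itself (adding the same bias vector to every row of a matrix) can be sketched in plain C roughly as below. This is only a scalar baseline under my own assumptions, not the kernel from this PR: the name `add_row_bias`, the row-major layout, and the `float` element type are all hypothetical choices made for illustration.

```c
#include <stddef.h>

// Hypothetical reference routine (an assumption for illustration, not the PR's kernel).
// Computes out[r][c] = inp[r][c] + bias[c] for a row-major rows x cols matrix.
// out may be the same buffer as inp (in-place update).
void add_row_bias(float *out, const float *inp, const float *bias,
                  size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++) {
        const float *in_row = inp + r * cols;
        float *out_row = out + r * cols;
        // The inner loop walks contiguous memory with unit stride,
        // which makes it a candidate for compiler auto-vectorization.
        for (size_t c = 0; c < cols; c++) {
            out_row[c] = in_row[c] + bias[c];
        }
    }
}
```

A plain loop like this is mostly useful as a correctness check to compare an optimized kernel's output against.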