cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

#1 performance requirement #83

Open cmp-nct opened 1 year ago

cmp-nct commented 1 year ago

I'm stuck with other work. I recently pushed a half-finished branch containing a ton of fixes and changes, but it's not complete. I also moved from falcon_main to "ggfalcon", which is meant to replace the main example (and the other examples later on) with API support.

The really big improvement, which I have not been able to complete yet, is calculating the KV mul-mat operations on CUDA. Broadcasting of the first tensor is required (which basically means repeating it 128 times per batched token, so -b 100 would cause 12,800 multiplications run sequentially, two times; though in a single-GPU environment there might be more parallelism behind it than it looks).
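For reference, cuBLAS can express that kind of broadcast without materializing the repeated tensor: cublasGemmStridedBatchedEx accepts a stride of 0 for one operand, which reuses the same matrix for every batch entry. A minimal sketch, assuming hypothetical shapes (head_dim, n_past, n_tokens, n_heads are stand-ins, not the actual Falcon tensor layout) and fp16 inputs for brevity:

```cpp
// Sketch: one strided-batched GEMM replacing n_heads sequential mul-mats.
// strideA = 0 broadcasts the shared K slice across all batch entries,
// so nothing has to be physically repeated 128x in memory.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void kv_mulmat_broadcast(cublasHandle_t handle,
                         const half *K,   // shared [head_dim x n_past], reused by every head
                         const half *Q,   // [head_dim x n_tokens] per head, n_heads entries
                         float *KQ,       // [n_past x n_tokens] per head, n_heads entries
                         int head_dim, int n_past, int n_tokens, int n_heads) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmStridedBatchedEx(
        handle, CUBLAS_OP_T, CUBLAS_OP_N,
        n_past, n_tokens, head_dim,
        &alpha,
        K,  CUDA_R_16F, head_dim, /*strideA=*/0,                 // broadcast: same K for each head
        Q,  CUDA_R_16F, head_dim, (long long)head_dim * n_tokens,
        &beta,
        KQ, CUDA_R_32F, n_past,   (long long)n_past * n_tokens,
        n_heads,
        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```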

We do have cuBLAS 8-bit support in that branch! It is very fast (though not faster than the quantized multiplication, which is the default). The branch also supports changing the matmul method on demand (cuBLAS 8/16/32-bit, quantized, CPU), so it's easy to test and switch.
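The on-demand switch is essentially a dispatch on a method enum. A rough illustration of the idea; all names here are hypothetical, not the identifiers used in the branch:

```cpp
// Hypothetical sketch of a runtime-switchable matmul backend.
enum class MulMatMethod { CUBLAS_8BIT, CUBLAS_16BIT, CUBLAS_32BIT, QUANTIZED, CPU };

struct MulMatConfig {
    MulMatMethod method = MulMatMethod::QUANTIZED;  // quantized multiplication is the default
};

// Dispatcher: each case would call into the corresponding kernel/BLAS path.
void mul_mat(const MulMatConfig &cfg /*, tensors ... */) {
    switch (cfg.method) {
        case MulMatMethod::CUBLAS_8BIT:  /* int8 cuBLAS GEMM        */ break;
        case MulMatMethod::CUBLAS_16BIT: /* fp16 cuBLAS GEMM        */ break;
        case MulMatMethod::CUBLAS_32BIT: /* fp32 cuBLAS GEMM        */ break;
        case MulMatMethod::QUANTIZED:    /* custom quantized kernel */ break;
        case MulMatMethod::CPU:          /* ggml CPU fallback       */ break;
    }
}
```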

What I believe should be done is broadcasting plus batched cuBLAS in 8 bit for the two KV-cache multiplications. That should bring an enormous boost in performance.
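Combining the two ideas, the same strided-batched call can run in 8 bit: int8 inputs accumulating into int32 (CUDA_R_8I / CUBLAS_COMPUTE_32I). A hedged sketch with the same hypothetical shapes as above; note that cuBLAS imposes extra alignment and transpose restrictions on int8 GEMM, and the real routine would also need to handle the quantization scales:

```cpp
// Sketch: 8-bit strided-batched GEMM for the KV mul-mats (hypothetical shapes).
// int8 inputs accumulate into int32; dequantization scales would be applied afterwards.
#include <cublas_v2.h>
#include <cstdint>

void kv_mulmat_int8(cublasHandle_t handle,
                    const int8_t *K, const int8_t *Q, int32_t *KQ,
                    int head_dim, int n_past, int n_tokens, int n_heads) {
    const int32_t alpha = 1, beta = 0;
    // Caveat: int8 GEMM in cuBLAS has alignment/layout constraints
    // (see the cublasGemmEx documentation); this ignores them for brevity.
    cublasGemmStridedBatchedEx(
        handle, CUBLAS_OP_T, CUBLAS_OP_N,
        n_past, n_tokens, head_dim,
        &alpha,
        K,  CUDA_R_8I,  head_dim, /*strideA=*/0,                 // broadcast shared K
        Q,  CUDA_R_8I,  head_dim, (long long)head_dim * n_tokens,
        &beta,
        KQ, CUDA_R_32I, n_past,   (long long)n_past * n_tokens,
        n_heads,
        CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
}
```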

Potential roadblock: the current mul-mat routine in the CUDA code is not usable for that. It would loop tens of thousands of times for batched broadcasted processing, and that cannot be used to feed into batched cuBLAS; non-batched cuBLAS is also useless that way. I did some dry tests (broadcasting the input without aligning the output properly) and the slowdown compared to CPU was huge. But that can be solved, likely with a dedicated routine.

Is there anyone here with CUDA/cuBLAS experience who'd like to give that a try?