Open ggerganov opened 1 week ago
I think I'll have a go at putting it in a new backend... It does not use the standard sgemm API, nor does it manage threads internally.
It may be better to keep it in the CPU backend to avoid the overhead of stopping and starting the threads that happens when switching to a different backend.
I'll have a look, but I don't think the threads have to be started/stopped. They can be left in a thread pool.
I have created a backend that only computes matmul for FP8 testing and uses OpenMP threads internally; it was even faster than tinyBLAS. And that is also the case for the other BLAS backends.
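Roughly what "OpenMP threads inside" means in practice is sketched below; this is a minimal illustration with an invented name (sgemm_omp), not the actual FP8 test backend code:

```c
// Minimal sketch of an sgemm that manages its own threads with OpenMP, the
// way a matmul-only backend can do internally (in contrast to kernels driven
// by the CPU backend's ith/nth thread indices). Row-major, no tiling or
// vectorization; purely illustrative.
#include <stdint.h>

// C[m x n] = A[m x k] * B[k x n]
static void sgemm_omp(int64_t m, int64_t n, int64_t k,
                      const float * A, const float * B, float * C) {
    #pragma omp parallel for schedule(static)
    for (int64_t i = 0; i < m; ++i) {
        for (int64_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int64_t p = 0; p < k; ++p) {
                sum += A[i*k + p] * B[p*n + j];
            }
            C[i*n + j] = sum;
        }
    }
}
```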
Never mind, I'll give it a try to see how hard it is, and if I succeed we can benchmark it. If it introduces a slowdown, we can think about a ggml thread API to avoid it.
Update: I have looked at ggml_graph_compute and https://github.com/ggerganov/llama.cpp/pull/1999 ... I need more time to get a complete view of the threading part. My first impression is that maybe we should move the thread provisioning out of the CPU backend and make it usable by other backends, but I haven't spent much time analyzing it yet.
Update: On the CPU "backend" (at least), BLAS and AMX only compute part of the graph and have their own thread management.
The LLAMAFILE SGEMM routines are currently called directly from within ggml-cpu.c based on compile-time conditionals: https://github.com/ggerganov/llama.cpp/blob/a9e8a9a0306a8093eef93b0022d9f45510490072/ggml/src/ggml-cpu.c#L7454-L7481
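Roughly, the shape of that dispatch is as follows; this is a simplified sketch with placeholder helpers (fast_sgemm, generic_mul_mat), not the actual ggml-cpu.c code, which loops over the batch dimensions and passes strides, tensor types, and thread indices:

```c
// Illustrative sketch of the compile-time dispatch pattern at the call site:
// try the SGEMM fast path if it was compiled in, and fall back to the
// generic kernel otherwise (or if the fast path rejects the problem).
// Function names are placeholders for llamafile_sgemm / the ggml kernels.
#include <stdbool.h>

bool fast_sgemm(int m, int n, int k);      // placeholder for llamafile_sgemm
void generic_mul_mat(int m, int n, int k); // placeholder for the ggml kernels

void compute_mul_mat(int m, int n, int k) {
#if defined(GGML_USE_LLAMAFILE)
    // The SGEMM routine returns false when it cannot handle the given
    // shapes/types; in that case we fall through to the generic path.
    if (fast_sgemm(m, n, k)) {
        return;
    }
#endif
    generic_mul_mat(m, n, k);
}
```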
In order to simplify the logic and reduce the coupling of the different BLAS implementations, the LLAMAFILE code should be moved into a ggml backend, similar to the other BLAS implementations. Not sure if it has to be a new backend, or if we can move it into the existing ggml-blas backend - TBD.
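For illustration only, the structural idea of such a backend (with placeholder types and names, not the actual ggml backend API) is that it advertises which ops it supports and the scheduler routes only those ops to it, while everything else stays on the CPU backend - the same pattern the BLAS and AMX paths already follow:

```c
// Rough structural sketch with placeholder types/names (NOT the ggml API):
// a backend exposes "which ops do you support" plus "compute this node",
// and the scheduler offloads only the supported ops (here: mul_mat) to it.
#include <stdbool.h>

enum op { OP_MUL_MAT, OP_ADD, OP_SOFTMAX };

struct node { enum op op; /* tensors, shapes, ... */ };

struct backend_iface {
    bool (*supports_op)(const struct node * n);
    void (*compute)(const struct node * n);
};

static bool sgemm_supports_op(const struct node * n) {
    return n->op == OP_MUL_MAT;  // matmul-only backend
}

static void sgemm_compute(const struct node * n) {
    (void) n;  // the LLAMAFILE SGEMM routines would be called here
}

static const struct backend_iface sgemm_backend = {
    .supports_op = sgemm_supports_op,
    .compute     = sgemm_compute,
};
```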