ggerganov / llama.cpp

ggml : move LLAMAFILE/tinyBLAS into a backend #10183

Open ggerganov opened 1 week ago

ggerganov commented 1 week ago

The LLAMAFILE SGEMM routines are currently called directly from within ggml-cpu.c based on compile-time conditionals:

https://github.com/ggerganov/llama.cpp/blob/a9e8a9a0306a8093eef93b0022d9f45510490072/ggml/src/ggml-cpu.c#L7454-L7481
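
In outline, that call site looks something like this (a simplified sketch, not the verbatim code; most parameters are elided):

```cpp
// Simplified sketch of the compile-time coupling inside the generic
// mul_mat path of ggml-cpu.c (parameters elided for brevity).
#if GGML_USE_LLAMAFILE
    // try the tinyBLAS kernel first; if it cannot handle this combination of
    // src0/src1/dst types, fall through to the generic ggml-cpu code below
    if (llamafile_sgemm(/* m, n, k, A, lda, B, ldb, C, ldc,
                           thread index, thread count, tensor types ... */)) {
        return;
    }
#endif
    // ... generic ggml-cpu mul_mat implementation ...
```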

In order to simplify the logic and reduce the coupling of the different BLAS implementations, the LLAMAFILE code should be moved into a ggml backend, similar to the other BLAS implementations.

Not sure if it has to be a new backend, or if we can move it into the existing ggml-blas backend - TBD.
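
If it went into the existing ggml-blas backend, the dispatch could look roughly like this (a hypothetical sketch; blas_mul_mat and try_llamafile_sgemm are placeholder names, not actual ggml-blas code):

```cpp
// Hypothetical sketch of the second option: tinyBLAS dispatch inside the
// existing ggml-blas backend. Function names below are placeholders.
static void blas_mul_mat(struct ggml_tensor * dst) {
    const struct ggml_tensor * src0 = dst->src[0];
    const struct ggml_tensor * src1 = dst->src[1];

    // try tinyBLAS first (it also handles quantized src0 types)
    if (try_llamafile_sgemm(src0, src1, dst)) {
        return;
    }

    // otherwise fall back to the regular BLAS path, e.g. an F32 GEMM
    // cblas_sgemm(...);
}
```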

Djip007 commented 3 days ago

I think I'll have a try at putting it in a new backend...

It does not use the standard sgemm API, nor does it support threading internally.

slaren commented 3 days ago

It may be better to keep it in the CPU backend to avoid the overhead of stopping and starting the threads that happens when switching to a different backend.

Djip007 commented 3 days ago

I'll have a look, but I don't think the threads have to be (or are?) started/stopped. They can be left in the thread pool.

I have created a backend that only computes matmul for an FP8 test and uses OpenMP threads internally, and it was even faster than tinyBLAS. And that is already the case for the other BLAS backends.
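
As a minimal illustration of the threads-inside-the-backend idea (not the FP8 backend itself; just a plain F32 matmul parallelized with OpenMP):

```cpp
#include <cstdint>

// Minimal sketch: a matmul kernel that manages its own parallelism with
// OpenMP instead of relying on the ggml CPU thread pool. Plain F32 GEMM
// for clarity; build with -fopenmp.
static void omp_mul_mat_f32(const float * A, const float * B, float * C,
                            int64_t m, int64_t n, int64_t k) {
    // C[m x n] = A[m x k] * B[k x n]; rows of C are split across OpenMP threads
    #pragma omp parallel for schedule(static)
    for (int64_t i = 0; i < m; ++i) {
        for (int64_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int64_t p = 0; p < k; ++p) {
                sum += A[i*k + p] * B[p*n + j];
            }
            C[i*n + j] = sum;
        }
    }
}
```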

Never mind, I'll give it a try to see how hard it is, and if I succeed we can benchmark it. If it causes a slowdown, we can think about a ggml thread API to avoid it.

Update: I have looked at ggml_graph_compute and https://github.com/ggerganov/llama.cpp/pull/1999 ... I need more time to get a complete view of the threading part. My first impression is that maybe we should move the thread provisioning out of the CPU backend and make it usable by other backends, but I haven't spent much time analyzing it yet.

Update: On the CPU "backend" (at least), BLAS and AMX only compute part of the graph and have their own thread management.
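
That partial-offload behavior comes from the backend reporting support only for the ops it can run, roughly like this (a sketch; the function name is a placeholder, GGML_OP_MUL_MAT is the real op enum):

```cpp
// Sketch: a BLAS/AMX-style backend claims only the ops it can accelerate,
// so the scheduler leaves the rest of the graph on the CPU backend.
// The function name is a placeholder.
static bool matmul_only_supports_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            // real backends also check src/dst types and sizes here
            return true;
        default:
            return false;
    }
}
```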