LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

on linux, link against Openblas Parallel (e.g. for fedora 40) #1077

Open AndLLA opened 3 months ago

AndLLA commented 3 months ago

On recent Linux distros (e.g. Fedora 40), the parallel version of OpenBLAS has a "p" suffix ("-lopenblasp"), so linking against "-lopenblas" always uses the serial version.
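As a rough illustration of the selection logic, the sketch below prefers "-lopenblasp" when a libopenblasp.so is found in one of the given library directories and falls back to "-lopenblas" otherwise. The function name `pick_openblas_flag` and the directory list are assumptions made for this example; it is not the code from the patch.

```python
from pathlib import Path


def pick_openblas_flag(lib_dirs):
    """Return the linker flag for the parallel OpenBLAS if it is installed.

    lib_dirs is a hypothetical list of directories to search, for example
    ["/usr/lib64", "/usr/lib"]; adjust it for the target distro.
    """
    for directory in lib_dirs:
        # The parallel flavour is the one carrying the "p" suffix.
        if (Path(directory) / "libopenblasp.so").exists():
            return "-lopenblasp"
    # No parallel library found: fall back to the plain name.
    return "-lopenblas"
```

A build script could call this with the distro's library directories and pass the result to the linker.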

In addition, the patch prints the exact flavour of OpenBLAS in use at runtime:

ggml_backend_blas_init: openblas_get_parallel 1 
ggml_backend_blas_init: openblas_get_config OpenBLAS 0.3.26 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=128 
ggml_backend_blas_init: GGML_USE_OPENMP n/a

LostRuins commented 3 months ago

Is it actually faster?

AndLLA commented 3 months ago

I ran several tests, and the BLAS stage seems to scale with NumCores/2. For example, where a BLAS stage takes 3 mins with "openblas+serial", it takes 1 min with "openblas+parallel", using 6 cores/threads and a batch size of 512, running everything on the CPU.
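The arithmetic behind the NumCores/2 observation can be checked with a few lines. This is only a worked example using the figures quoted above (3 mins serial, 1 min parallel, 6 cores); the helper name `blas_speedup` is made up for illustration.

```python
def blas_speedup(serial_minutes, parallel_minutes):
    """Speedup factor of the parallel run over the serial run."""
    return serial_minutes / parallel_minutes


cores = 6
# 3 mins serial vs 1 min parallel, as reported for the BLAS stage.
print(blas_speedup(3, 1), cores / 2)  # prints "3.0 3.0"
```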

On the other hand, the speed increase for the inference stage is less noticeable (about 5%-10% faster).

Compared to "koboldcpp_default", the BLAS stage using "openblas+parallel" is 10%-20% faster.

P.S. openblas_set_num_threads is completely ignored by the "serial" OpenBLAS; it always uses one thread.

henk717 commented 2 months ago

I looked at conda, since we'd be implementing this in the CI, which is based on conda. The latest version of libopenblas ships openblasp, and apparently the regular openblas .so is symlinked to it. Are you sure Fedora isn't doing the same thing? I ask because our prebuilt binaries probably already use the parallel version.

It does mean we can probably drop-in replace the Windows .dll.

AndLLA commented 2 months ago

Hello, part of the patch introduces a log line that reports the flavour of OpenBLAS in use.

for example if the runtime uses the parallel flavour, the output will be something like this:

ggml_backend_blas_init: openblas_get_parallel 1
ggml_backend_blas_init: openblas_get_config OpenBLAS 0.3.26 DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=128

instead, if the runtime uses the non-parallel flavour, the output will be something like this:

ggml_backend_blas_init: openblas_get_parallel 0
ggml_backend_blas_init: openblas_get_config OpenBLAS 0.3.26 DYNAMIC_ARCH NO_AFFINITY Haswell SINGLE_THREADED
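To tell the two cases apart automatically, a log like the ones above could be scanned as in the sketch below. The helper name `openblas_flavour` is hypothetical; the mapping of 0, 1 and 2 follows the OPENBLAS_SEQUENTIAL, OPENBLAS_THREAD and OPENBLAS_OPENMP constants in OpenBLAS's cblas.h, and the line format is assumed to match the output shown in this thread.

```python
import re

# Values returned by openblas_get_parallel(), per OpenBLAS's cblas.h.
THREADING_MODELS = {0: "sequential", 1: "pthreads", 2: "OpenMP"}


def openblas_flavour(log_text):
    """Return the threading model reported in the log, or None if absent."""
    match = re.search(r"openblas_get_parallel\s+(\d+)", log_text)
    if match is None:
        return None
    # Unexpected values are reported as "unknown" rather than guessed.
    return THREADING_MODELS.get(int(match.group(1)), "unknown")
```

Fed the first log above it returns "pthreads", and fed the second it returns "sequential".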

On fc40 there isn't a symlink pointing to the parallel openblas by default. Here is what I see on the file system (I re-installed the latest rpm to make sure):

-rwxr-xr-x. 1 root root 40779408 Feb  9  2024 libopenblasp-r0.3.26.so
lrwxrwxrwx. 1 root root       23 Feb  9  2024 libopenblasp.so -> libopenblasp-r0.3.26.so
lrwxrwxrwx. 1 root root       23 Feb  9  2024 libopenblasp.so.0 -> libopenblasp-r0.3.26.so

-rwxr-xr-x. 1 root root 39286328 Feb  9  2024 libopenblas-r0.3.26.so
lrwxrwxrwx. 1 root root       22 Feb  9  2024 libopenblas.so -> libopenblas-r0.3.26.so
lrwxrwxrwx. 1 root root       22 Feb  9  2024 libopenblas.so.0 -> libopenblas-r0.3.26.so
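The same check can be scripted on any distro. The sketch below resolves libopenblas.so through its symlinks and reports whether it ends up at a libopenblasp file; the function name `resolves_to_parallel` and the default file name are assumptions for illustration, and the directory to inspect depends on the system.

```python
import os


def resolves_to_parallel(lib_dir, name="libopenblas.so"):
    """Return True if lib_dir/name resolves to a libopenblasp* file."""
    # realpath follows the whole symlink chain to the final file.
    target = os.path.realpath(os.path.join(lib_dir, name))
    return os.path.basename(target).startswith("libopenblasp")
```

With the listing above, libopenblas.so resolves to libopenblas-r0.3.26.so, so the function would return False for that directory.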

I don't know about Windows, but on Linux they are "drop-in" replaceable :)

Thanks