cztomsik opened 5 months ago
Howdy, I believe the frontend only needs to detect the GPU, VRAM, and CPU cores (if it is implemented as client-server rather than integrating llama.cpp statically). One could ship multiple llama backends compiled with different flags to use GPU and/or CPU optimizations. It would be more elegant to modularize llama so that GPU and AVX/AVX2/AVX-512 support are compiled into separate dynamic libraries, but I'm not sure whether the code structure makes this difficult.
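To make the dynamic-library idea concrete, here is a minimal sketch of the frontend side in Zig, assuming hypothetical backend names like `llama-cuda.dll` and `llama-avx2.dll`, and skipping real hardware detection (CPUID, VRAM queries): it just tries the most capable library first and resolves llama.cpp's C API from whichever one loads.

```zig
const std = @import("std");

pub fn main() !void {
    // Hypothetical backend names; the real split depends on how llama.cpp
    // ends up being modularized. Most capable first, plain CPU last.
    const candidates = [_][]const u8{
        "llama-cuda.dll",
        "llama-avx2.dll",
        "llama.dll",
    };

    var lib: ?std.DynLib = null;
    for (candidates) |name| {
        lib = std.DynLib.open(name) catch continue;
        std.debug.print("loaded {s}\n", .{name});
        break;
    }
    var backend = lib orelse return error.NoBackendFound;
    defer backend.close();

    // Every backend exports the same llama.cpp C API, so the frontend can
    // resolve symbols from whichever library loaded successfully.
    const print_system_info = backend.lookup(
        *const fn () callconv(.C) [*:0]const u8,
        "llama_print_system_info",
    ) orelse return error.SymbolNotFound;
    std.debug.print("{s}\n", .{print_system_info()});
}
```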
This will take some time, so this is just a rough sketch for later:
- `llama.dll`
  - `zig build` can download/extract a zip file from a url like `https://github.com/ggerganov/llama.cpp/releases/download/{short_rev}/llama-{short_rev}-bin-{blas}.zip` (a `build.zig` sketch follows this list)
  - `short_rev` is obtained from the `llama.cpp` git submodule
  - `blas` is something like `win-cuda-cu11.7.1-x64`, passed as `-Dblas=xxx` to `zig build`
  - the `.exe` and `.dll` should be marked as artifacts
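A rough `build.zig` sketch of the above, assuming Zig 0.11-era `std.Build` APIs; the executable name, the default `blas` value, and the hard-coded `short_rev` are placeholders, and shelling out to `curl`/`tar` (both ship with Windows 10+) is just one way to do the download/extract step:

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // -Dblas=win-cuda-cu11.7.1-x64 etc.; the default here is made up.
    const blas = b.option([]const u8, "blas", "llama.cpp binary flavor to download") orelse "win-avx2-x64";

    // In the real build this would be read from the llama.cpp submodule
    // (`git -C llama.cpp rev-parse --short HEAD`); hard-coded placeholder.
    const short_rev = "b1234";

    const url = b.fmt(
        "https://github.com/ggerganov/llama.cpp/releases/download/{s}/llama-{s}-bin-{s}.zip",
        .{ short_rev, short_rev, blas },
    );

    // One way to download/extract: shell out to curl and tar.
    const fetch = b.addSystemCommand(&.{ "curl", "-L", "-o", "llama.zip", url });
    const unzip = b.addSystemCommand(&.{ "tar", "-xf", "llama.zip" });
    unzip.step.dependOn(&fetch.step);

    const exe = b.addExecutable(.{
        .name = "app", // placeholder name
        .root_source_file = .{ .path = "src/main.zig" },
        .target = target,
        .optimize = optimize,
    });
    exe.step.dependOn(&unzip.step);

    // installArtifact/installBinFile put the .exe and .dll under zig-out/,
    // i.e. mark them as build artifacts.
    b.installArtifact(exe);
    b.installBinFile("llama.dll", "llama.dll");
}
```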
After this is done, we can make a Windows pipeline with a matrix for each BLAS, and hopefully we will get a `.zip` file which people can just download and run. Of course, they still need to have the given BLAS installed on their system.
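For the pipeline itself, a hypothetical GitHub Actions sketch; the matrix entries beyond `win-cuda-cu11.7.1-x64` and the third-party `setup-zig` action are assumptions:

```yaml
# .github/workflows/windows.yml (sketch) -- one job per BLAS flavor
name: windows
on: [push]
jobs:
  build:
    runs-on: windows-latest
    strategy:
      matrix:
        blas: [win-avx2-x64, win-openblas-x64, win-cuda-cu11.7.1-x64]
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true # needed to resolve short_rev from llama.cpp
      - uses: goto-bus-stop/setup-zig@v2
      - run: zig build -Dblas=${{ matrix.blas }}
      - uses: actions/upload-artifact@v3
        with:
          name: app-${{ matrix.blas }}
          path: zig-out/bin/
```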