This is an old one... can you bench with at least the last published release, v0.8.13? (PS: no need to rebuild, just grab it from the releases page and give it your model: -m zzz.gguf / -m zzz.llamafile.)
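A minimal sketch of what that looks like, assuming the prebuilt binary from the releases page (the file names below are placeholders, not from this thread):

# make the prebuilt release binary executable and point it at an existing model; no rebuild needed
chmod +x ./llamafile-0.8.13
./llamafile-0.8.13 -m zzz.gguf    # or: -m zzz.llamafile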
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.08 MiB
llm_load_tensors: CPU buffer size = 762.81 MiB
.......................................................
Also, in the llamafile case everything is computed on the CPU, not on the GPU (Metal). I don't know whether it can use the GPU on that old release.
@Djip007 My bad. I had done the above tests with the latest release, v0.8.13, but while filling in the GitHub issue I left the version from the issue's default template. I apologise for the error; I have corrected it now.
Also, in the llamafile case everything is computed on the CPU, not on the GPU (Metal). I don't know whether it can use the GPU on that old release.
I have given the number of GPU layers to be offloaded as 17:
llamafile/bin/llamafile-bench -m llamafiles/llama-3.2-1b-q4_k_m.llamafile -ngl 17 -n 1024 -p 512 --verbose
Does this mean -ngl is not working as expected?
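(For reference, the llama.cpp numbers being compared against came from llama-bench; an equivalent invocation would look roughly like the following, with the model path assumed:)

# assumed model path; flags mirror the llamafile-bench run above
llama-bench -m models/llama-3.2-1b-q4_k_m.gguf -ngl 17 -p 512 -n 1024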
Does this mean -ngl is not working as expected?
Oh yes, if I am right... llamafile-bench only does CPU benchmarks for now. (But llamafile itself did support the GPU... maybe with some "bugs" in v0.8.13: https://github.com/Mozilla-Ocho/llamafile/pull/534.)
llamafile-bench currently only supports CPU. I can put up a branch that will enable GPU support tomorrow.
The fix in #534 should resolve the issue of GPU performance being slower than llama.cpp.
llamafile-bench currently only supports CPU. I can put up a branch that will enable GPU support tomorrow.
That would be nice!
Able to get correct results now. Thanks!
What happened?
For llama.cpp, I downloaded the q4_k_m quantized model and used llama-bench. For ollama, I pulled the q4_k_m model from ollama. By running the model with the --verbose flag, I manually recorded the prompt eval rate over 10 trials with the same prompt of approximately 512 tokens. For llamafile, I used the same model as for llama.cpp, created a llamafile, and benchmarked it with llamafile-bench.

llama-bench logs:
ollama logs:
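(For context, the ollama prompt eval rates were recorded by running the model interactively with --verbose, which prints timing stats after each response; a rough sketch, with the model tag being an assumption:)

# the model tag is a placeholder; --verbose prints prompt eval rate and eval rate after each reply
ollama run llama3.2:1b --verbose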
Version
llamafile v0.8.13

What operating system are you seeing the problem on?
Mac
Relevant log output