-
Hi, can you share some performance data on MTK or Qualcomm chips, such as prefill and decode speeds for Qwen or Gemma models?
Thanks very much.
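In the meantime, here is a minimal sketch of how prefill and decode throughput can be measured with llama-cpp-python's eval/sample helpers; the model path and prompt are placeholders, and this is a rough measurement, not an official benchmark:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any local GGUF file (e.g. a Qwen or Gemma quant).
llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=False)

prompt = "Explain the difference between prefill and decode in one paragraph."
tokens = llm.tokenize(prompt.encode("utf-8"))

# Prefill: evaluate the whole prompt in one pass.
t0 = time.perf_counter()
llm.eval(tokens)
prefill = time.perf_counter() - t0
print(f"prefill: {len(tokens) / prefill:.1f} tok/s")

# Decode: sample and evaluate new tokens one at a time.
n_decode = 64
t0 = time.perf_counter()
for _ in range(n_decode):
    llm.eval([llm.sample()])
decode = time.perf_counter() - t0
print(f"decode: {n_decode / decode:.1f} tok/s")
```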
-
**Is your enhancement related to a problem? Please describe.**
Currently, the installation process does not allow specifying the CUDA version; the code is hardcoded to use the llama-box binary wi…
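To illustrate what version-aware selection could build on, here is a hypothetical sketch that reads the local CUDA toolkit version from `nvcc --version`; the selection step is made up for illustration and does not reflect the project's actual install code:

```python
import re
import subprocess

def detect_cuda_version() -> str | None:
    """Return the local CUDA toolkit version (e.g. '12.4'), or None if nvcc is missing."""
    try:
        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else None

cuda = detect_cuda_version()
# Hypothetical selection step: an installer could pick a binary flavour matching this version.
print(f"detected CUDA {cuda}" if cuda else "no CUDA toolkit found, falling back to a CPU binary")
```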
-
Hello guys,
I'm wondering about performance, which is very strange.
On the same server, I ran the same model with a query, and the loading time is totally different between llama-cpp-python and ll…
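When comparing load times like this, it can help to time just the model construction in isolation. A minimal sketch with llama-cpp-python, where the model path and n_gpu_layers value are placeholders that should mirror whatever the llama.cpp run used:

```python
import time
from llama_cpp import Llama

t0 = time.perf_counter()
# Placeholder path; set n_gpu_layers to match the llama.cpp configuration being compared.
llm = Llama(model_path="model.gguf", n_gpu_layers=-1, verbose=False)
print(f"load time: {time.perf_counter() - t0:.2f} s")

# A tiny warm-up call, so differences caused by lazy initialization show up as well.
t0 = time.perf_counter()
llm("Hello", max_tokens=8)
print(f"first completion: {time.perf_counter() - t0:.2f} s")
```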
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [X] I am running the latest code. Development is very rapid so there are no tagged versions as o…
-
### **Error code:**
RuntimeError Traceback (most recent call last)
----> 1 model.save_pretrained_gguf("model", tokenizer,)
1 fra…
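For reference, the failing call is Unsloth's GGUF export helper. A typical invocation, based on the public Unsloth examples (the model name, sequence length, and quantization method are placeholder choices), looks roughly like this:

```python
from unsloth import FastLanguageModel

# Load (or fine-tune) a model first; model name and sequence length are placeholders.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Export to GGUF so llama.cpp can load it; "q4_k_m" is one common quantization method.
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")
```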
-
llama.cpp dropped support for converting LoRA adapters to ggml; it would be very useful if we could use adapters with llama.cpp directly instead of fusing or merging them into the fine-tuned model.
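For comparison, a hedged sketch of runtime adapter loading via llama-cpp-python's lora_path argument, assuming the adapter is already in a format the library accepts (both paths are placeholders):

```python
from llama_cpp import Llama

# Placeholder paths: base model plus a separately converted LoRA adapter.
llm = Llama(
    model_path="base-model.gguf",
    lora_path="adapter.bin",
    n_ctx=2048,
)

# Quick check that the adapter-augmented model responds.
print(llm("Test prompt after applying the adapter:", max_tokens=32)["choices"][0]["text"])
```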
-
### What is the issue?
Arch Linux, Python 3.12
```
(ollama) ╭─hougelangley at Arch-Legion in ~/ollama on main✘✘✘ 24-05-24 - 11:00:23
╰─(ollama) ⠠⠵ pip install -r llm/llama.cpp/requirements.txt
Co…
-
Will using only CPU be faster than llama.cpp?
-
### Describe the bug
Inference fails after prompt evaluation with the llama-cpp backend, with the error:
```
CUDA error: invalid argument
current device: 1, in function ggml_backend_cuda_graph_compute …
-
Hi there,
I'm following these instructions to build llama.cpp from scratch:
https://github.com/ggerganov/llama.cpp#cublas
I'm running it on Ubuntu in WSL.
CPU inference works for me with no issue, but w…