-
Hello developers,
I was using cake to deploy the distributed Llama 3 8B Instruct model across 2 GPUs and got the error below:
```
CUDA_VISIBLE_DEVICES=0 ./target/release/cake-cli --model ~/.cache/hugging…
```
-
Things left to do after merging #300
# OpenAI
It seems to work ok with OpenAI in my limited testing.
# OpenRouter
I tested it with some models via OpenRouter and noticed it sometimes gets …
-
### Proposal to improve performance
_No response_
### Report of performance regression
_No response_
### Misc discussion on performance
---
**Setup Summary for vLLM Benchmarking with Llama…
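Since the setup summary above is truncated, here is a minimal offline throughput sketch using vLLM's Python API (`LLM` and `SamplingParams`); the model name, prompt batch, and sampling settings are illustrative assumptions, not the original configuration:

```python
import time
from vllm import LLM, SamplingParams

# Model and parallelism settings are assumptions for illustration.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=1)
params = SamplingParams(temperature=0.0, max_tokens=128)

# A small synthetic batch; a real benchmark would vary prompt/output lengths.
prompts = ["Summarize the benefits of paged attention."] * 32

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {len(prompts)} prompts")
```

Counting generated tokens per wall-clock second like this gives a rough decode-throughput number; serving benchmarks would measure per-request latency against a running server instead.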
-
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md)…
-
## Description
[Cross-region inference (CRI)](https://docs.aws.amazon.com/bedrock/latest/userguide/cross-region-inference.html) allows requests to be automatically routed within any set of region…
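For illustration, a minimal sketch of opting into CRI from `boto3`: the request carries a geography-prefixed inference profile ID (the `us.` prefix below) instead of a plain model ID, and Bedrock routes it within that geography. The specific profile ID and region here are assumptions, not part of the original description:

```python
import boto3

# The "us." prefix on the model ID is what opts the request into
# cross-region inference; the profile ID below is illustrative.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [{"text": "Hello from CRI"}]}],
)
print(response["output"]["message"]["content"][0]["text"])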
-
We noticed that the current Triton decoding kernel is very slow on long contexts. This is because it lacks a flash-decoding-style optimization.
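For intuition, here is a minimal PyTorch sketch of the flash-decoding idea: split the KV cache along the sequence axis, attend to each chunk independently, and merge the partial outputs using their log-sum-exps. This is a reference illustration of the technique, not the Triton kernel in question:

```python
import torch

def flash_decode_attention(q, k, v, num_splits=4):
    """Split-KV attention for a single decode-step query token.

    q: (heads, dim); k, v: (seq, heads, dim). Each KV chunk yields a
    partial output plus its log-sum-exp; the partials are then merged.
    """
    heads, dim = q.shape
    scale = dim ** -0.5

    partial_out, partial_lse = [], []
    for kc, vc in zip(k.chunk(num_splits, dim=0), v.chunk(num_splits, dim=0)):
        scores = torch.einsum("hd,shd->hs", q, kc) * scale  # (heads, chunk)
        partial_lse.append(torch.logsumexp(scores, dim=-1))  # (heads,)
        probs = torch.softmax(scores, dim=-1)
        partial_out.append(torch.einsum("hs,shd->hd", probs, vc))

    out = torch.stack(partial_out)            # (splits, heads, dim)
    lse = torch.stack(partial_lse)            # (splits, heads)
    # Rescale each chunk's output by its share of the global softmax mass.
    weights = torch.softmax(lse, dim=0)       # (splits, heads)
    return (weights.unsqueeze(-1) * out).sum(dim=0)
```

The merged result matches plain softmax attention exactly; the long-context speedup comes from computing the chunks in parallel instead of scanning the whole KV cache in one sequential pass.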
## Reproduce
We test the decoding speed with a context length…
-
Hello.
It seems that the latest Ollama with IPEX-LLM (version `0.3.6`) is rather outdated nowadays.
It lacks proper support for new and popular models such as:
1) `Phi 3.5`
2) `Qwen 2.5`
3…
-
### System Info
requirements file:
```
transformers[torch]==4.44.2
onnxruntime
```
-
In this repo the Llama 3 tokenizer sets the `<|image|>` special token to `128011` https://github.com/meta-llama/llama-models/blob/ec6b56330258f6c544a6ca95c52a2aee09d8e3ca/models/llama3/api/tokenizer.py#L79-L101…
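For reference, one quick way to check what string a given ID maps to on the Hugging Face side (the repo name below is an assumption, and the meta-llama repos are gated):

```python
from transformers import AutoTokenizer

# Inspect which token string ID 128011 maps to in the HF tokenizer.
# Repo name is illustrative; gated meta-llama access is required.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tok.convert_ids_to_tokens(128011))
```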
-
### What happened?
llama.cpp produces garbled output with Qwen2.5-7b-f16.gguf on the 310P3.
### Name and Version
./build/bin/llama-cli -m Qwen2.5-7b-f16.gguf -p "who are you" -ngl 32 -fa
### What operating system are you seeing the …