-
Hugging Face hub login successful
Used the gemma2-27b LLM for testing:
```
cargo run --release -- -m "google/gemma-2-27b-it" -c
Finished release [optimized] target(s) in 0.03s
Running `target/re…
```
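For context, a minimal sketch of the hub login step (assuming the `huggingface_hub` Python client; the token value is a placeholder), which has to succeed before a gated model like google/gemma-2-27b-it can be downloaded:

```python
from huggingface_hub import login

# Gated repos such as google/gemma-2-27b-it require authentication before
# download; the token below is a placeholder, not a real credential.
login(token="hf_xxx")
```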
-
What is the maximum input token count for [bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2) and [bge-reranker-v2.5-gemma2-lightweight](https://huggingface.co/BAAI/bge-reranker-v2.5-gemma2-lightweight)?
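A minimal sketch to check this programmatically (assuming both repos expose a standard config with `max_position_embeddings`; `trust_remote_code` is needed for the reranker's custom modeling code):

```python
from transformers import AutoConfig

for repo in ("BAAI/bge-multilingual-gemma2",
             "BAAI/bge-reranker-v2.5-gemma2-lightweight"):
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    # Gemma-2-based models report their context window here.
    print(repo, cfg.max_position_embeddings)
```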
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as…
-
### Your current environment
- vLLM (CPU): v0.6.0
- Hardware: Intel(R) Xeon(R) Platinum 8480+ CPU
- Model: google/gemma-2-2b
### 🐛 Describe the bug
vLLM v0.6.0 (CPU) is throwing the below erro…
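Since the traceback is cut off, here is a minimal repro sketch (assuming the standard offline inference API; the prompt and sampling values are illustrative):

```python
from vllm import LLM, SamplingParams

# Assumed invocation on the CPU build of vLLM v0.6.0; the original error
# output is truncated above, so this only shows what triggers it.
llm = LLM(model="google/gemma-2-2b")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```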
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTo…
```
-
### What happened?
If you pass the `tfs_z` param to the server, it sometimes crashes.
Starting the server:
```
~/test/llama.cpp/llama-server -m /opt/models/text/gemma-2-27b-it-Q8_0.gguf --verbose
`…
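For reference, a hypothetical client call that exercises the parameter (the payload shape is an assumption based on llama-server's `/completion` endpoint; port 8080 is the default):

```python
import requests

# Send tfs_z to the llama-server instance started above.
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": "Hello", "n_predict": 16, "tfs_z": 0.95},
)
print(resp.status_code, resp.json())
```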
-
Hi.
I'm an early adopter of unsloth, and my recent experiments with the library produced unexpected latency results.
I followed the official notebooks and got the following results while fine-tuning…
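The results themselves are truncated above; for reference, a minimal latency-measurement sketch in the spirit of those notebooks (the checkpoint and sequence length are assumptions, not the exact notebook settings):

```python
import time
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",  # assumed checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable unsloth's fast inference path

inputs = tokenizer("Hello, world", return_tensors="pt").to(model.device)
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=64)
print(f"generation latency: {time.perf_counter() - start:.2f}s")
```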
-
### Model description
bge-reranker-v2.5-gemma2-lightweight performs better than bge-m3 :)
Please support this model.
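For illustration, a loading sketch (the repo ships custom modeling code, so `trust_remote_code` is required; the full lightweight-reranking scoring recipe is on the model card):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "BAAI/bge-reranker-v2.5-gemma2-lightweight"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
```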
### Open source status
- [ ] The model implementation is available
- [X] The …
-
### System Info
Ubuntu 22.04.4 LTS
python 3.10
transformers 4.43.0
cuda 12.0
torch 2.3.0
vllm 0.4.3
### Running Xinference with Docker?
- [ ] docker / docke…
-
### Feature request
Hi,
Is it possible to enable flash attention for PaliGemma models?
### Motivation
This feature is required to speed up inference with PaliGemma VLMs.
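A minimal sketch of what this could look like (requesting FlashAttention-2 via `attn_implementation`; the checkpoint is assumed for illustration, and whether PaliGemma accepts this flag depends on the installed transformers version):

```python
import torch
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224",  # assumed checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```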
### Your contribution
…