-
### Proposal to improve performance
Improve bitsandbytes quantization inference speed
### Report of performance regression
I'm testing llama-3.2-1b on a toy dataset. For offline inference using the…
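These two threads point at the same measurement. A minimal sketch of timing offline inference with bitsandbytes quantization enabled in vLLM, assuming a Llama 3.2 1B checkpoint; the `quantization`/`load_format` flag spellings follow the vLLM docs but vary by version:
```python
# Hypothetical sketch: timing offline inference with bitsandbytes
# in-flight quantization in vLLM. Model name and flags are assumptions.
import time
from vllm import LLM, SamplingParams

prompts = ["Explain KV caching in one sentence."] * 32  # toy dataset
params = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    quantization="bitsandbytes",   # 4-bit in-flight quantization
    load_format="bitsandbytes",
)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens / elapsed:.1f} generated tokens/s")
```
Running the same script with the two quantization arguments removed gives the unquantized baseline for comparison.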
-
## 🐛 Bug Report
**🔎 Describe the Bug**
I have a FastAPI/uvicorn server which serves multiple concurrent requests. In each call, I am using …
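A minimal sketch of the setup being described, assuming the backend is an OpenAI-style completions endpoint; the route, payload, and model name are illustrative:
```python
# Hypothetical sketch of a FastAPI app serving concurrent requests by
# forwarding each call to a vLLM OpenAI-compatible backend.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# One shared async client (not per-request) so connections are pooled.
client = httpx.AsyncClient(base_url="http://localhost:8000")

class Query(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(query: Query):
    # awaiting keeps the event loop free for other concurrent calls
    resp = await client.post(
        "/v1/completions",
        json={"model": "llama-3.2-1b", "prompt": query.prompt, "max_tokens": 64},
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()
```
Run with `uvicorn app:app`; uvicorn multiplexes the concurrent requests over the single event loop.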
-
Validating the glm4-9b-chat model's output with the following request, the serving side throws an error:
curl --request POST \
--url http://127.0.0.1:8000/v1/chat/completions \
--header 'content-type: application/json' \
--data '{
"model": "glm-4-9…
-
Hi there,
I am wondering what hardware Ray uses for serving in this llmperf leaderboard. Is it CPU or GPU? If it is GPU, which model?
Thanks,
Fizzbb
-
### 🚀 The feature, motivation and pitch
This library https://github.com/mit-han-lab/qserve introduces a number of innovations. Most important is the W4A8KV4 quantization described in the paper (htt…
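For intuition about the naming, W4A8KV4 means 4-bit weights, 8-bit activations, and a 4-bit KV cache. A toy NumPy sketch of symmetric round-to-nearest quantization at those bit-widths (illustrative arithmetic only, not QServe's fused kernels):
```python
# Illustrative only: symmetric quantization at the W4A8KV4 bit-widths.
import numpy as np

def quantize(x: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1            # 7 for int4, 127 for int8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale       # int8 container even for 4-bit values

w = np.random.randn(128, 128).astype(np.float32)
qw, sw = quantize(w, bits=4)              # W4: 4-bit weights
a = np.random.randn(16, 128).astype(np.float32)
qa, sa = quantize(a, bits=8)              # A8: 8-bit activations

# Integer matmul with a single dequantization at the end
y = (qa.astype(np.int32) @ qw.T.astype(np.int32)) * (sa * sw)
print(np.abs(y - a @ w.T).max())          # quantization error vs. fp32
```
The paper's contribution is doing this with fused low-precision GPU kernels (including the 4-bit KV cache), not the arithmetic itself.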
-
Hi there,
Thank you for bringing the elegant RAG Assessment framework to the community.
I am an AI engineer from Alibaba Cloud, and our team has been fine-tuning LLM-as-a-Judge models based on t…
-
Hi @hadley, thanks for sharing this, really exciting.
Very nice to see support for open models via Ollama. I wonder if you would consider adding support for vLLM-hosted models as well, e.g. see ht…
-
**Is your feature request related to a problem? Please describe.**
Hello.
I tried to use Letta with vLLM serving the Qwen2.5 72B model. The model returned 2 tool calls, and Letta doesn't support this:
```
Response …
```
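A hedged sketch of what this looks like against an OpenAI-compatible vLLM endpoint, plus one possible client-side guard; the model name, tool schema, and keep-first-call workaround are all assumptions:
```python
# Sketch: detecting parallel tool calls in an OpenAI-compatible
# response and guarding a client that only supports one call.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="qwen2.5-72b-instruct",  # served model name assumed
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    }],
)

tool_calls = resp.choices[0].message.tool_calls or []
if len(tool_calls) > 1:
    # Parallel tool calls unsupported downstream: keep only the first.
    tool_calls = tool_calls[:1]
print([c.function.name for c in tool_calls])
```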
-
### Your current environment
This is the 0.5.0 environment.
### 🐛 Describe the bug
**1. The log files:**
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-package…
-
1. How many LLMs are needed for `setting`? Your paper [PaperQA: Retrieval-Augmented Generative Agent for Scientific Research](https://arxiv.org/pdf/2312.07559.pdf) seems to have employi…
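For what it's worth, in recent paper-qa releases the different LLM roles are configured on a single `Settings` object rather than passed separately; a hedged sketch, with field names taken from the paper-qa docs but possibly differing across versions:
```python
# Hedged sketch: configuring the distinct LLM roles in paper-qa.
# Model choices and the question are illustrative.
from paperqa import Settings, ask

settings = Settings(
    llm="gpt-4o",               # answer-generation model
    summary_llm="gpt-4o-mini",  # per-chunk evidence summarization model
)
answer = ask(
    "What manufacturing challenges are unique to bispecific antibodies?",
    settings=settings,
)
print(answer)
```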