-
intelanalytics/ipex-llm-serving-cpu:latest
-
Wrote the code according to the following example at https://distilabel.argilla.io/latest/sections/how_to_guides/advanced/serving_an_llm_for_reuse/#serving-llms-using-vllm:
```
from distilabel.llms im…
```
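For reference, a minimal sketch of what the completed script might look like, assuming a vLLM server already running with its OpenAI-compatible API on localhost:8000 and distilabel's `OpenAILLM` client; the model name, URL, and prompt are placeholders:
```
# A minimal sketch, assuming a vLLM server is already serving an
# OpenAI-compatible API at http://localhost:8000/v1. Model name and
# prompt are placeholders, not from the original report.
from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the served model
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not check the key by default
)
llm.load()

# generate() takes chat-formatted conversations.
result = llm.generate(inputs=[[{"role": "user", "content": "Say hello."}]])
print(result)
```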
-
Opening this issue to collect information on whether there is a good reason to add TensorRT as a serving backend.
https://github.com/NVIDIA/TensorRT-LLM/issues/334
-
One important (and non-trivial) aspect of running model servers today is ensuring they can scale horizontally in response to load. Traditional CPU/memory-based autoscaling is not suff…
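For illustration (not from the original issue), a minimal sketch of exporting a load signal such as queue depth, which an autoscaler could consume instead of CPU/memory, using `prometheus_client`; all names here are hypothetical:
```
# A minimal sketch of exposing a custom load signal for autoscaling.
# Metric and function names are illustrative, not from the original issue.
import time
from prometheus_client import Gauge, start_http_server

# Requests currently waiting for a model worker (hypothetical metric).
QUEUE_DEPTH = Gauge("model_server_queue_depth", "Requests waiting for a model worker")

def on_request_enqueued():
    QUEUE_DEPTH.inc()

def on_request_started():
    QUEUE_DEPTH.dec()

if __name__ == "__main__":
    start_http_server(9090)  # serves /metrics for Prometheus to scrape
    while True:
        time.sleep(60)
```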
-
Hi there, I've been following this work for a few months and find it a really amazing idea to run LLMs over the Internet. I'm also trying to improve Petals' performance on model inference in…
-
Hi there,
Thank you for bringing the elegant RAG Assessment framework to the community.
I am an AI engineer from Alibaba Cloud, and our team has been fine-tuning LLM-as-a-Judge models based on t…
-
## Description
vLLM sampling parameters include a [richer set of values](https://github.com/vllm-project/vllm/blob/c9b45adeeb0e5b2f597d1687e0b8f24167602395/vllm/sampling_params.py), among which `lo…
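For context, a minimal sketch of vLLM's offline API exercising a few of those parameters; the model name and values are illustrative:
```
# A minimal sketch of vLLM's SamplingParams; model and values are
# placeholders chosen for illustration.
from vllm import LLM, SamplingParams

params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    logprobs=5,        # return log-probabilities for the top tokens
    max_tokens=128,
)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```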
-
I hit a core dump when decoding with multiple threads. It crashed in the Rust function `tokenizers_decode` (rust/src/lib.rs:199); here is the core backtrace.
Why doesn't it support multi-threading? I think dec…
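A common workaround, sketched below under the assumption that the Python `tokenizers` bindings are in use, is to give each thread its own tokenizer instance rather than sharing one; the model name is a placeholder:
```
# A minimal sketch (not from the original issue): avoid sharing one
# tokenizer across threads by keeping a per-thread instance in
# threading.local(). Model name is a placeholder.
import threading
from concurrent.futures import ThreadPoolExecutor
from tokenizers import Tokenizer

_tls = threading.local()

def get_tokenizer() -> Tokenizer:
    # Lazily create one Tokenizer per thread.
    if not hasattr(_tls, "tok"):
        _tls.tok = Tokenizer.from_pretrained("bert-base-uncased")
    return _tls.tok

def decode(ids):
    return get_tokenizer().decode(ids)

with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(decode, [[101, 7592, 102]] * 4)))
```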
-
## 🐛 Bug Report
**🔎 Describe the Bug**
I have a FastAPI (uvicorn) server which serves multiple concurrent requests. In each call, I am using …
-
Hello,
As in #3, I've tried reproducing the `demo.py` benchmark on an H100 and an A6000, and I'm also seeing no speedup on these platforms at lower precisions.
It was mentioned this is du…