-
### Motivation
Recently, Tsinghua University published a survey on LLM inference acceleration that compares TensorRT-LLM and LMDeploy under AWQ. According to the results, **LMDeploy has a higher speed-up…
-
Hi,
Could you please provide a guide on integrating the DeepSpeed approach for multi-GPU Intel Flex 140 to run model inference behind a FastAPI and uvicorn setup?
model id: 'meta-llama/Llama-2-7…
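For context, this is roughly the shape of the setup I have in mind: a minimal sketch, assuming DeepSpeed's `init_inference` with tensor parallelism across the two tiles of the Flex 140 and a FastAPI app served by uvicorn. The full model id, the Intel XPU specifics, and the launch command are assumptions on my side, not a verified recipe:

```python
# Sketch only: FastAPI + uvicorn front-end over a DeepSpeed-initialised
# Hugging Face model. Intel Flex 140 / XPU specifics are assumptions here.
import torch
import deepspeed
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # assumed full model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

# Shard the model across 2 devices (the Flex 140 exposes two GPU tiles).
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},
    dtype=torch.float16,
    replace_with_kernel_inject=False,  # kernel injection may not support XPU
)

app = FastAPI()

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: Prompt):
    inputs = tokenizer(req.text, return_tensors="pt").to(engine.module.device)
    with torch.no_grad():
        out = engine.module.generate(
            inputs.input_ids, max_new_tokens=req.max_new_tokens
        )
    return {"completion": tokenizer.decode(out[0], skip_special_tokens=True)}

if __name__ == "__main__":
    # Presumably launched via the multi-process DeepSpeed launcher, e.g.
    # `deepspeed --num_gpus 2 server.py`; in that case only rank 0 should
    # bind the HTTP port.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```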
-
Hi there,
I am wondering what hardware Ray uses for serving in this llmperf leaderboard. Is it CPU or GPU? If it is GPU, which model?
Thanks,
Fizzbb
-
### 🚀 The feature, motivation and pitch
This library, https://github.com/mit-han-lab/qserve, introduces a number of innovations. Most important is the W4A8KV4 quantization, which the paper (htt…
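For context, W4A8KV4 means 4-bit weights, 8-bit activations, and a 4-bit KV cache. Below is a toy sketch of the weight/activation part using plain symmetric uniform quantization; it is only meant to illustrate the bit-widths involved, not QServe's actual kernels, grouping, or memory layout:

```python
# Toy illustration of W4A8-style quantization (4-bit weights, 8-bit activations).
# NOT QServe's implementation; just shows what the bit-widths mean numerically.
import torch

def quantize_symmetric(x: torch.Tensor, bits: int, dim=None):
    """Symmetric uniform quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                               # 7 for 4-bit, 127 for 8-bit
    if dim is None:
        scale = x.abs().max() / qmax                          # per-tensor scale
    else:
        scale = x.abs().amax(dim=dim, keepdim=True) / qmax    # per-channel scale
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

# 4-bit weights (per output channel), 8-bit activations (per tensor)
w = torch.randn(4096, 4096)
a = torch.randn(16, 4096)
w_q, w_scale = quantize_symmetric(w, bits=4, dim=1)
a_q, a_scale = quantize_symmetric(a, bits=8)

# Integer matmul followed by rescaling approximates the fp16/fp32 matmul
y = (a_q @ w_q.t()) * a_scale * w_scale.t()
print((a @ w.t() - y).abs().mean())  # mean quantization error
```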
-
### System Info
x86_64
Ubuntu 20.04
A100x8
TRT-LLM version v0.9.0
### Who can help?
_No response_
### Information
- [X] The official example scripts
- [ ] My own modified scripts
…
-
Hey,
Currently, Ollama saves models in a local cache. To maintain different versions of LLMs or finetuned ones, and also for extensive monitoring, it would be a good idea to provide integration with M…
-
Hi, I am trying to run vLLM serving for the neural-chat model using https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/vLLM-Serving. However, I am facing this issue:
![image](htt…
-
/kind feature
**Describe the solution you'd like**
Please add [https://github.com/xorbitsai/inference](https://github.com/xorbitsai/inference) as a KServe Hugging Face LLM serving runtime.
Xor…