-
### System Info
I've converted Llama 3 using TensorRT-LLM's convert_checkpoint script, and am serving it with the inflight_batcher_llm template. I'm trying to get diverse samples for a fixed input,…
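For context, diversity usually comes down to the sampling inputs the template exposes. A minimal sketch, assuming the default `ensemble` model name and the standard inflight_batcher_llm request fields (`temperature`, `top_k`, `top_p`, `random_seed`); adjust names to the actual deployment:
```
# Sketch: non-greedy sampling request against Triton's generate endpoint.
# Model name and field names follow the default inflight_batcher_llm
# template; they are assumptions about this particular setup.
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "Tell me a story",
  "max_tokens": 64,
  "temperature": 0.8,
  "top_k": 50,
  "top_p": 0.95,
  "random_seed": 1234
}'
```
Varying (or omitting) `random_seed` per request is what makes repeated calls differ; with greedy settings (temperature 0, top_k 1) every request returns the same tokens regardless of seed.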
-
### Motivation
When we use LMDeploy for serving, although throughput is also a concern, **more emphasis is placed on throughput under latency constraints at different QPS levels**. This is a performance m…
-
/kind bug
I created a [ClusterServingRuntime](https://github.com/supertetelman/nim-kserve/blob/main/runtimes/24.01-nim_llm.yaml) that looks like this:
```
apiVersion: serving.kserve.io/v1alpha1…
-
/kind feature
**Describe the solution you'd like**
Currently it is not possible to specify the path at which the downloaded model should be made available in the model server container. The downloaded model…
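To make the request concrete, a sketch of what such a knob could look like; the `storagePath` field below is hypothetical and does not exist in KServe today, which is exactly the gap this issue describes:
```
# Hypothetical sketch only: "storagePath" is NOT an existing KServe field;
# it illustrates the requested control over the download location.
cat <<'EOF' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-llm                        # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: s3://models/example-llm
      storagePath: /opt/models/example-llm # hypothetical field
EOF
```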
-
Hi team,
I am working on the NIM deployment on Amazon EKS pattern. Ref: https://github.com/awslabs/data-on-eks/issues/560
I tried to deploy the NIM container with the Helm chart, and I am using a shared st…
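A rough sketch of the shape of such a deployment, under loud assumptions: the chart path and the `persistence` values keys below are illustrative and may not match the actual NIM chart version in use:
```
# Assumption-laden sketch: values keys vary across NIM chart versions.
cat <<'EOF' > nim-values.yaml
persistence:                          # illustrative key
  enabled: true
  existingClaim: shared-model-cache   # pre-created RWX PVC (e.g. EFS)
EOF
helm upgrade --install nim-llm ./nim-llm -f nim-values.yaml
```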
-
### Your current environment
```
Collecting environment information...
Traceback (most recent call last):
  File "/home/yangzhiyu/workspace/open-long-agent/collect_env.py", line 721, in <module>
    main()…
-
I'm running KServe serving with arena, using the following command:
```
arena serve kserve \
--name=qwen \
--image=vllm/vllm-openai:0.4.1 \
--gpus=1 \
--cpu=4 \
--memory=20Gi…
-
Followed the installation of vLLM via this [link](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/vLLM_quickstart.html).
Tried running via Docker too; here is the [image](https://hub.docke…
-
What are the resource requirements of the deployed model? Explain the resources defined for the model pod; see the sketch after these questions.
What is the throughput of the model? How can we increase the throughput?
Given a combinat…
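On the resources question, a minimal sketch of the stanza a GPU-backed model pod typically defines; the image, name, and numbers are illustrative, not a recommendation:
```
# Illustrative only: resource requests/limits for a GPU model pod.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: model-server             # illustrative name
spec:
  containers:
  - name: server
    image: vllm/vllm-openai:0.4.2
    resources:
      requests:
        cpu: "4"
        memory: 20Gi
        nvidia.com/gpu: 1        # GPU requests must equal limits
      limits:
        cpu: "4"
        memory: 20Gi
        nvidia.com/gpu: 1
EOF
```
GPUs are extended resources, so `nvidia.com/gpu` cannot be oversubscribed; throughput gains therefore come from adding replicas behind the service or giving each replica more GPUs and a larger batch size.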
-
### Your current environment
docker image: vllm/vllm-openai:0.4.2
Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ
GPUs: RTX8000 * 2
### 🐛 Describe the bug
The model works f…
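For reference, a launch sketch under the stated setup; the flags are standard vLLM 0.4.x options, and `--dtype half` is an assumption here because RTX 8000 (Turing) lacks bfloat16 support:
```
# Sketch: serve the GPTQ checkpoint across both RTX 8000s.
python -m vllm.entrypoints.openai.api_server \
  --model alpindale/c4ai-command-r-plus-GPTQ \
  --quantization gptq \
  --tensor-parallel-size 2 \
  --dtype half   # Turing GPUs lack bfloat16
```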