-
### Description
TypeScript autocompletion and type resolution are not working in WebStorm on an ejected theme
### CodeSandbox/Snack link
_No response_
### Steps to reproduce
See new comments in this…
-
**Description**
When building from source, the build fails if the tensorrt_llm backend is chosen.
**Triton Information**
What version of Triton are you using? r24.04
Are you using the Triton co…
-
### Description
```shell
Docker: nvcr.io/nvidia/tritonserver:23.04-py3
GPU: A100
```
How can I stop bi-directional streaming (decoupled mode)?
- I want to stop model inference (streaming response) when …
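For reference, one client-side way to end a decoupled stream with the Python gRPC client is sketched below; the model name, input tensor, and stop condition are placeholders rather than anything from this issue, and note that closing the stream in 23.04 only tears down the connection — to my knowledge, server-side request cancellation arrived in later Triton releases, so verify that detail against the release notes.

```python
# Hedged sketch: stopping a decoupled (bi-directional streaming) inference from
# the client side with tritonclient's gRPC API. "my_decoupled_model" and the
# INPUT tensor are placeholders, not taken from the issue.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def callback(result, error):
    # Every streamed response (or error) from the decoupled model lands here.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=callback)

inp = grpcclient.InferInput("INPUT", [1], "INT32")
inp.set_data_from_numpy(np.array([42], dtype=np.int32))
client.async_stream_infer(model_name="my_decoupled_model", inputs=[inp])

# Read responses until the client decides it has seen enough, then close the
# stream. In 23.04 this drops the connection but does not cancel work the
# backend has already queued.
first_response = responses.get()
client.stop_stream()
```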
-
### Your current environment
```text
tiktoken==0.6.0
transformers==4.38.1
tokenizers==0.15.2
vLLM Version: 0.4.3
fastchat Version: 0.2.36
```
### 🐛 Describe the bug
Currently, I'm using fa…
-
```
root@ttogpu:~# kubectl describe pod triton-inference-server-5b6c7f889c-f54c6
Name: triton-inference-server-5b6c7f889c-f54c6
Namespace: default
Priority: 0
Service …
-
## Description
I have two different modules, each converted to TRT. When I run them serially, the inference-only cost times are:
```
//10 times
do_infer >> cost 400.60 msec. //warm-up
do_infer >> cost 42.22 …
-
Here is the development roadmap for 2024 Q4. Contributions and feedback are welcome ([**Join Bi-weekly Development Meeting**](https://t.co/4BFjCLnVHq)). Previous 2024 Q3 roadmap can be found in #634.
…
-
Hi, dear NJU-Jet,
My Linux server has several 2.6 GHz CPUs and several V100s. I ran **generate_tflite.py** to get a quantized model,
and then in the **evaluate** function I added the code below to measu…
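Since the pasted code is cut off, here is a minimal sketch of how such a measurement is commonly written with `tf.lite.Interpreter`; the model path, random input, and run count are placeholder assumptions, not the author's actual code.

```python
# Hedged timing sketch for a quantized TFLite model; "model_quant.tflite" and
# the random input are placeholders.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.rand(*inp["shape"]).astype(inp["dtype"])

# One warm-up run so one-time allocation cost is not included in the average.
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()

runs = 10
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print(f"average inference time: {elapsed_ms:.2f} ms")
```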
-
**Is your feature request related to a problem? Please describe.**
I am asking for the recommended way to achieve the following behavior.
SCENARIO: I have many different models. Consider them differen…
-
### Describe the issue
Can I run "python -m vllm.entrypoints.openai.api_server" to load MInference capabilities in vLLM?
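For context, the MInference project documents an offline vLLM usage pattern along the lines of the sketch below; the `MInference("vllm", ...)` patching interface and the model name are assumptions to verify against the MInference README, and whether the OpenAI-compatible `api_server` entrypoint picks up such a patch without modification is exactly the open question here.

```python
# Hedged sketch of patching a vLLM offline engine with MInference; the
# MInference("vllm", ...) interface and the model name are assumptions based on
# the project's documentation, not a confirmed answer about api_server.
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model
llm = LLM(model_name, enforce_eager=True, max_model_len=131072)

# Apply the MInference sparse-attention patch to the vLLM engine.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```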