-
Looking at the TensorRT 9.1.0 release, I am very happy to see the integration of OpenAI Triton with TensorRT plugins.
However, one limitation of this integration is that Python must be availabl…
-
Thanks to the FauxPilot community, I am conveniently running inference tasks with CodeGen models. Thank you again.
Additionally, I wonder whether it is possible to run multiple models on a single GPU.
Bel…
-
Here is the development roadmap for 2024 Q3. Contributions and feedback are welcome.
## Server API
- [ ] Add APIs for using the inference engine in a single script without launching a separate se…
-
We would like to be able to deploy multiple versions of the same model. Unfortunately, they will not always have the same shapes and dtypes.
It would be great to have a per-version con…
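As a purely hypothetical illustration of what this request asks for (this is not an existing feature, and the file names are assumptions), a per-version configuration could sit inside each version directory of the model repository, overriding the shared defaults:

```
model_repository/
  my_model/
    config.pbtxt        # shared defaults (name, backend, instance groups)
    1/
      config.pbtxt      # hypothetical per-version override: v1 shapes/dtypes
      model.onnx
    2/
      config.pbtxt      # hypothetical per-version override: v2 shapes/dtypes
      model.onnx
```

The server would then resolve a version's effective config by layering the version-local file over the model-level one, so versions with different input shapes or dtypes could coexist under one model name.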
-
https://github.com/ollama/ollama
https://github.com/abetlen/llama-cpp-python
https://github.com/vllm-project/vllm
-
I am trying to experiment with prompts, and I am unable to tell whether the system is picking up my changed prompts.
1. I have overwritten the "prompt_file" for my experiment (found by checking out…
-
/kind bug
**What steps did you take and what happened:**
When I import `mlserver` and `kserve` at the same time, they may register proto descriptors under the same file name, which conflict with each other, like:…
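This kind of clash can be reproduced directly with protobuf's descriptor pool: when two libraries each ship generated code for a proto file with the same name but different contents, the second registration is rejected. A minimal sketch (the file name `inference.proto` and the packages are assumptions, not the actual descriptors the two libraries register):

```python
from google.protobuf import descriptor_pb2, descriptor_pool

# A fresh pool stands in for the process-wide default pool that
# generated _pb2 modules register into at import time.
pool = descriptor_pool.DescriptorPool()

# First library registers its file under this name.
first = descriptor_pb2.FileDescriptorProto(name="inference.proto", package="mlserver")
pool.Add(first)

# Second library tries to register a *different* file under the same name.
second = descriptor_pb2.FileDescriptorProto(name="inference.proto", package="kserve")
try:
    pool.Add(second)
    conflicted = False
except Exception:
    # The exact exception type differs between the pure-Python and
    # C++/upb protobuf implementations, but both reject the duplicate.
    conflicted = True

print(conflicted)
```

Because both `_pb2` modules register into the same default pool at import time, importing the second package fails even if the application never touches the conflicting messages.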
-
### The bug
The Immich backup feature that uploads photos to the remote server causes the System Data storage on my iPhone 15 Pro Max (iOS 17.6.1) to fill up completely. This causes th…
-
We have a streaming service that uses gRPC over Unix sockets.
gRPC performs significantly better over Unix sockets than over a TCP port. I saw that you can only change the port in the Triton server…
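The performance difference comes from Unix domain sockets bypassing the TCP stack entirely for local IPC. A minimal stdlib sketch of that local round trip (the socket path is an assumption; this illustrates the transport the issue asks Triton's gRPC endpoint to expose via a socket path instead of a port):

```python
import os
import socket
import tempfile
import threading

# A throwaway filesystem path serves as the socket address.
path = os.path.join(tempfile.mkdtemp(), "echo.sock")

# Server side: a Unix-domain stream socket that echoes one message.
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(1)

def serve() -> None:
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024))  # echo the payload back
    conn.close()

t = threading.Thread(target=serve)
t.start()

# Client side: connect by path, not by host:port.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)
client.sendall(b"ping")
reply = client.recv(1024)
client.close()
t.join()
server.close()

print(reply)
```

gRPC clients can already address such sockets with the `unix://` target scheme; the request here is for the server to accept a socket path as its listening address.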
-
### Start Date
_No response_
### Implementation PR
_No response_
### Reference Issues
_No response_
### Summary
Serving fails starting from vllm 0.3.0.
### Basic Example
not supp…