-
/kind feature
**Describe the solution you'd like**
There are a few different directions this could take:
- extend existing API for referencing multiple …
-
### Issues Policy acknowledgement
- [X] I have read and agree to submit bug reports in accordance with the [issues policy](https://www.github.com/mlflow/mlflow/blob/master/ISSUE_POLICY.md)
### Where…
-
### Your current environment
The startup command is as follows: it launches both a standard 7B model and an n-gram speculative model. Speed tests show that the speculative model performs more slowl…
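Since the actual startup command is cut off above, here is a minimal illustrative sketch (not the reporter's command) of launching a 7B model with n-gram prompt-lookup speculation through vLLM's offline Python API. The model name is a placeholder, and the keyword arguments (`speculative_model="[ngram]"`, `num_speculative_tokens`, `ngram_prompt_lookup_max`, `use_v2_block_manager`) are assumed from older vLLM releases; newer releases group these under a `speculative_config` instead.
```python
# Minimal sketch, assuming an older vLLM release that exposes speculative
# decoding options directly on LLM(); newer releases use speculative_config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # hypothetical 7B base model
    speculative_model="[ngram]",            # n-gram prompt-lookup speculation
    num_speculative_tokens=5,               # draft tokens proposed per step
    ngram_prompt_lookup_max=4,              # max n-gram length matched in the prompt
    use_v2_block_manager=True,              # required by some older releases
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```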
-
I have an ensemble model.
Model 1 outputs 66 cropped images; model 1 is Python. I manually resized/padded them into 3 batches with shapes
(30, 3, 48, 320), (30, 3, 48, 976), (6, 3, 48, 1280)
(I …
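For reference, a minimal sketch of the resize/pad step described above, assuming NumPy and zero-padding each (3, 48, w) crop on the right to one of the fixed widths; the helper name and padding scheme are illustrative, not taken from the reporter's pipeline.
```python
import numpy as np

def pad_and_batch(crops, target_w):
    # Pad each (3, 48, w) crop with zeros on the right to width target_w,
    # then stack into a single (N, 3, 48, target_w) batch.
    batch = []
    for crop in crops:                      # crop: (3, 48, w), w <= target_w
        pad = target_w - crop.shape[-1]
        batch.append(np.pad(crop, ((0, 0), (0, 0), (0, pad))))
    return np.stack(batch)

# Example: 30 crops of varying widths padded to 320, matching (30, 3, 48, 320).
crops = [np.zeros((3, 48, np.random.randint(100, 321)), dtype=np.float32)
         for _ in range(30)]
batch = pad_and_batch(crops, 320)
assert batch.shape == (30, 3, 48, 320)
```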
-
### Description
**Description:**
When submitting any message in the chat, the request never completes and you get the errors "**failed to pipe response**" and "**ECONNRESET**".
**Environment:**
…
-
I was recently deploying Hugging Face models on the Triton Inference Server, which helped me increase my GPU utilization and serve multiple models on a single GPU.
I was not able to find good r…
-
**Describe the bug**
**Environment**
- GPUStack version: v0.2.0
- OS: Ubuntu 22.04
- GPU: Nvidia P40, T4, H800 (all can reproduce this issue)
**Steps to reproduce**
1. Install GP…
-
I'm trying to use Triton to deploy baichuan2-13B inference under bf16 precision. The tritonserver starts successfully, but it crashes when processing client requests.
- Use TensorRT-LLM v0…
-
### Search before asking
- [X] I have searched the Ultralytics YOLO [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/ultralytics/ultralytics/discussion…
-
### System Info
I am working on the benchmarking suite on the vLLM team, and am now trying to run TensorRT-LLM for comparison. I am relying on this GitHub repo (https://github.com/neuralmagic/tensorrt-demo)…