-
Is it possible to increase the number of tokens sent per chunk during streaming, and if so, how?
This could also be done via triton-inference-server.
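If the server streams one decoded piece per response (the usual behavior of decoupled backends), one workaround is to re-chunk on the client side. Below is a minimal sketch using `tritonclient.grpc`; the model name `ensemble`, the tensor names `text_input`/`max_tokens`/`stream`/`text_output`, and the idle-timeout end-of-stream check are all assumptions for illustration, not taken from this issue.

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

CHUNK_TOKENS = 8  # forward text every 8 streamed pieces instead of one by one

responses = queue.Queue()

def callback(q, result, error):
    # The gRPC stream delivers results on a background thread.
    q.put(error if error else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=partial(callback, responses))

text = np.array([["Write a short story about GPUs."]], dtype=object)
max_tokens = np.array([[128]], dtype=np.int32)
stream_flag = np.array([[True]], dtype=bool)

inputs = [
    grpcclient.InferInput("text_input", list(text.shape), "BYTES"),
    grpcclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
    grpcclient.InferInput("stream", list(stream_flag.shape), "BOOL"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)
inputs[2].set_data_from_numpy(stream_flag)

client.async_stream_infer(model_name="ensemble", inputs=inputs, request_id="1")

buffer = []
while True:
    try:
        item = responses.get(timeout=10.0)  # crude end-of-stream detection
    except queue.Empty:
        break
    if isinstance(item, Exception):
        raise item
    buffer.append(item.as_numpy("text_output").flatten()[0].decode())
    if len(buffer) >= CHUNK_TOKENS:
        print("".join(buffer))  # one larger chunk goes downstream
        buffer.clear()

if buffer:
    print("".join(buffer))  # flush whatever is left
client.stop_stream()
```

This leaves per-token generation on the server untouched and only changes how often the client emits text, so the trade-off is purely chunk size versus time-to-first-chunk.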
-
### System Info
- GPU: H100
- Triton Server with TensorRT backend (v0.10.0)
- Launched on K8s. Docker container built using the [TensorRT builder](https://github.com/triton-inference-server/tensorrt…
-
### Search before asking
- [X] I have searched the Inference [issues](https://github.com/roboflow/inference/issues) and found no similar feature requests.
### Description
`DocTR` produces not only…
-
**Problem description**
Currently, Vantage6 only supports file/db mounts; however, our train/inference pipelines often need to store multiple checkpoint and result files with content that can't leave t…
-
## Description
I am moving from an A30 to an A40, so I needed to rebuild my ONNX model for the A40.
I rebuilt it using the same trtexec version, the same command, and the same model via the Docker image as I d…
-
**Description**
I used the latest image version 24.06 because the corresponding latest version of TRT supports BF16. But when I deployed the model with the TRT backend and used perf_analyzer to pressu…
-
My server has 8 GPUs and when running
```
python inference.py
```
It can load all models, but when given an image and a question as input, it raises an error:
RuntimeError: Expected all tensors to b…
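For context, this is almost certainly PyTorch's "Expected all tensors to be on the same device" error: on a multi-GPU box the model weights land on one GPU while the inputs are created on another. A minimal sketch of the usual fix, with a toy `nn.Linear` standing in for the real model (assumes at least two visible GPUs):

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model; the principle is the same.
model = nn.Linear(4, 2).to("cuda:0")

# Input accidentally created on a different GPU, which is easy to do on an
# 8-GPU host when code defaults to torch.cuda.current_device().
x = torch.randn(1, 4, device="cuda:1")

# Move the input to wherever the model's parameters actually live; mixing
# devices is what raises "Expected all tensors to be on the same device".
device = next(model.parameters()).device
x = x.to(device)

with torch.no_grad():
    y = model(x)
print(y.device)  # cuda:0
```

If the models are sharded across the 8 GPUs (e.g. with accelerate's `device_map="auto"`), the same diagnosis applies: print the device of each input and of `next(model.parameters())` and make them agree.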
-
Triton provides an extension to the standard gRPC inference API for streaming (`inference.GRPCInferenceService/ModelStreamInfer`); this extension is required to use the vLLM backend with Triton.
However …
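For reference, the Python client hides `ModelStreamInfer` behind `start_stream()` / `async_stream_infer()`. A minimal sketch against a vLLM-backend model; the model name `vllm_model` and the tensor names `text_input`, `stream`, `sampling_parameters`, and `text_output` are assumptions based on the vLLM backend's usual interface and may differ per deployment:

```python
import json
import queue

import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

client = grpcclient.InferenceServerClient(url="localhost:8001")
# start_stream() opens the bidirectional ModelStreamInfer RPC under the hood.
client.start_stream(callback=lambda result, error: responses.put(error or result))

def bytes_input(name, value):
    arr = np.array([value.encode()], dtype=object)
    tensor = grpcclient.InferInput(name, [1], "BYTES")
    tensor.set_data_from_numpy(arr)
    return tensor

stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
stream_flag.set_data_from_numpy(np.array([True], dtype=bool))

inputs = [
    bytes_input("text_input", "What is Triton?"),
    bytes_input("sampling_parameters", json.dumps({"max_tokens": 64})),
    stream_flag,
]

client.async_stream_infer(model_name="vllm_model", inputs=inputs)

# Read one streamed piece; in practice, keep draining the queue until the
# stream signals completion.
item = responses.get(timeout=30.0)
if isinstance(item, Exception):
    raise item
print(item.as_numpy("text_output").flatten()[0].decode())

client.stop_stream()
```

Because this is a single long-lived bidirectional RPC, any proxy or load balancer sitting between the client and Triton must support streaming gRPC for it to work.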
-
From the requirements doc:
**OOTB support for NVIDIA Triton Inference Server**
- We are going with OpenVINO for now, as Triton cannot currently be built due to maintenance concerns.
Acceptance criteria:
- Scope…
-
## Dart analysis issue
Bad state: [_variableIndex: 2][_variables.length: 2][variables: [Expression expression, List cases]][element.source: /Users/brianwilkerson/src/dart/sdk/sdk/pkg/kernel/lib/ast…