-
Hi, during request streaming it would be helpful to have a flag indicating the end of generation. Can you help with this feature request?
I believe that means returning the bool flag from https://github.…
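To make the request concrete, here is a minimal sketch (plain Python, every name hypothetical) of what a stream carrying an explicit end-of-generation flag could look like from the consumer's side:

```python
# Hypothetical sketch: each streamed chunk is paired with an explicit
# is_final flag instead of relying on the stream simply closing.
from typing import Iterator, Tuple

def stream_generation(prompt: str) -> Iterator[Tuple[str, bool]]:
    tokens = prompt.split()  # stand-in for real model output
    for i, token in enumerate(tokens):
        yield token, i == len(tokens) - 1  # True only on the last chunk

for chunk, is_final in stream_generation("an example streamed reply"):
    print(chunk, "<end>" if is_final else "")
```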
-
[SHARK](https://github.com/nod-ai/SHARK) is a high-performance codegen compiler and runtime built on MLIR, IREE, and custom RL-based tuning infrastructure. [Here](https://nod.ai/shark-the-fastest-runti…
-
**Is your feature request related to a problem? Please describe.**
As documented [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.htm…
-
ONNX version: 1.14.0
When I convert the weight file to .onnx with half=True and run inference on CPU, the inference speed is 1.5 times faster than the .pt model on my own computer (i7-12700).
Pr…
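For anyone trying to reproduce the comparison, a rough sketch of the export-and-time flow might look like the following; the model, input shape, and opset are placeholders, not details from this report:

```python
import time
import torch
import onnxruntime as ort

# Placeholder model standing in for the actual .pt weights
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU()
).half().eval()
dummy = torch.randn(1, 3, 640, 640).half()

# Export with FP16 weights (the half=True case) and run on CPU
torch.onnx.export(model, dummy, "model_fp16.onnx", opset_version=13)
sess = ort.InferenceSession("model_fp16.onnx",
                            providers=["CPUExecutionProvider"])
feed = {sess.get_inputs()[0].name: dummy.numpy()}

t0 = time.perf_counter()
sess.run(None, feed)
print(f"ONNX CPU latency: {time.perf_counter() - t0:.4f}s")
```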
-
## Problem
AIConfig currently couples conversation / multi-turn chat history to the config itself unnecessarily. It uses `remember_chat_context` and [extracts the conversation history from previous…
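As an illustration of the decoupling being asked for (names below are hypothetical, not AIConfig's actual API), the config would hold only model settings while the caller owns the history:

```python
# Hypothetical sketch: chat history lives with the caller, not in the config
config = {"model": "gpt-4", "temperature": 0.7}  # no remember_chat_context flag

history: list = []  # owned and persisted by the application

def run(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    reply = f"(model reply to: {prompt})"  # stand-in for a call using `config`
    history.append({"role": "assistant", "content": reply})
    return reply

print(run("hello"))
print(history)
```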
-
The last log lines show:
qanything-container-local | Triton服务正在启动,可能需要一段时间...你有时间去冲杯咖啡 :)
qanything-container-local | The triton service is starting up, it can be long... you have time to make a coffee :)
qanyth…
-
/kind feature
**Describe the solution you'd like**
We use KServe alongside KServe eventing to trigger an inference; we listen for an `io.kserve.inference.response` event to continue our wor…
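For context, a minimal event sink that filters for that event type might look like the sketch below; binary-mode CloudEvents carry the type in the `ce-type` header, and the port and handling here are assumptions:

```python
# Hypothetical sketch: an HTTP sink reacting to KServe response CloudEvents
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventSink(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.headers.get("ce-type") == "io.kserve.inference.response":
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)
            print("inference response event:", payload[:200])
        self.send_response(204)  # ack so the broker does not redeliver
        self.end_headers()

HTTPServer(("", 8080), EventSink).serve_forever()
```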
-
The `server.py` does not allow multiple text inputs to be sent. Will this capability be introduced? Is the underlying batching capability of the models being utilised during inference?
-
I am trying to run YOLOv7 on Triton (not the entire DeepStream). I have converted .pt -> .onnx -> .trt in yolov7. All these files work successfully during inference. But when I am trying to deploy wei…
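For reference, a TensorRT engine is normally served from a Triton model repository shaped like this, with the .trt engine renamed to `model.plan`:

```
model_repository/
└── yolov7/
    ├── config.pbtxt
    └── 1/
        └── model.plan   # the converted .trt engine, renamed
```

and a minimal `config.pbtxt` along these lines, where the tensor names and dims are assumptions that must match how the ONNX was exported:

```
name: "yolov7"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 25200, 85 ]
  }
]
```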
-
We would like to be able to deploy multiple versions of the same model. Unfortunately, they will not necessarily always have the same shapes and dtypes.
It would be great to have a per-version con…
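For context, assuming a Triton-style model repository, all versions of a model currently share a single `config.pbtxt`, which is exactly what makes differing shapes and dtypes per version awkward; the layout below is illustrative:

```
models/
└── my_model/
    ├── config.pbtxt     # one config shared by every version
    ├── 1/model.onnx     # e.g. input dims [ 3, 224, 224 ]
    └── 2/model.onnx     # different dims/dtypes cannot be declared separately
```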