-
### Is your enhancement related to a problem? Please describe
It seems openweb-ui can be integrated through a container; it would be good to prototype this.
### Describe the solution you'd like
Replace current…
-
Triton Inference Server r24.07 and model_analyzer 1.42.0
config.pbtxt
```
backend: "python"
max_batch_size: 32
input [
{
name: "IN0"
data_type: TYPE_STRING
dims: [ 16 ]
}
]…
```
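For reference, a minimal `model.py` sketch of the kind of Python backend a config like the one above would point at is shown below. The output name `OUT0` and the echo logic are illustrative assumptions only, since the rest of the config (including its outputs) is truncated.

```python
# model.py - a minimal sketch of the kind of Python backend this config points at.
# "OUT0" and the echo logic are illustrative assumptions; the real outputs are
# not visible in the truncated config above.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # TYPE_STRING inputs arrive as a NumPy array of bytes objects,
            # shaped [batch, 16] once batching is applied.
            in0 = pb_utils.get_input_tensor_by_name(request, "IN0").as_numpy()
            # Echo the strings back; replace with the real per-request logic.
            out0 = pb_utils.Tensor("OUT0", in0.astype(np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses
```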
-
I'm wondering whether there is a plan to deploy on the ANE (Apple Neural Engine).
https://machinelearning.apple.com/research/neural-engine-transformers
This year at WWDC 2022, Apple is making available an open-source referenc…
-
Good day everyone. I am trying to run the llama agentic system on an RTX 4090 with FP8 quantization for the inference model and meta-llama/Llama-Guard-3-8B-INT8 for the guard. With sufficiently small max_seq_…
-
I am experiencing an inference speed slowdown when running our test scripts, either with the library alone or through our server. The slowdown usually starts after about half an hour.
### My System
- Int…
-
### System Info
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-…
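One way to get a more readable version of this kind of parse error is to run the config through protobuf's text format parser directly, outside Triton. A minimal sketch, assuming `tritonclient[grpc]` is installed and exposes the generated `model_config_pb2` module (the path below is a placeholder):

```python
# validate_config.py - a sketch for reproducing config.pbtxt parse errors
# outside Triton. Assumes tritonclient[grpc] ships the generated
# model_config_pb2 module; the config path is a placeholder.
import sys

from google.protobuf import text_format
from tritonclient.grpc import model_config_pb2


def validate(path: str) -> None:
    config = model_config_pb2.ModelConfig()
    with open(path) as f:
        # Raises text_format.ParseError with a line/column that points at the
        # offending field, which is easier to read than the server-side log.
        text_format.Parse(f.read(), config)
    print(f"{path}: parsed OK (backend={config.backend!r}, platform={config.platform!r})")


if __name__ == "__main__":
    validate(sys.argv[1] if len(sys.argv) > 1 else "config.pbtxt")
```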
-
If a piper HTTP server comes under heavy load, GPU memory usage can spike by multiple GB and remain high until the server is stopped. Sometimes requests can hit OOM errors if memory usage increases t…
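Not from the report, but one common mitigation is to cap how many syntheses can hold the GPU at once, so a burst of requests queues instead of each allocating its own working memory. A minimal sketch; the wrapper and limit below are hypothetical, not piper's actual server code:

```python
# A mitigation sketch, not piper's actual server code: bound concurrent
# syntheses so a burst of requests queues instead of each allocating its own
# chunk of GPU working memory.
import threading

MAX_CONCURRENT_SYNTHESES = 2  # hypothetical limit; tune to the GPU's headroom
_gpu_slots = threading.BoundedSemaphore(MAX_CONCURRENT_SYNTHESES)


def synthesize_guarded(synthesize, text: str) -> bytes:
    """Wrap an existing `synthesize(text) -> bytes` callable with a GPU slot."""
    with _gpu_slots:
        return synthesize(text)
```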
-
## Problem
Currently, at least in my experience, it is rare for the app to correctly recognize most words on the first try, even under noise-free conditions. Subsequent cleaning-up of the text could …
-
With larger models, like Mistral-Large, the UI client I am using (for example Typing Mind) loses its connection to the endpoint, but generation continues in the background and doesn't…
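Not part of the report, but a streaming request is one way to keep bytes flowing during long generations and to give the server a disconnect signal it can act on. A minimal sketch, assuming the backend exposes an OpenAI-compatible chat completions endpoint; the URL and model id are placeholders:

```python
# A reproduction sketch, assuming an OpenAI-compatible /v1/chat/completions
# endpoint; the URL and model id are placeholders, not taken from the report.
import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # placeholder endpoint
    json={
        "model": "mistral-large",  # placeholder model id
        "messages": [{"role": "user", "content": "Write a long story."}],
        "stream": True,
    },
    stream=True,
    timeout=(10, 600),  # (connect, read); the read timeout must cover token gaps
)
try:
    for line in resp.iter_lines():
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data: "):])
            print(chunk["choices"][0]["delta"].get("content") or "", end="", flush=True)
finally:
    resp.close()  # closing the stream is the signal the server can use to cancel
```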
-
Hello! Thanks for your nice work. I am trying to run the FSOD evaluation demo on the COCO dataset, but the inference phase is quite slow on a single 4090 GPU. The evaluation of 5000 im…
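Not from the original report, but a quick way to see whether the time goes into the model or into data loading is to time the two separately. A minimal sketch, assuming a PyTorch model and an iterable dataloader; all names are placeholders, not the FSOD repo's actual API:

```python
# A diagnostic sketch with placeholder names (not the FSOD repo's actual API):
# time data loading separately from the GPU forward pass to see which dominates.
import time
import torch


def profile_eval(model, data_loader, device="cuda", max_batches=100):
    model.eval().to(device)
    load_time = fwd_time = 0.0
    t_prev = time.perf_counter()
    with torch.no_grad():
        for i, batch in enumerate(data_loader):
            load_time += time.perf_counter() - t_prev
            t0 = time.perf_counter()
            model(batch)                    # forward pass only
            torch.cuda.synchronize(device)  # wait for queued GPU work to finish
            fwd_time += time.perf_counter() - t0
            t_prev = time.perf_counter()
            if i + 1 >= max_batches:
                break
    print(f"data loading: {load_time:.1f}s, forward: {fwd_time:.1f}s over {i + 1} batches")
```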