-
**Is your feature request related to a problem? Please describe.**
App SDK currently supports inference within the application process itself. This is simple and efficient for some use cases, though …
-
@wangg12 @shanice-l @Rainbowend @tzsombor95 I need your help.
The inference script runs successfully, without any errors, when executed as a standalone Python script. But when running it with ros2, i.e., …
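Since the excerpt is truncated here, it is not clear what changes when the script is launched through ros2. As a debugging aid, a minimal rclpy wrapper like the sketch below can help isolate whether the failure comes from the ROS 2 environment or from the inference code itself. The node name, topic names, message type, and the inference call are placeholders, not taken from the original report.

```python
import rclpy
from rclpy.node import Node
from std_msgs.msg import String  # placeholder message type


class InferenceNode(Node):
    def __init__(self):
        super().__init__("inference_node")
        # Hypothetical: load the model once at startup, not per callback.
        # self.model = load_model(...)
        self.sub = self.create_subscription(String, "input_topic", self.on_msg, 10)
        self.pub = self.create_publisher(String, "output_topic", 10)

    def on_msg(self, msg):
        # Hypothetical inference call; replace with real preprocessing + model.
        result = msg.data.upper()
        out = String()
        out.data = result
        self.pub.publish(out)


def main():
    rclpy.init()
    node = InferenceNode()
    rclpy.spin(node)
    node.destroy_node()
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```

If this stripped-down node runs cleanly under ros2 while the real script does not, the difference is likely in the environment or launch configuration rather than in ROS 2 itself.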
-
Hello.
I am writing to inquire about the PyTorch version used in the Triton Inference Server 24.01 release.
Upon reviewing the documentation, I noticed that Triton 24.01 includes PyTorch version…
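The authoritative source for the bundled framework versions is NVIDIA's support matrix and release notes for the 24.01 container. As a quick sanity check, if the Python torch wheel happens to be installed in the image you are running, you can also print the version directly; this is only a sketch and is not guaranteed to work in every Triton server image, since the PyTorch backend ships as libtorch rather than the Python package.

```python
# Quick check inside the container; the torch Python package may not be
# present in every Triton server image, hence the guard.
try:
    import torch
    print("torch version:", torch.__version__)
except ImportError:
    print("torch Python package not installed in this image; "
          "check the NGC release notes / support matrix instead.")
```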
-
**Problem Description**
Many hospitals we work with have multiple servers (e.g., one with GPU for training and another without for inference). Right now, it's not possible to add multiple nodes from…
-
### System Info
TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
MODEL: meta-llama/Llama-3.1-405B-Instruct-FP8
Hardware used:
Intel® Xeon® Platinum 8…
-
**Description**
I noticed that a model configured with several instances is slower than the same model with a single instance. I would not expect this, but the throughput and latency measurements say otherwise.
**Triton …
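A small client-side benchmark can make this comparison concrete. The sketch below times requests against a running Triton server at different client concurrencies using tritonclient; the model name, input name, shape, and datatype are placeholders and need to match your config.pbtxt. (perf_analyzer from the Triton SDK is the more rigorous tool for this, but the idea is the same: a single in-flight request cannot keep multiple model instances busy.)

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tritonclient.http as httpclient

# Placeholders: adjust to match your model's config.pbtxt.
MODEL = "my_model"
INPUT_NAME = "INPUT__0"
SHAPE = [1, 3, 224, 224]
DTYPE = "FP32"

data = np.random.rand(*SHAPE).astype(np.float32)


def one_request():
    # One client per request keeps the sketch thread-safe, at the cost of
    # some connection overhead.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    inp = httpclient.InferInput(INPUT_NAME, SHAPE, DTYPE)
    inp.set_data_from_numpy(data)
    start = time.perf_counter()
    client.infer(MODEL, inputs=[inp])
    latency = time.perf_counter() - start
    client.close()
    return latency


def benchmark(concurrency, total=200):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total)))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency}: "
          f"throughput={total / elapsed:.1f} infer/s, "
          f"mean latency={1000 * sum(latencies) / total:.1f} ms")


# Compare a single in-flight request against enough parallelism to keep
# all configured model instances busy.
for c in (1, 2, 4, 8):
    benchmark(c)
```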
-
A test failed on a tracked branch
```
Error: Expected status code 200, got 500 with body '{"statusCode":500,"error":"Internal Server Error","message":"[status_exception\n\tCaused by:\n\t\tillegal_arg…
```
-
Tracking the second round of issues submitted to [triton-inference-server](https://github.com/triton-inference-server/server):
- [ ] https://github.com/triton-inference-server/server/issues/2018: Con…
-
In order to profile and optimize the current inference server architecture and best tune its hyper-parameters for various applications, it would be very useful for AlphaZero.jl to have a mode where it…
-
### Anything you want to discuss about vllm.
In my tests with vLLM, concurrent requests to the API server are faster than offline inference. I would like to ask if there are any pe…
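One way to make this comparison reproducible is to time both paths on the same prompts. The sketch below is an assumption about the setup being described: it times offline generation with the LLM class against concurrent requests to an already-running OpenAI-compatible vLLM server on localhost:8000. The model name, prompt set, sampling settings, and worker count are placeholders. Run the two halves separately (e.g., comment one out), otherwise the offline engine and the server will compete for GPU memory and skew the numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from vllm import LLM, SamplingParams

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
PROMPTS = ["Explain continuous batching."] * 64
MAX_TOKENS = 128

# --- Offline inference: one engine in this process ---
llm = LLM(model=MODEL)
params = SamplingParams(max_tokens=MAX_TOKENS)
start = time.perf_counter()
llm.generate(PROMPTS, params)
print(f"offline: {time.perf_counter() - start:.1f}s for {len(PROMPTS)} prompts")

# --- Concurrent requests to a running OpenAI-compatible server ---
# Started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model <MODEL>
def one_request(prompt):
    r = requests.post(
        "http://localhost:8000/v1/completions",
        json={"model": MODEL, "prompt": prompt, "max_tokens": MAX_TOKENS},
        timeout=600,
    )
    r.raise_for_status()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(one_request, PROMPTS))
print(f"server:  {time.perf_counter() - start:.1f}s for {len(PROMPTS)} prompts")
```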