-
https://github.com/PaddlePaddle/PaddleOCR/issues/7456
Please provide the following complete information so the problem can be located quickly:
- System Environment: Windows 11
- Version: P…
-
```
Server -> Receiving message of size: 24883378
Server -> 24883378 bytes read
Server -> Message parsed
Server -> Received inference request
Server -> Requesting inference on model: densepose
Server…
```
-
A few options to explore:
1. NVIDIA NeMo, TensorRT-LLM, Triton
- NeMo
Run [this Generative AI example](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/models/Gemma) to build LoRA wi…
-
### Motivation
I found that input token logprobs are supported by the Offline Inference Pipeline, as mentioned in the [doc](https://lmdeploy.readthedocs.io/en/latest/inference/vl_pipeline.html#calculate-lo…
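
For context, a minimal sketch of requesting logprobs through lmdeploy's offline pipeline; the model name and logprobs count are placeholders, and whether input tokens are covered (as this issue asks) depends on the installed lmdeploy version:

```python
# Sketch only: lmdeploy offline pipeline with logprobs requested.
# Model name and logprobs count are placeholders, not from the issue.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')
gen_config = GenerationConfig(logprobs=10, max_new_tokens=32)
resp = pipe(['Describe the picture.'], gen_config=gen_config)
print(resp[0].logprobs)  # per-token log probabilities of the generated tokens
```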
-
[slack conversation](https://seldondev.slack.com/archives/C03DQFTFXMX/p1692295520100029)
What is the behavior of Seldon Core v2 in the following scenario?
- A single server with HPA based on 50%…
-
**Description**
r23.04
```
I0718 11:39:24.385839 1 server.cc:653]
| Model | Version | Status …
```
-
**LocalAI version:**
Using Docker image:
`localai/localai:latest-aio-gpu-hipblas`
**Environment, CPU architecture, OS, and Version:**
- Ubuntu 22.04
- Xeon X5570 [Specs](https://ark.intel.c…
-
**Description**
I used the latest image, version 24.06, because the corresponding latest version of TensorRT supports BF16. But when I deployed the model with the TensorRT backend, I used perf_analyzer to pressu…
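
For reference, a pressure test of this kind is typically driven with perf_analyzer; a sketch of such an invocation, where the model name, endpoint, and concurrency range are placeholders:

```
perf_analyzer -m my_trt_model -i grpc -u localhost:8001 --concurrency-range 1:8
```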
-
My server has 8 GPUs, and when running
```
python inference.py
```
it can load all the models, but when given an image and a question as input it raises:
`RuntimeError: Expected all tensors to b…`
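
This error typically means the model's weights and the input tensors sit on different devices. A minimal, generic PyTorch sketch of the usual fix (the linear layer stands in for the real model; all names are placeholders):

```python
# Sketch only: pin the model and its inputs to the same device.
import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(16, 4).to(device)  # stand-in for the real multimodal model
x = torch.randn(1, 16)               # inputs are created on the CPU by default

# Moving every input to the model's device avoids
# "Expected all tensors to be on the same device".
x = x.to(device)

with torch.no_grad():
    y = model(x)
print(y.device)
```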
-
**Description**
I run the model on Triton Inference Server and also on ONNX Runtime (ORT) directly. Inference time on Triton Inference Server is 3 ms, but it is 1 ms on ORT. In addition, there isn't any communicati…
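
For comparison, a minimal sketch of how the bare-ORT side of such a measurement might be taken, assuming a CUDA build of onnxruntime; the model path and input shape are placeholders:

```python
# Sketch only: time a bare onnxruntime session to compare against Triton.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('model.onnx', providers=['CUDAExecutionProvider'])
name = sess.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder shape

# Warm up, then measure steady-state latency.
for _ in range(10):
    sess.run(None, {name: x})
t0 = time.perf_counter()
for _ in range(100):
    sess.run(None, {name: x})
print((time.perf_counter() - t0) / 100 * 1000, 'ms per inference')
```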