-
Running inference with /TensorRT-LLM/examples/run.py works fine:
mpirun -n 4 -allow-run-as-root python3 /load/trt_llm/TensorRT-LLM/examples/run.py \
--input_text "hello,who are you?" \
…
-
[ ] I have checked the [documentation](https://docs.ragas.io/) and related resources and couldn't resolve my bug.
**Describe the bug**
```
>>> generator.adapt(language, evolutions=[simple])
Trac…
```
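For context, this call comes from the ragas 0.1.x test-set generation workflow; here is a minimal sketch of the surrounding setup, assuming that API (the generator construction and the language value are illustrative, not taken from the truncated traceback):

```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple

# Hypothetical setup wrapping OpenAI models for generation and critique.
generator = TestsetGenerator.with_openai()

# Translate the evolution prompts into the target language, then persist them.
language = "hindi"  # illustrative target language
generator.adapt(language, evolutions=[simple])
generator.save(evolutions=[simple])
```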
-
Currently I'm using an LLM to generate streaming responses, and I found that Triton only supports streaming output through the gRPC protocol. [https://docs.nvidia.com/deeplearning/triton-inference-server/…
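For reference, a minimal sketch of consuming a streamed response over gRPC with `tritonclient.grpc`; the server URL, the model name `ensemble`, and the input tensor name `text_input` are assumptions following common tensorrtllm_backend setups, not details from the issue:

```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

def on_response(results, result, error):
    # Invoked once per streamed response (or error) from the server.
    results.put(error if error is not None else result)

results = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=partial(on_response, results))  # open the bidirectional stream

text = np.array([["hello, who are you?"]], dtype=object)
inp = grpcclient.InferInput("text_input", list(text.shape), "BYTES")
inp.set_data_from_numpy(text)

# Partial results arrive via the callback as the model generates tokens.
client.async_stream_infer(model_name="ensemble", inputs=[inp])

first_chunk = results.get()  # blocks until the first streamed chunk lands
client.stop_stream()
```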
-
### Your current environment
I have a server with only one NVLink connection, so I need to use pipeline parallelism and tensor parallelism within a single node to improve its performance. I would lik…
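For what it's worth, a sketch of combining the two on one node, assuming a vLLM version whose offline `LLM` entry point accepts `pipeline_parallel_size` alongside `tensor_parallel_size`; the model name and the 2x2 split are illustrative:

```python
from vllm import LLM, SamplingParams

# Hypothetical 4-GPU node: 2-way tensor parallel x 2-way pipeline parallel.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # illustrative model
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
)

outputs = llm.generate(
    ["Hello, who are you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)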
-
### Feature request type
sample request
### Is your feature request related to a problem? Please describe
In the documentation there is always a reference to `Mkldnn` usage but, apparently, the…
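The truncated request doesn't identify the project, but as a hedged guess at what `Mkldnn` usage typically looks like, here is a sketch assuming PyTorch's oneDNN (formerly MKL-DNN) CPU backend:

```python
import torch

# True if this PyTorch build ships the oneDNN (MKL-DNN) backend.
print(torch.backends.mkldnn.is_available())

x = torch.randn(1, 3, 224, 224)
y = x.to_mkldnn()   # reorder into the MKL-DNN blocked layout
z = y.to_dense()    # convert back to a regular strided tensor
```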
-
### System Info
- CPU architecture: x86_64
- CPU/Host memory size: 32 GB
- GPU name: L4 (GCP g2-standard-8)
- GPU memory size: 24 GB
- TensorRT-LLM branch or tag (e.g., main, v0.10.0)
- Nvi…
-
I would like to use features such as the Multi-instance Support provided by the tensorrt-llm backend. In the documentation, I can see that multiple models are served using modes such as Leader mode and …
-
**Is your feature request related to a problem? Please describe.**
1. We would like to try parallel model execution on iGPU+DLA devices. Is it possible to run triton-inference-server on a V3NP or Ori…
-
**What would you like to be added**:
ollama provides an [SDK](https://github.com/ollama/ollama-python) for integrations, so we can integrate with it easily. One of the benefits I can think of is olla…
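As a sketch of what such an integration could look like through the `ollama` package; the model tag and prompt are illustrative:

```python
# pip install ollama
import ollama

# One-shot chat call against a locally running ollama server.
response = ollama.chat(
    model="llama3",  # illustrative model tag
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```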
-
The current `Cluster` deployment only allows inference servers to be deployed on GPUs [see here](https://github.com/fmperf-project/fmperf/blob/b7ae68125724d3c63563fd84eebba7eee347e27f/fmperf/Cluster.py#L13…