-
**Is your feature request related to a problem? Please describe.**
When thinking about using LocalAI in a production environment to serve an open-source LLM with an OpenAI-compatible API, things like …
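For context, an OpenAI-compatible server such as LocalAI accepts the standard `/v1/chat/completions` request shape. A minimal sketch of building and sending such a request from Python; the endpoint URL and model name are assumptions for illustration, not from this issue:

```python
import json
import urllib.request


def chat_request(model: str, prompt: str, temperature: float = 0.2) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def send(base_url: str, payload: dict) -> dict:
    """POST the payload to an OpenAI-compatible endpoint (e.g. a LocalAI deployment)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical usage against a local deployment:
# send("http://localhost:8080", chat_request("mistral-7b", "Hello"))
```

Because the request shape is the same as OpenAI's, any client built against the OpenAI API can be pointed at such a server by swapping the base URL.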
-
**Setup**
Machine: AWS SageMaker ml.p4d.24xlarge
Model: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Used the Docker container image with the latest build of TRT-LLM (`0.8.0.dev2024011…
-
### Proposal to improve performance
_No response_
### Report of performance regression
_No response_
### Misc discussion on performance
To reproduce vLLM's performance benchmark, please…
-
### Checked other resources
- [X] I added a very descriptive title to this issue.
- [X] I searched the LangChain documentation with the integrated search.
- [X] I used the GitHub search to find a…
-
There has been no release for three months and only a few commits recently, so will this project be actively maintained?
I tried serving some LLMs with ray-llm, and I needed to update transformers, install tikt…
-
### Your current environment
My vLLM version is:
pip show vllm
Name: vllm
Version: 0.3.3+git3380931.abi0.dtk2404.torch2.1
Summary: A high-throughput and memory-efficient inference and serving eng…
-
Hello,
I've encountered an issue where the request launcher does not allow the next requests to be sent until all requests specified by `num_concurrent_requests` have finished.
This behavior see…
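The batch-barrier behavior described here can be contrasted with a sliding-window launcher, which starts the next request as soon as any in-flight request finishes rather than waiting for the whole group to drain. A minimal asyncio sketch of the sliding-window approach; the function names and timings are hypothetical, not the library's API:

```python
import asyncio


async def run_sliding_window(n_requests: int, num_concurrent_requests: int) -> dict:
    """Keep at most `num_concurrent_requests` in flight, starting a new
    request as soon as any one finishes (no batch barrier)."""
    sem = asyncio.Semaphore(num_concurrent_requests)
    state = {"in_flight": 0, "peak": 0, "done": []}

    async def one_request(i: int) -> None:
        async with sem:  # slot is freed per-request, not per-batch
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
            await asyncio.sleep(0.01)  # stand-in for the actual LLM call
            state["in_flight"] -= 1
            state["done"].append(i)

    await asyncio.gather(*(one_request(i) for i in range(n_requests)))
    return state
```

With this structure the concurrency level stays pinned at the limit while requests remain, instead of sawtoothing down to zero at the end of each batch.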
-
Hello,
I am using a fine-tuned open-source LLM, and it works great in Docker after following the instructions to build TensorRT-LLM.
However, after building the wheel install package I am not …
-
We're having trouble running inference efficiently at scale. We're currently processing the audio parts one by one, as is the default for inference, but is there any support for batch inference to speed th…
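Where the backend exposes a batched entry point, the one-by-one loop described here can be replaced by grouping segments into fixed-size batches. A generic sketch; `model_infer` is a placeholder for whatever batched API the framework actually provides:

```python
from typing import Callable, Iterable, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterable[Sequence]:
    """Yield consecutive slices of `items`, each of length <= batch_size."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def infer_in_batches(
    segments: Sequence,
    model_infer: Callable[[Sequence], List],
    batch_size: int = 8,
) -> List:
    """Run a batched inference function over audio segments, preserving order."""
    outputs: List = []
    for batch in batched(segments, batch_size):
        outputs.extend(model_infer(batch))  # one forward pass per batch
    return outputs
```

The speedup comes from amortizing per-call overhead and filling the GPU with one forward pass per batch; the right `batch_size` depends on segment length and available memory.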
-
### System Info
- tensorrtllm_backend built using Dockerfile.trt_llm_backend
- main branch TensorRT-LLM (0.13.0.dev20240813000)
- 8xH100 SXM
- Driver Version: 535.129.03
- CUDA Version: 12.5
…