-
**Setup**
Machine: AWS Sagemaker ml.p4d.24xlarge
Model: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Used a Docker container image with the latest build of trt-llm (`0.8.0.dev2024011…
-
Hi, thanks for the great work!
What if I want to support a larger model, say, one that exceeds a single GPU card's memory and needs tensor parallelism (TP)? Is there a reason why QServe [doesn't support tp](https://github.com/mit-han-…
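For context, the core idea of tensor parallelism is to shard a layer's weight matrix across devices so that no single card has to hold the full model. A minimal, framework-free sketch of column-parallel TP, with plain Python lists standing in for GPU shards (all names here are illustrative, not QServe API):

```python
# Column-parallel tensor parallelism sketch: a linear layer y = x @ W is
# split by sharding W's columns across "devices"; each device computes a
# slice of y, and the slices are concatenated (an all-gather in a real
# system). The result is bit-identical to the single-device matmul.

def matmul(x, w):
    """Multiply row vector x by matrix w (given as a list of rows)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w, parts):
    """Shard matrix w column-wise into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, 2)           # each "GPU" holds half the columns
partials = [matmul(x, s) for s in shards]
y_tp = partials[0] + partials[1]       # concatenation = all-gather

assert y_tp == matmul(x, w)            # same result as one big device
```

The same trick row-wise (sharding W's rows and all-reducing partial sums) is the other half of how Megatron-style TP covers a full transformer block.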
-
As per title.
Example: with GPUs like 3060 12GB or 3090 24GB.
-
If I take something like the DEITA pipeline from the docs and replace OpenAILLM with TransformersLLM, running the pipeline loads my HF Transformers model 4 times. Am I doing something wrong here? …
-
## What are the problems? (screenshots or detailed error messages)
## What are the types of GPU/CPU you are using?
## What's the operating system ppl.llm.serving runs on?
## What's the compile…
-
Suppose we want to load-test an API which uses server-sent events (SSE). Is it possible to measure the time-to-first-byte using Goose?
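Independent of whatever hooks Goose exposes, time-to-first-byte over a streamed response can be measured by timing the gap between issuing the request and the arrival of the first body chunk. A minimal sketch, using a generator as a stand-in for the SSE stream (the function names are illustrative):

```python
import time

def measure_ttfb(start_request, chunks):
    """Return (ttfb_seconds, received_chunks), where TTFB is the delay
    from `start_request` (when the request was issued) to the first
    chunk of the response body."""
    received = []
    ttfb = None
    for chunk in chunks:
        if ttfb is None:
            ttfb = time.monotonic() - start_request
        received.append(chunk)
    return ttfb, received

def slow_sse_stream():
    """Stand-in for an SSE response: the first event arrives after a delay."""
    time.sleep(0.05)
    yield b"data: first\n\n"
    yield b"data: second\n\n"

start = time.monotonic()
ttfb, events = measure_ttfb(start, slow_sse_stream())
assert ttfb >= 0.05   # first byte could not arrive before the server delay
assert len(events) == 2
```

In a real load test the chunk iterator would be the streaming response body, and TTFB samples would be aggregated into the usual percentiles.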
-
Hello,
I am using a fine-tuned open-source LLM, and it works great in the Docker container after following the instructions to build TensorRT-LLM.
However, after building the wheel install package I am not …
-
Is there any comparative performance data between ScaleLLM and vLLM?
-
The current version is unable to calculate the time spent on the first token versus the rest of the tokens; can we add this metric?
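For reference, splitting latency into time-to-first-token (TTFT) and mean inter-token latency is straightforward once per-token arrival timestamps are recorded. A minimal sketch with hypothetical timestamps in seconds:

```python
def first_vs_rest(request_start, token_timestamps):
    """Split per-token timings into time-to-first-token and the mean
    inter-token latency over the remaining tokens."""
    ttft = token_timestamps[0] - request_start
    if len(token_timestamps) < 2:
        return ttft, None  # no "rest" tokens to average over
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return ttft, sum(gaps) / len(gaps)

# Example: request at t=0, first token at 0.5 s, then one token every 0.1 s.
ttft, itl = first_vs_rest(0.0, [0.5, 0.6, 0.7, 0.8])
assert ttft == 0.5
assert abs(itl - 0.1) < 1e-9
```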
-
Greetings, @cipher982!
I've seen the benchmark application https://www.llm-benchmarks.com/local and it looks great! I'm currently working on a competitive analysis of these 4 backends: Transformers…