Open robosina opened 2 months ago
In the tutorial Running LLaMA 3 8B with TensorRT-LLM on Serverless GPUs, you link to the trt-llm source code on GitHub, but the link no longer seems to exist:
https://github.com/CerebriumAI/examples/tree/master/15-tensor-trt
My primary issue is the claim of reaching ~4500 output tokens per second on a single NVIDIA A100. I'm only getting around 200 tokens/second. Am I missing something, or did you apply additional optimization techniques to reach the higher rate?
Hi @robosina
200 tokens/second definitely doesn't sound right. You can compare against NVIDIA's official benchmarks here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
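One thing worth checking: published throughput figures like these are usually *aggregate* output tokens/second across many concurrently batched requests, not the tokens/second of a single sequential request stream. A quick way to see whether that explains the gap is to time your own workload at two concurrency levels. The sketch below is a minimal, framework-agnostic illustration of that measurement; `generate()` here is a hypothetical stand-in (it just simulates a request that returns 128 tokens after a short delay), and you would replace it with your actual TensorRT-LLM client call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real inference call; swap in your actual
# TensorRT-LLM client here. It simulates producing 128 output tokens
# after a 0.05 s delay so the timing logic below can be exercised.
def generate(prompt: str) -> int:
    time.sleep(0.05)
    return 128  # number of output tokens produced for this request

def measure_throughput(prompts, max_concurrency):
    """Aggregate output tokens/sec across concurrently issued requests."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed

prompts = ["hello"] * 64

# Single-stream: requests run one after another, so this approximates
# per-request decode speed.
single = measure_throughput(prompts, max_concurrency=1)

# Concurrent: the same work issued in parallel; this aggregate number
# is what most published throughput benchmarks report.
concurrent = measure_throughput(prompts, max_concurrency=64)

print(f"single-stream: {single:.0f} tok/s, concurrent: {concurrent:.0f} tok/s")
```

If your ~200 tok/s figure comes from a single request at a time, it is measuring per-stream decode speed, which will be far below the batched aggregate numbers in the benchmark tables.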