CerebriumAI / examples


Missing link in the tutorials #50

Open robosina opened 2 months ago

robosina commented 2 months ago

In the tutorial Running LLaMA 3 8B with TensorRT-LLM on Serverless GPUs, you mention a GitHub link to the trt-llm source code, but it no longer seems to exist:

https://github.com/CerebriumAI/examples/tree/master/15-tensor-trt

My primary issue is the claim of reaching ~4500 output tokens per second on a single NVIDIA A100. I'm only getting around 200 tokens/second. Am I missing something, or did you apply optimization techniques to achieve the higher rate?

milo157 commented 2 months ago

Hi @robosina

200 tokens/second definitely doesn't sound right - you can see NVIDIA's official benchmarks here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-overview.md
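One thing worth checking when comparing numbers like these: headline figures such as ~4500 tokens/s are usually aggregate throughput, i.e. total tokens generated across all concurrent requests divided by wall-clock time, while a single streamed request sees far less. A minimal sketch of that calculation, with `generate` standing in as a hypothetical placeholder for whatever batched call the deployment exposes (not an actual Cerebrium or TensorRT-LLM API):

```python
import time

def measure_throughput(generate, prompts):
    """Aggregate throughput: total tokens generated across all
    concurrent requests divided by wall-clock time."""
    start = time.perf_counter()
    token_counts = generate(prompts)  # one output-token count per request
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed
```

With, say, 32 concurrent requests each streaming ~150 tokens/s, the aggregate figure would be ~4800 tokens/s even though no single request ever exceeds 150, so it may be worth confirming whether the 4500 and 200 figures are measuring the same thing.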