Hi @hrishi121, the raw decode benchmarks are run with the MLC benchmark script, using 16 tokens input and 128 tokens output: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc#benchmarks
Thank you @dusty-nv; this is really helpful! I also see that the benchmark script is called from outside the container. I'll try this with the `dustynv/mlc:0.1.0-r36.3.0` image that you have provided.
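For reference, this is roughly what I'm planning to run from a host checkout of the repo (the script path is my assumption; the exact usage, with 16 input / 128 output tokens, is in the README linked above):

```bash
# Clone the repo on the host; the benchmark script launches the container itself
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers

# Script path and arguments are my assumption -- see the MLC package README
# for the exact invocation
./packages/llm/mlc/benchmark.sh
```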
I actually tried out the `dustynv/mlc:0.1.1-r36.3.0` image on a Jetson Orin NX, and for TinyLlama-1.1B with q4f16_ft quantization I get an average decode rate of 86 tokens/sec (which is great, since it's slightly higher than the decode rate posted on the blog for Jetson Orin Nano).
I'll try and see if I can run `benchmark.sh` and `test.sh` with the NanoLLM image as well (and check whether I get similar numbers).

Once again, thank you @dusty-nv! (Closing the issue for the time being.)
Hello,
I am trying to reproduce the benchmarks for TinyLlama-1.1B posted on the Jetson AI Lab page. The chart shows that for Jetson Orin Nano, the text generation rate for the TinyLlama model is about 68 tokens/sec. However, on my Jetson Orin NX, the decode rate I observed was anywhere between 30 and 45 tokens/sec.

I am using the `dustynv/nano_llm:r36.2.0` image provided in the jetson-containers repo. For the benchmark, here's the command that I use, with 512 tokens as input.
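It's roughly of the following form (paraphrasing the NanoLLM docs from memory; the model id and flag values here are illustrative rather than my exact invocation):

```bash
# Approximate shape of the benchmark command (flags per the NanoLLM docs;
# model id and token counts are illustrative placeholders)
jetson-containers run dustynv/nano_llm:r36.2.0 \
  python3 -m nano_llm.chat --api=mlc \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --quantization q4f16_ft \
    --max-new-tokens 128
```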
Furthermore, I am using a 12V-3A power adapter, and I have set the power mode with `sudo nvpmodel -m 0`. (Is the input power the reason that's holding back the performance?)
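If it helps with diagnosing this, the standard JetPack tools can verify the mode and pin the clocks (the `jetson_clocks` step is a general benchmarking recommendation, not something I've confirmed changes these numbers):

```bash
# Confirm the active power mode (mode 0 should report MAXN)
sudo nvpmodel -q

# Pin CPU/GPU/memory clocks to the maximum for the current mode;
# without this, DVFS can keep clocks low during short benchmark runs
sudo jetson_clocks

# Watch live clock frequencies and power draw while the benchmark runs
sudo tegrastats
```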
My hardware specification is:

I am not totally sure why I am seeing a lower decode rate than the one posted on the Jetson AI Lab page; I would really appreciate any pointers!
Thank you.