dusty-nv / NanoLLM

Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
https://dusty-nv.github.io/NanoLLM/
MIT License

[Question] Reproducing benchmarks for TinyLlama-1.1B #38

Closed: hrishi121 closed this issue 1 month ago

hrishi121 commented 1 month ago

Hello,

I am trying to reproduce the benchmarks for TinyLlama-1.1B posted on the Jetson AI Lab page. The chart shows that for the Jetson Orin Nano, the text generation rate for the TinyLlama model is about 68 tokens/sec.

However, on my Jetson Orin NX, the decode rate I observed was anywhere between 30 and 45 tokens/sec. I am using the dustynv/nano_llm:r36.2.0 image provided in the jetson-containers repo.

For the benchmark, here's the command that I use:

python3 -m nano_llm.completion --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quantization q4f16_ft --api mlc --max-new-tokens 10

And I am using 512 tokens as input.
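
For reference, here is a minimal sketch of how I time prefill and decode separately, assuming the NanoLLM Python API (NanoLLM.from_pretrained() and a streaming generate(), as shown in the NanoLLM docs); the repeated-sentence prompt is just a rough stand-in for my actual ~512-token input:

import time
from nano_llm import NanoLLM

# Load TinyLlama through the MLC backend with q4f16_ft quantization
model = NanoLLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    api="mlc",
    quantization="q4f16_ft",
)

# Rough stand-in for a ~512-token input prompt (not my exact input)
prompt = "The quick brown fox jumps over the lazy dog. " * 50

start = time.perf_counter()
first_token_time = None
num_tokens = 0

# generate() streams tokens back as they are decoded
for token in model.generate(prompt, max_new_tokens=128):
    if first_token_time is None:
        first_token_time = time.perf_counter()  # end of prefill
    num_tokens += 1
end = time.perf_counter()

print(f"prefill (time to first token): {first_token_time - start:.2f} s")
print(f"decode rate: ~{(num_tokens - 1) / (end - first_token_time):.1f} tokens/sec")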

Furthermore, I am using a 12V/3A power adapter and have set the power mode with sudo nvpmodel -m 0. (Could the input power be what's holding back the performance?)
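
For context, here is roughly how I sanity-check the power configuration before benchmarking. This is just a minimal sketch assuming the stock L4T nvpmodel and jetson_clocks tools and that it is run as root:

import subprocess

# Print the active nvpmodel power profile (mode 0 is the profile I have set)
subprocess.run(["nvpmodel", "-q"], check=True)

# Lock CPU/GPU/EMC clocks to their maximum for the current power mode,
# so short benchmark runs aren't skewed by clock ramp-up
subprocess.run(["jetson_clocks"], check=True)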

My hardware specifications are:

Jetson Orin NX
-- L4T_VERSION=36.3.0
-- JETPACK_VERSION=6.0
-- CUDA_VERSION=12.4
-- PYTHON_VERSION=3.10
-- LSB_RELEASE=22.04 (jammy)

I am not totally sure why I am seeing a lower decode rate than the one posted on the Jetson AI Lab page; I would really appreciate any pointers!

Thank you.

dusty-nv commented 1 month ago

Hi @hrishi121, the raw decode benchmarks are run with the MLC benchmark script, using 16 tokens of input and 128 tokens of output: https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc#benchmarks

hrishi121 commented 1 month ago

Thank you @dusty-nv; this is really helpful! I also see that the benchmark script is being called from outside the container. I'll try this with the dustynv/mlc:0.1.0-r36.3.0 image that you have provided.

hrishi121 commented 1 month ago

I tried out the dustynv/mlc:0.1.1-r36.3.0 image on the Jetson Orin NX, and for TinyLlama-1.1B with q4f16_ft quantization I am getting an average decode rate of 86 tokens/sec, which is great since it's slightly higher than the rate posted for the Jetson Orin Nano. I'll also try running benchmark.sh and test.sh with the NanoLLM image and see if I get similar numbers. Once again, thank you @dusty-nv. (Closing the issue for the time being.)