Closed: ramyadhadidi closed this issue 1 month ago.
Hi Ramyad, this was with an AGX Orin 64GB in MAX-N power mode. Other folks typically have to switch to this power mode (using the nvpmodel tool), after which they get the same 47 tokens/sec with Llama-2-7B, for example. IIRC the AGX Orin 32GB also has fewer GPU cores, not just less memory capacity.
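For anyone following along, a sketch of the power-mode switch mentioned above, assuming the stock JetPack `nvpmodel` and `jetson_clocks` tools. The mode IDs come from `/etc/nvpmodel.conf` and can differ between modules and JetPack versions, so check that file before assuming mode 0 is MAXN on your board:

```shell
# Query the currently active power mode
nvpmodel -q

# Switch to MAXN (mode 0 on AGX Orin; verify against /etc/nvpmodel.conf)
sudo nvpmodel -m 0

# Optionally lock CPU/GPU/EMC clocks at their maximums for benchmarking
sudo jetson_clocks
```

Benchmarks taken in a lower-wattage mode (e.g. the 30W default) will land well below the published numbers.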
Hi Dustin, thanks. You are right, I am in 30W mode. I will close this issue and reopen it if I cannot reproduce results close to the AGX 64GB numbers.
Hello,
I've been trying to reproduce the benchmarking results published on the Jetson AI Lab webpage (https://www.jetson-ai-lab.com/benchmarks.html).
First of all, it is unclear which Jetson Orin models are utilized, i.e., AGX: [32GB, 64GB, or Industrial] / Nano: [4GB or 8GB].
Second, I'm having a hard time matching the reported numbers: I get 3-4x lower performance. I followed the MLC guide (https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) with some fixes (see https://github.com/dusty-nv/jetson-containers/issues/529). For instance:

I have an AGX Orin 32GB, so these lower numbers might be partially related to the memory-capacity and TOPS differences between the two AGX variants (~1.4x). However, I don't think this is the main reason: none of the above models are bottlenecked by the smaller memory of the AGX 32GB, and the TOPS difference is not significant given that attention is a memory-bound operation (the memory bandwidth of both modules is the same).
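The memory-bound argument above can be sanity-checked with a back-of-envelope calculation (my own sketch, not from the benchmark page): in the decode phase, each generated token streams the full set of model weights from DRAM roughly once, so the token rate is bounded by memory bandwidth divided by model size. The 204.8 GB/s figure is the published LPDDR5 bandwidth shared by the AGX Orin 32GB and 64GB modules; the 4-bit weight size assumes MLC's q4f16 quantization:

```python
# Roofline-style upper bound on decode throughput:
#   tokens/sec <= memory_bandwidth / model_size_in_bytes

BANDWIDTH_BYTES_PER_SEC = 204.8e9  # LPDDR5, same on AGX Orin 32GB and 64GB
PARAMS = 7e9                       # Llama-2-7B parameter count
BYTES_PER_PARAM = 0.5              # ~4 bits per weight (q4f16 quantization)

model_bytes = PARAMS * BYTES_PER_PARAM                  # ~3.5 GB of weights
upper_bound = BANDWIDTH_BYTES_PER_SEC / model_bytes     # ~58 tokens/sec

print(f"bandwidth-bound ceiling: ~{upper_bound:.0f} tokens/sec")
```

The ~58 tokens/sec ceiling is consistent with the 47 tokens/sec Dustin reports for the 64GB module in MAX-N, and since both modules share the same bandwidth, the 32GB part should in principle land in the same ballpark once the power mode matches.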