ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Odroid N2+ @ 2.2 GHz slower than Raspberry Pi 4 @ 1.5 GHz #600

Closed 1 year ago by psyhtest

psyhtest commented 2 years ago

I've noticed a curious thing when benchmarking ResNet50 via Arm NN v21.11 with the Neon backend on an Odroid N2+ @ 2208 MHz and a Raspberry Pi 4 @ 1500 MHz. Despite the ~47% higher clock frequency, the N2+ is actually ~9% slower than the RPi4: 342 ms vs 315 ms. During the execution, the N2+ only uses its 4 big cores at ~50% and does not use its 2 LITTLE cores at all, while the RPi4 uses its 4 big cores at ~100%.
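For reference, the raw ratios behind those figures can be checked with a few lines of Python (numbers taken from the measurements above):

```python
# Quick check of the arithmetic behind the N2+ vs RPi4 comparison.
n2_mhz, rpi4_mhz = 2208, 1500      # clock frequencies
n2_ms, rpi4_ms = 342, 315          # measured ResNet50 latencies

clock_advantage = n2_mhz / rpi4_mhz - 1
latency_penalty = n2_ms / rpi4_ms - 1

print(f"N2+ clock advantage: {clock_advantage:.0%}")   # ~47%
print(f"N2+ latency penalty: {latency_penalty:.1%}")   # ~8.6%
```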

You can follow this Jupyter notebook to reproduce with the following updates to the Performance measurement commands:

ck run cmdgen:benchmark.image-classification.tflite-loadgen --verbose \
--model=resnet50 --scenario=singlestream --mode=performance \
--library=armnn-v21.11-neon --sut=odroid --target_latency=340
ck run cmdgen:benchmark.image-classification.tflite-loadgen --verbose \
--model=resnet50 --scenario=singlestream --mode=performance \
--library=armnn-v21.11-neon --sut=rpi4coral --target_latency=310
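Since the N2+ leaves its LITTLE cores idle and runs the big cores at only ~50%, one quick experiment (my suggestion, not from the thread) is to rule out scheduler placement by pinning the benchmark to the big cores explicitly. On the N2+'s Amlogic S922X the Cortex-A73 cores are typically CPUs 2-5, but confirm the numbering with `lscpu --extended` first:

```shell
# Hypothetical sketch: pin the Odroid run to the four A73 big cores.
# CPU IDs 2-5 are an assumption about this board's topology.
taskset -c 2-5 ck run cmdgen:benchmark.image-classification.tflite-loadgen --verbose \
    --model=resnet50 --scenario=singlestream --mode=performance \
    --library=armnn-v21.11-neon --sut=odroid --target_latency=340
```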
psyhtest commented 2 years ago

I've got another interesting observation comparing two Raspberry Pi 4s: 322 ms on a new Model B Rev 1.4 8GB @ 1800 MHz vs 315 ms on a Model B Rev 1.1 4GB @ 1500 MHz. It seems the higher the frequency, the worse the latency gets. On the new RPi4, utilization is ~93%.

MatthewARM commented 2 years ago

Thanks Anton - would you be able to attach the Arm NN event profiles from the two runs? Perhaps for some reason sub-optimal kernels are being selected, and we should be able to see that in the profiles.

psyhtest commented 2 years ago

@MatthewARM, how do I dump event profiles? Can I do it from a release build?

MatthewARM commented 2 years ago

Sorry Anton, somehow I missed this message.

If you're still curious:

I think it's enabled with the "-e" option to ExecuteNetwork, if that's the tool being used for the benchmark? It's absolutely available in a release build.
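For reference, an invocation might look like the sketch below. The `-e` option is the one mentioned above; the model and backend flags are illustrative placeholders, so check `ExecuteNetwork --help` on your build for the exact option names:

```shell
# Hypothetical sketch: run on the Neon (CpuAcc) backend with event
# profiling enabled, capturing the profile output to a file.
./ExecuteNetwork -m resnet50.tflite -c CpuAcc -e > profile.txt
```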

If CK is using the Arm NN API directly, then there are good instructions in the answer to #464

Hope that helps, Matthew

psyhtest commented 2 years ago

Thank you @MatthewARM. For the official MLPerf Inference v2.0 submission, we measured 339 ms for the Odroid N2+ and 349 ms for the Raspberry Pi 4. Seems like an RPi4 regression to me (314/349 => -10%).
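The regression figure quoted in parentheses works out as follows (a quick sketch using the latencies reported in this thread):

```python
# RPi4 ResNet50 latency: ~314 ms in the earlier runs vs 349 ms in
# the MLPerf Inference v2.0 submission.
old_ms, new_ms = 314, 349
regression = 1 - old_ms / new_ms
print(f"RPi4 regression: {regression:.1%}")   # ~10.0%
```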

matthewsloyanARM commented 1 year ago

Hi @psyhtest,

Thank you for getting in touch. As this was a while ago, it's quite possible that the regression could have been fixed or even improved upon as there are always optimizations being made to the Neon backend.

Would it be possible to run the tests using the latest version of Arm NN and the Arm Compute Library with profiling enabled? We would be more than happy to take a look at your profiling results if the issue is still occurring, as Matthew mentioned above, to see if it's related to sub-optimal kernels being selected.

Kind regards,

Matthew

MatthewARM commented 1 year ago

I'm wondering whether different memory bandwidth across these chips could explain these differences.
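One way to probe that hypothesis (my sketch, not part of the thread): a crude single-threaded copy benchmark gives a rough lower bound on sustained memory bandwidth on each board. A proper comparison would use something like STREAM, but this needs nothing beyond the Python standard library:

```python
import time

# 64 MiB working set, large enough to defeat the last-level caches
# on both boards.
N = 64 * 1024 * 1024
src = bytearray(N)

start = time.perf_counter()
dst = bytes(src)              # one full read + one full write of the buffer
elapsed = time.perf_counter() - start

gb_per_s = 2 * N / elapsed / 1e9
print(f"approx copy bandwidth: {gb_per_s:.1f} GB/s")
```

Running the same script on the N2+ and both RPi4 revisions would show whether the latency ordering tracks memory bandwidth rather than clock frequency.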

MikeJKelly commented 1 year ago

Closed due to inactivity. If this is still an issue for you, please reopen it or create a new one. Best regards, Mike.