psyhtest closed 1 year ago
I've got another interesting observation comparing two Raspberry Pi 4s: 322 ms on a new Model B Rev 1.4 8GB @ 1800 MHz vs 315 ms on a Model B Rev 1.1 4GB @ 1500 MHz. It seems the higher the frequency, the worse the latency gets. On the new RPi4, utilization is about 93%.
Thanks Anton - would you be able to attach the Arm NN event profiles from the two runs? Perhaps for some reason sub-optimal kernels are being selected, and we should be able to see that in the profiles.
@MatthewARM, how do I dump event profiles? Can I do it from a release build?
Sorry Anton, somehow I missed this message.
If you're still curious:
I think it's enabled with the "-e" option to ExecuteNetwork, if that's the tool being used for the benchmark? It's absolutely available in a release build.
If CK is using the Arm NN API, then there are good instructions in the answer to #464.
Hope that helps, Matthew
Thank you @MatthewARM. For the official MLPerf Inference v2.0 submission, we measured 339 ms for Odroid N2+ and 349 ms for Raspberry Pi 4. Seems like an RPi4 regression to me (314/349 => -10%).
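For readers checking the arithmetic, the "-10%" above is the throughput view of the change (old latency divided by new latency). A minimal sketch, using only the 314 ms and 349 ms figures quoted in that comment; the helper name `relative_change` is hypothetical and not part of CK or Arm NN:

```python
def relative_change(before_ms: float, after_ms: float) -> dict:
    """Express a latency change both as latency growth and as throughput loss."""
    return {
        # How much longer one inference takes now.
        "latency_change": after_ms / before_ms - 1.0,
        # How many fewer inferences per second that implies.
        "throughput_change": before_ms / after_ms - 1.0,
    }


# Figures taken from the comment above (RPi4, earlier run vs MLPerf v2.0 run).
rpi4 = relative_change(before_ms=314.0, after_ms=349.0)
print(f"RPi4 latency change:    {rpi4['latency_change']:+.1%}")
print(f"RPi4 throughput change: {rpi4['throughput_change']:+.1%}")
```

The same regression reads as roughly +11% in latency terms and roughly -10% in throughput terms, so it helps to state which one is meant when comparing runs.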
Hi @psyhtest,
Thank you for getting in touch. As this was a while ago, it's quite possible that the regression could have been fixed or even improved upon as there are always optimizations being made to the Neon backend.
Would it be possible to run the tests using the latest version of Arm NN and Arm Compute Library with profiling enabled? If the regression is still occurring, we would be more than happy to take a look at your profiling results, as Matthew mentioned above, to see if it's related to sub-optimal kernels being selected.
Kind regards,
Matthew
I'm wondering whether different memory bandwidth across these chips could explain these differences.
Closed due to inactivity. If this is still an issue for you, please reopen the issue or create a new one. Best regards, Mike.
I've noticed a curious thing when benchmarking ResNet50 via Arm NN v21.11 with the Neon backend on Odroid N2+ @ 2208 MHz and Raspberry Pi 4 @ 1500 MHz. Despite the roughly 47% higher clock frequency, N2+ is actually about 9% slower than RPi4: 342 ms vs 315 ms. During execution, N2+ only uses its 4 big cores at ~50% and does not use its 2 LITTLE cores at all, while RPi4 uses its 4 big cores at ~100%.
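One rough way to compare the two boards is to normalise latency by clock frequency, giving a per-core "cycles per inference" proxy. The sketch below uses only the clock and latency figures quoted above; the `Run` class and the proxy itself are illustrative assumptions, and the proxy deliberately ignores core utilization, big.LITTLE scheduling, and memory bandwidth:

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    clock_mhz: float
    latency_ms: float

    @property
    def mcycles_per_inference(self) -> float:
        # MHz * ms = 1e6 cycles/s * 1e-3 s = 1e3 cycles, so divide by 1e3 for Mcycles.
        return self.clock_mhz * self.latency_ms / 1e3


# Figures as reported in the post above.
n2_plus = Run("Odroid N2+", clock_mhz=2208.0, latency_ms=342.0)
rpi4 = Run("Raspberry Pi 4", clock_mhz=1500.0, latency_ms=315.0)

clock_ratio = n2_plus.clock_mhz / rpi4.clock_mhz
latency_ratio = n2_plus.latency_ms / rpi4.latency_ms
cycle_ratio = n2_plus.mcycles_per_inference / rpi4.mcycles_per_inference

for run in (n2_plus, rpi4):
    print(f"{run.name}: {run.mcycles_per_inference:.1f} Mcycles per inference")
print(f"Clock ratio (N2+/RPi4):   {clock_ratio:.3f}")
print(f"Latency ratio (N2+/RPi4): {latency_ratio:.3f}")
print(f"Cycle ratio (N2+/RPi4):   {cycle_ratio:.3f}")
```

A cycle ratio well above 1 would suggest the N2+ is spending its extra clock cycles waiting rather than computing, which is consistent with the ~50% utilization observed, though it does not by itself identify the cause.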
You can follow this Jupyter notebook to reproduce with the following updates to the Performance measurement commands: