aallan / benchmarking-ml-on-the-edge

Benchmarking machine learning inferencing on embedded hardware.
MIT License

Interest in getting these benchmarks easier to run across systems / NPUs #1

Open geerlingguy opened 4 weeks ago

geerlingguy commented 4 weeks ago

Just today, Intel's CEO said:

"AI PC benchmarks are needed. We don't have proper comparisons yet."

Just marking my interest in helping get these benchmarks to run well across a variety of systems. This month my concrete goal is to try to get a benchmark to compare the Coral (supposedly 4 TOPS, tops) to the Hailo-8L (supposedly 13 TOPS).

It'd be nice if there were some benchmarks we could run that compare different NPUs somewhat fairly using real-world scenarios, like what this repository does. Geekbench, Cinebench, etc. are decent for what they are, and this could be added to the testbench for all the upcoming 'AI PCs', 'Copilot+' PCs, Edge AI boxes, whatever.

aallan commented 4 weeks ago

Hey! Yes, these scripts were written (evolved, really, over the course of a number of articles) and aren't really set up to be automated, are fiddly to run, and have no tests. I wrote them at a time when there were no standard benchmarks for models, and honestly I still think that the real-world(ish) scenario I use here is far more illustrative of performance than some of the fancier and (in theory) more complete benchmarks.

One thing I'd desperately want to keep is simplicity; one of the things that these benchmarks don't do is (much) optimisation. They take an image, throw it at a model, and measure the result. The code is simple, and what it measures is comparable to the performance an average developer doing the task might get, rather than a machine learning researcher who understands the complexities and limitations of the models and how to adapt them to individual platforms.
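For context, the core of each script is not much more than the following, sketched here against the TensorFlow Lite Python API; the model path, image name, and preprocessing are placeholders rather than the repository's actual code:

```python
import time
import numpy as np
from PIL import Image
import tflite_runtime.interpreter as tflite

# Placeholder filenames -- the real scripts ship their own model and test image.
MODEL = "mobilenet_ssd_v2_coco_quant.tflite"
IMAGE = "fruit.jpg"

interpreter = tflite.Interpreter(model_path=MODEL)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Resize the test image to the model's expected input shape.
_, height, width, _ = input_details[0]["shape"]
image = Image.open(IMAGE).resize((width, height))
input_data = np.expand_dims(np.asarray(image, dtype=np.uint8), axis=0)

# Throw the image at the model and time just the inference call.
interpreter.set_tensor(input_details[0]["index"], input_data)
start = time.monotonic()
interpreter.invoke()
elapsed_ms = (time.monotonic() - start) * 1000

print(f"Inference took {elapsed_ms:.1f} ms")
print("Raw output tensor:", interpreter.get_tensor(output_details[0]["index"]))
```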

aallan commented 4 weeks ago

I think the first step is to make a decision about how to structure things. Right now the scripts are split by platform: Coral, Movidius, TensorFlow, TensorFlow Lite, OpenVINO (aka Intel), and Xnor. Unfortunately the Xnor models are no longer available (thanks for that, Apple!) so we can drop that platform. Some of the earlier scripts have a confusing platform decision tree at the start, right there in the Python; that probably needs to be dropped so we can break things out into separate scripts, one per platform. Then we can write a bash script to figure out the platform and run the right script. This will probably simplify the main code in the scripts so that it is more easily read and understood by folks who want to come in and understand what's going on.
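A rough sketch of what that dispatcher could look like, written in Python here purely for illustration rather than bash (the per-platform script names are hypothetical):

```python
#!/usr/bin/env python3
"""Pick the right per-platform benchmark script and run it.

The script names below are hypothetical; the point is just to move the
"what platform am I on" decision out of the benchmark code itself.
"""
import importlib.util
import subprocess
import sys

SCRIPTS = {
    "coral": "benchmark_coral.py",
    "openvino": "benchmark_openvino.py",
    "tflite": "benchmark_tflite.py",
    "tensorflow": "benchmark_tensorflow.py",
}

def detect_platform():
    # Check for platform-specific Python bindings, most specific first.
    if importlib.util.find_spec("pycoral") is not None:
        return "coral"
    if importlib.util.find_spec("openvino") is not None:
        return "openvino"
    if importlib.util.find_spec("tflite_runtime") is not None:
        return "tflite"
    if importlib.util.find_spec("tensorflow") is not None:
        return "tensorflow"
    return None

if __name__ == "__main__":
    platform = detect_platform()
    if platform is None:
        sys.exit("No supported inference framework found on this machine.")
    print(f"Detected platform: {platform}")
    subprocess.run([sys.executable, SCRIPTS[platform]], check=True)
```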

aallan commented 4 weeks ago

Ping (perhaps?) @petewarden and @dansitu for their comments?

Pete, Dan! These are the scripts (and models and the original image) I used when I benchmarked accelerator hardware back in 2019. The code is slightly out of date, but the changes to get it working again are probably fairly minimal, although installation of pycoral and other supporting things is somewhat problematic these days. Jeff and I have been discussing the need for a "real world" benchmark for NPUs. We seem to have a fresh batch of next-gen NPUs popping up right now, e.g. Hailo and others. Thoughts?

aallan commented 4 weeks ago

I guess the first step is getting the code working again before a proper refactor.

aallan commented 4 weeks ago

So, dropping AI2GO as a framework since it's no longer available, the frameworks used would be,

Hardware to test against would be,

What new hardware should we be looking at, and what frameworks does it need? Presumably Hailo; what else?

geerlingguy commented 3 weeks ago

I'd also like to see in general how other new platforms touting built-in NPUs fare—so like Apple M4, Intel Lunar Lake, Snapdragon X, and AMD Strix Point... some of these platforms are hard to come by (and may also be a bit weird), but hopefully have bindings we could use.

I think the shorter term goal is just making the tests simple and reproducible (make it really easy to run, maybe even a convenience script). Then making the depth of support grow if that first goal's met!

aallan commented 3 weeks ago

> I'd also like to see in general how other new platforms touting built-in NPUs fare—so like Apple M4, Intel Lunar Lake, Snapdragon X, and AMD Strix Point... some of these platforms are hard to come by (and may also be a bit weird), but hopefully have bindings we could use.

That's going to heavily depend on software support. Just running TensorFlow on a lot of these platforms will work, but it will be unaccelerated by whatever built-in NPU the platform has onboard. I had started to look at the Beaglebone AI board, which had just been pre-announced as I went to work for Pi, but the software just wasn't there, and it seemed unfair to run unaccelerated TensorFlow models on the board and get a really bad result. Things seem more mature there now, and there does seem to be software, but it's all based on TI's TIDL framework.

Poking around the Hailo docs it looks like there are Python bindings to the HailoRT framework, so we should be able to take the benchmark's TensorFlow model and convert it to HEF format for that family of hardware, at least.

But this is the thing that really annoys me. Every manufacturer feels like they need to reinvent the wheel. Every new hardware platform has a new software framework. Even Google created a whole framework (two, in fact: first edgetpu and then pycoral) on top of TensorFlow Lite (which is their own software!) before you can use their Coral hardware. That means you have to jump through hoops to convert a normal TensorFlow model into whatever weird format you need this time around. Getting things working on Intel hardware with OpenVINO was especially hard!
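As an illustration of that extra layer, even a minimal Coral classification run goes through pycoral rather than plain TensorFlow Lite. A rough sketch; the model and image filenames are placeholders, and the model has to be compiled for the Edge TPU first:

```python
from PIL import Image
from pycoral.adapters import classify, common
from pycoral.utils.edgetpu import make_interpreter

# Placeholder filenames; a real run needs an Edge TPU-compiled model.
interpreter = make_interpreter("mobilenet_v2_quant_edgetpu.tflite")
interpreter.allocate_tensors()

# Resize the test image to the model's input size and load it.
size = common.input_size(interpreter)
common.set_input(interpreter, Image.open("fruit.jpg").resize(size))

interpreter.invoke()

# Top-1 result, as (class id, score) pairs.
for result in classify.get_classes(interpreter, top_k=1):
    print(result.id, result.score)
```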

> I think the shorter term goal is just making the tests simple and reproducible (make it really easy to run, maybe even a convenience script). Then making the depth of support grow if that first goal's met!

Agreed. I'm trying to decide whether the "what platform am I on" decision should be made as close to inferencing as possible, or right at the start. The obvious architecture would be to ingest a TensorFlow model and then do the model conversion automagically, so the user just throws an arbitrary model at the script and it figures out what architecture it's on, converts the model to the right format, and runs the benchmark. But that's almost certainly going to be impossible: converting a model is a very manual affair in most cases, as it takes knowledge of the model internals.
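Even the most mechanical conversion, TensorFlow to fully quantised TensorFlow Lite, shows the problem: calibration needs a representative dataset that matches the model's real input pipeline, which can't be guessed for an arbitrary model. A minimal sketch, with placeholder paths and dummy calibration data:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples must match the model's real input pipeline
    # (shape, dtype, preprocessing) -- exactly the part that can't be
    # inferred automatically for an arbitrary model.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("mobilenet_v2_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("mobilenet_v2_quant.tflite", "wb") as f:
    f.write(converter.convert())
```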

aallan commented 2 weeks ago

Spun up an issue #2 to keep track of Hailo support.

dansitu commented 2 weeks ago

I wonder if there's any value in working with tinymlperf on merging/extending these benchmarks?

https://mlcommons.org/2021/06/mlperf-tiny-inference-benchmark/

nullr0ute commented 1 week ago

> I'd also like to see in general how other new platforms touting built-in NPUs fare—so like Apple M4, Intel Lunar Lake, Snapdragon X, and AMD Strix Point... some of these platforms are hard to come by (and may also be a bit weird), but hopefully have bindings we could use.

Apple M4 uses Arm's SME, so it should be possible to test that with CPU code that has the appropriate optimisations, maybe using something like tinygrad or llama.cpp.

> The Beaglebone AI board

There's now also a Beaglebone AI-64, and the Jetson Nano is now EOL, replaced by the Jetson Orin Nano.

> But this is the thing that really annoys me. Every manufacturer feels like they need to reinvent the wheel. Every new hardware platform has a new software framework.

Does it make sense to add a generic OpenCL/Vulkan compute test framework? Most small edge devices that have a GPU have the potential to do OpenCL if the drivers support it.
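One starting point would be for a wrapper script to simply probe whether usable OpenCL drivers exist before attempting any GPU compute test. A minimal sketch, assuming pyopencl is installed (it is not currently a dependency of this repository):

```python
import pyopencl as cl  # assumption: pyopencl available, not a current dependency

try:
    platforms = cl.get_platforms()
except cl.Error:
    platforms = []

if not platforms:
    print("No usable OpenCL drivers found; skip the GPU compute test.")
else:
    # List every device the drivers expose, so the benchmark can pick one.
    for platform in platforms:
        for device in platform.get_devices():
            print(f"{platform.name}: {device.name} "
                  f"({cl.device_type.to_string(device.type)})")
```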