NEMSLinux / legacy-nems-scripts

System scripts located in /usr/local/share/nems/nems-scripts on NEMS Linux
GNU General Public License v3.0

benchmarking approach is insecure and generates only numbers without meaning #2

Closed: ThomasKaiser closed this issue 5 years ago

ThomasKaiser commented 5 years ago

See https://www.cnx-software.com/2019/04/30/giggle-score-odroid-n2-best-value-raspberry-pi-zero-worst-value/#comments

Cat5TV commented 5 years ago

Thank you, @ThomasKaiser - I have posted a reply.

Cat5TV commented 5 years ago

v2 has been released to address the issues raised.

ThomasKaiser commented 5 years ago

And this is now how you try to determine a board's performance? An averaged 7z score?

Cat5TV commented 5 years ago

It's a fine start, though I'll be adding to it - it's the first iteration after deprecating sysbench. I'd love to hear your ideas and those of the community.

ThomasKaiser commented 5 years ago

I don't get the reason behind taking two single-threaded 7-zip scores and then averaging them. It messes up both the single-threaded performance scores and the multi-threaded ones.

Just imagine an octa-core A53 device like the NanoPi Fire3 and a quad-core A53 device running at the same clockspeed. Your score will show both of them performing identically, while with multi-threaded loads the octa-core device is almost twice as fast. With workloads that scale well with the number of CPU cores (number crunching, build farms, rendering, server workloads in general) the core count matters too. If you hide this you just generate new numbers without meaning.
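For illustration, a minimal sketch of how both numbers could be reported separately instead of being collapsed into one average (this is not the NEMS code; it assumes p7zip's 7z binary and util-linux taskset are available, and the awk field matching the 'Tot:' summary line may need adjusting between p7zip versions):

```sh
#!/bin/bash
# Sketch: report single-threaded and multi-threaded 7-zip MIPS as two
# separate numbers instead of averaging them into one.

# Single-threaded score, pinned to cpu0:
st=$(taskset -c 0 7z b -mmt=1 | awk '/^Tot:/ {print $4}')

# Multi-threaded score utilizing all CPU cores:
mt=$(7z b "-mmt=$(nproc)" | awk '/^Tot:/ {print $4}')

echo "single-threaded: ${st} MIPS, multi-threaded: ${mt} MIPS"
```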

Wrt single-threaded performance this averaging approach is unfortunately also wrong. With use cases like 'light desktop', and with most software still being written in a single-threaded fashion, on big.LITTLE/DynamIQ designs that combine slow and fast cores almost only the single-threaded performance of the fast cores matters, as long as there are at least two of them.
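A sketch of what per-core measurement could look like, so the fast cores' scores aren't averaged away (the core numbering is an assumption; on an ODROID-XU4, for instance, cpu0-3 are the slow A7 cores and cpu4-7 the fast A15 cores):

```sh
#!/bin/bash
# Sketch: run the single-threaded 7-zip benchmark pinned to each core
# in turn. On a big.LITTLE SoC the figure that matters for
# single-threaded use cases is the maximum, not the mean.
for cpu in $(seq 0 $(( $(nproc) - 1 ))); do
    mips=$(taskset -c "${cpu}" 7z b -mmt=1 | awk '/^Tot:/ {print $4}')
    echo "cpu${cpu}: ${mips} MIPS"
done
```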

The following list contains 7-zip MIPS scores for the first and last CPU core, then your averaged score, and then the multi-threaded 7-zip score fully utilizing all CPU cores (all results from https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md):

By averaging the 7z b scores from cpu0 and the last cpu you create the impression that the XU4 delivers 150% of the performance of an RPi 3 B+. In reality it's 190% for single-threaded use cases (you need to look at a big core, which scores 1633 7-zip MIPS, and compare that to the RPi's 856), and with multi-threaded loads it's even 230% (7100 vs. 3050 7-zip MIPS).
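Spelling out the arithmetic behind those percentages (the XU4 little-core figure of roughly 935 MIPS is back-computed from the 150% result, not taken from the results list):

```sh
# RPi 3 B+: sampled cores both ~856 MIPS        -> average   856
# XU4:      cpu0 ~935 MIPS, big core 1633 MIPS  -> average ~1284 (150% of 856)
#
# Actual single-threaded ratio: 1633 /  856 ≈ 1.9 (190%)
# Actual multi-threaded ratio:  7100 / 3050 ≈ 2.3 (230%)
```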

The same picture appears when looking at the other, more modern 6-core SoCs RK3399 and S922X. Not taking the number of CPU cores into account is wrong when talking about multi-threaded workloads. Not looking solely at the fast cores to determine single-threaded performance is wrong with big.LITTLE/DynamIQ SoCs.

And the whole approach is wrong anyway, since there is not a single performance metric that properly describes how fast a specific SBC is with a specific use case. It's always about the use case first, and such 'single metrics' are just misleading, since different performance criteria matter for different use cases. Using 7-zip's internal benchmark mode to measure 'overall SBC performance' is as wrong as using sysbench, since it still ignores that there are many different use cases for which different things are important (not even talking here about the flawed 'algorithm' that combines two 7z b scores into one meaningless number).

Cat5TV commented 5 years ago

Thanks, that's great information. So are you saying that rather than running single-threaded tests on the first and last core, I should also be running a multi-threaded test across all cores? My goal obviously is to find the best way to arrive at a reasonable average for each SBC.

Use case isn't as big a deal, since really we just need to provide a decent comparative metric. Though I have given thought to including other use-case-centric benchmarks to further allow people to home in on what matters most to them, at this early stage that's not yet a priority (till we get this initial metric as accurate as possible).

So, perhaps in bullet form, what would you say is the "most accurate" use-case-agnostic means of comparing a Raspberry Pi 3 B+ to an ODROID-XU4?

I appreciate the assistance; thanks!

ThomasKaiser commented 5 years ago

> My goal obviously is to find the best way to arrive at a reasonable average for each SBC

That's impossible, for the simple reason that for many use cases CPU horsepower alone matters less than other things, and even when looking only at CPU horsepower the use case still dictates which performance metric is important and which is not.

It's always about the use case first, and even then this whole comparison game is not purely a matter of hardware but unfortunately also of software and settings.

Choosing the right SBC for the job is not a matter of staring at graphs that hopefully somewhat correctly describe certain performance metrics. It's all about knowing the relevant details. Educating SBC users is the important task, not inventing yet another bunch of useless numbers (Phoronix already exists and collects an unbelievable amount of meaningless numbers over at https://openbenchmarking.org).

Cat5TV commented 5 years ago

BOB addresses this issue. I will continue development; thank you for your input.