NEMSLinux / legacy-nems-scripts

System scripts located in /usr/local/share/nems/nems-scripts on NEMS Linux
GNU General Public License v3.0

benchmarking approach is insecure and generates only numbers without meaning #2

Closed: ThomasKaiser closed this issue 5 years ago

ThomasKaiser commented 5 years ago

See https://www.cnx-software.com/2019/04/30/giggle-score-odroid-n2-best-value-raspberry-pi-zero-worst-value/#comments

Cat5TV commented 5 years ago

Thank you, @ThomasKaiser - I have posted a reply.

Cat5TV commented 5 years ago

v2 has been released to address the issues raised.

ThomasKaiser commented 5 years ago

And this is now how you try to determine a board's performance? An averaged 7z score?

Cat5TV commented 5 years ago

It's a fine start, though I'll be adding to it - it's the first iteration after deprecating sysbench. I'd love to hear your ideas and those of the community.

ThomasKaiser commented 5 years ago

I don't get the reason behind taking two single-threaded 7-zip scores and then averaging them. It messes up both the single-threaded performance scores and the multi-threaded ones.

Just imagine an octa-core A53 device like the NanoPi Fire3 and a quad-core A53 device running at the same clockspeed. Your score will show both of them performing identically, while with multi-threaded loads the octa-core device is almost twice as fast. With workloads that scale well with the number of CPU cores (number crunching, build farms, rendering, server workloads in general) the core count matters too. If you hide this you just generate new numbers without meaning.
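For illustration, a minimal sketch of how both numbers could be reported separately instead of being collapsed into one average (this is not the NEMS code; it assumes p7zip's 7z binary and util-linux taskset are available, and the awk field matching the 'Tot:' summary line may need adjusting between p7zip versions):

```sh
#!/bin/bash
# Sketch: report single-threaded and multi-threaded 7-zip MIPS as two
# separate numbers instead of averaging them into one.

# Single-threaded score, pinned to cpu0:
st=$(taskset -c 0 7z b -mmt=1 | awk '/^Tot:/ {print $4}')

# Multi-threaded score utilizing all CPU cores:
mt=$(7z b "-mmt=$(nproc)" | awk '/^Tot:/ {print $4}')

echo "single-threaded: ${st} MIPS, multi-threaded: ${mt} MIPS"
```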

Wrt single-threaded performance this averaging approach is unfortunately also wrong. With use cases like 'light desktop', and with most software still being written in a single-threaded fashion, on big.LITTLE/DynamIQ designs that combine slow and fast cores almost only the single-threaded performance of the fast cores matters, as long as there are at least two of them.
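A sketch of what per-core measurement could look like, so the fast cores' scores aren't averaged away (the core numbering is an assumption; on an ODROID-XU4, for instance, cpu0-3 are the slow A7 cores and cpu4-7 the fast A15 cores):

```sh
#!/bin/bash
# Sketch: run the single-threaded 7-zip benchmark pinned to each core
# in turn. On a big.LITTLE SoC the figure that matters for
# single-threaded use cases is the maximum, not the mean.
for cpu in $(seq 0 $(( $(nproc) - 1 ))); do
    mips=$(taskset -c "${cpu}" 7z b -mmt=1 | awk '/^Tot:/ {print $4}')
    echo "cpu${cpu}: ${mips} MIPS"
done
```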

The following list contains 7-zip MIPS scores for the first and last CPU core, then your averaged score, and then the multi-threaded 7-zip score fully utilizing all CPU cores (all results from https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md):

By averaging the 7z b scores from cpu0 and the last cpu you create the impression that the XU4 delivers 150% of the performance of an RPi 3 B+. In reality it's 190% for single-threaded use cases (you need to look at a big core, which scores 1633 7-zip MIPS, and compare that to the RPi's 856), and with multi-threaded loads it's even 230% (7100 vs. 3050 7-zip MIPS).
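Spelling out the arithmetic behind those percentages (the XU4 little-core figure of roughly 935 MIPS is back-computed from the 150% result, not taken from the results list):

```sh
# RPi 3 B+: sampled cores both ~856 MIPS        -> average   856
# XU4:      cpu0 ~935 MIPS, big core 1633 MIPS  -> average ~1284 (150% of 856)
#
# Actual single-threaded ratio: 1633 /  856 ≈ 1.9 (190%)
# Actual multi-threaded ratio:  7100 / 3050 ≈ 2.3 (230%)
```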

The same picture appears when looking at the other, more modern 6-core SoCs RK3399 and S922X. Not taking the number of CPU cores into account is wrong when talking about multi-threaded workloads. Not looking solely at the fast cores to determine single-threaded performance is wrong with big.LITTLE/DynamIQ SoCs.

And the whole approach is wrong anyway, since there is not a single performance metric that properly describes how fast a specific SBC is with a specific use case. It's always about the use case first, and such 'single metrics' are just misleading, since different performance criteria matter for different use cases. Using 7-zip's internal benchmark mode to measure 'overall SBC performance' is as wrong as using sysbench, since it still ignores that there are many different use cases for which different things are important (not even talking here about the flawed 'algorithm' that combines two 7z b scores into one meaningless number).

Cat5TV commented 5 years ago

Thanks, that's great information. So are you saying that rather than running single-threaded tests on the first and last core, I should also be running a multi-threaded test across all cores? My goal obviously is to find the best way to arrive at a reasonable average for each SBC.

Use case isn't as big a deal, since really we just need to provide a decent comparative metric. Though I have given thought to including other use-case-centric benchmarks to further allow people to home in on what matters most to them, at this early stage that's not yet a priority (till we get this initial metric as accurate as possible).

So, perhaps in bullet form, what would you say is the "most accurate" use-case-agnostic means of comparing a Raspberry Pi 3 B+ to an ODROID-XU4?

I appreciate the assistance; thanks!

ThomasKaiser commented 5 years ago

> My goal obviously is to find the best way to arrive at a reasonable average for each SBC

That's impossible, for the simple reason that for many use cases CPU horsepower alone matters less than other things, and even when looking only at CPU horsepower the use case still dictates which performance metric is important and which is not.

It's always about the use case first, and even then this whole comparison game is not purely a matter of hardware but unfortunately also of software and settings.

Choosing the right SBC for the job is not a matter of staring at graphs that hopefully somewhat correctly describe certain performance metrics. It's all about knowing the relevant details. Educating SBC users is the important task, not inventing yet another bunch of useless numbers (Phoronix already exists and collects an unbelievable amount of meaningless numbers over at https://openbenchmarking.org).

Cat5TV commented 5 years ago

BOB addresses this issue. I will continue development; thank you for your input.