LaurelinTheGold opened this issue 3 years ago
Well LaurelinTheGold, you've raised a can of worms here. I thought the situation was unclear but explicable. However, it looks like we ended up with a discrepancy between theory (documentation) and practice. Anyhow, this is what I think is going on, I may be wrong. But whether I'm right or wrong, your point that we have a documentation problem remains valid.
When I got Embench working on an Apple Mac I also found the documentation about `CPU_MHZ` and normalised times confusing. The documentation is not helped by reporting speed in ms: ms is a time, speed is 1/time. Fortunately the code says what actually happens and, eventually, I understood the code. (I think.)
Before building on your comments, I want to address the number of iterations performed by each program in the suite. This is determined by the product of two numbers: `LOCAL_SCALE_FACTOR` and `CPU_MHZ`. `LOCAL_SCALE_FACTOR` is defined for each program; its purpose is to make the execution time of the program on a nominal 1 MHz processor around 4s. (4s is chosen because it is long enough to make time measurements reliable, and short enough that the whole benchmark suite can be run in a reasonable time.) `CPU_MHZ` is defined per target (processor); its purpose is to scale the number of iterations so that a fast processor still takes around 4s to run the program. [Detail: the number of iterations is a compile-time constant. This was to avoid problems with embedded systems where it would be difficult to pass a parameter at run time.]
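As a rough sketch of that relationship (the constants below are invented for illustration, not taken from any real benchmark):

```python
# Illustration only: how LOCAL_SCALE_FACTOR and CPU_MHZ combine to give
# roughly 4 s of execution, independent of clock speed.
LOCAL_SCALE_FACTOR = 2322        # hypothetical per-benchmark constant
CPU_MHZ = 16                     # hypothetical per-target constant
iterations = LOCAL_SCALE_FACTOR * CPU_MHZ   # compile-time iteration count

cycles_per_iteration = 1723      # hypothetical cost of one iteration
total_cycles = iterations * cycles_per_iteration
seconds = total_cycles / (CPU_MHZ * 1_000_000)
print(seconds)                   # ~4.0 s, whatever value CPU_MHZ takes
```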
Moving to your comments:
Times are measured between `start_trigger` and `stop_trigger`, and interpreted by `decode_results` in the Python file indicated by `--target-module`. For the Mac I use the o/s function `clock_gettime`, which gives me real time independent of frequency. For systems which count cycles, the relationship time = cycles / frequency is used.
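For what it's worth, that cycles-to-time conversion amounts to something like the following helper (an illustration, not the actual code in any target module):

```python
def cycles_to_ms(cycles: int, cpu_mhz: float) -> float:
    """Convert a cycle count to elapsed real time in milliseconds:
    time [s] = cycles / (cpu_mhz * 1e6), so time [ms] = cycles / cpu_mhz / 1000."""
    return cycles / cpu_mhz / 1000.0

# e.g. 64,000,000 cycles on a 16 MHz part is 4000 ms, i.e. about 4 s
print(cycles_to_ms(64_000_000, 16))
```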
The relative speed is then computed as (line 296 of `benchmark_speed.py`): `rel_data[bench] = baseline[bench] / raw_data[bench]`.
The resulting score says how many times faster the platform being measured is than the reference. `doc/README.md` says:
\"The reference CPU is an Arm Cortex M4 processor .... The reference platform is a ST Microelectronics STM32F4 Discovery Board ... using its default clock speed of 16MHz\".
"The benchmark value is the geometric mean of the relative speeds. A larger value means a faster platform. The range gives an indication of how much variability there is in this performance."
"In addition the geometric mean may then be divided by the value used for CPU_MHZ, to yield an Embench score per MHz.""
`embench/embench-iot-results` includes a couple of 16 MHz M4s which report speeds of 14.4 and 16.0, and speeds per MHz of 0.9 and 0.93 respectively. These are wrong by the definition above: measured against the 16 MHz M4 reference, a 16 MHz M4 should score about 1.0, and about 0.06 per MHz. The published figures look like they are using a 1 MHz processor as the reference, i.e. as you describe.

Clearly there is a problem with the documentation. Someone (maybe me, maybe other people) is misinterpreting it, to the extent that I think people are publishing wrong results. Regarding your suggestions:
I suspect we have to change the documentation to match usage, although from a quick look at a couple of papers which use Embench it seems people quote figures relative to their own baseline, so perhaps we can keep the definitions we have and correct our published results. (Or maybe I've got things wrong here.)
Personally, I think we should define an Embench MIP which is the speed of a nominal 1 MIP processor running Embench (anyone want to port Embench to a vintage VAX 11/780?). This can be just a fudge factor to the reference score - if people think a 16MHz M4 is a worthy 1 MIP processor, then the reference platform would be a 16 Embench MIP platform.
Finally, for V2, I think we should have a scheme where there is a per-benchmark normalisation factor `NF` which works in place of `CPU_MHZ` to scale the execution time to about 4s - that is, the number of iterations would be `LOCAL_SCALE_FACTOR * NF` and the reported time would be `actual time / NF`. This allows differences in the performance characteristics of processors to be accommodated.
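As a sketch of what that might look like (names and numbers are hypothetical):

```python
# Hypothetical V2 scheme: NF takes the place of CPU_MHZ in scaling iterations.
LOCAL_SCALE_FACTOR = 2322        # per-benchmark, as today
NF = 40                          # per-benchmark normalisation factor for this platform

iterations = LOCAL_SCALE_FACTOR * NF     # chosen so the run takes roughly 4 s
actual_time_ms = 4120.0                  # measured wall-clock time
reported_time_ms = actual_time_ms / NF   # the figure that would be reported
print(iterations, reported_time_ms)
```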
Hi Roger, most of what you are saying makes sense, but I will disagree on point 5 that normalized time ratios yield the same results as unnormalized time ratios.
If the chip has a clock of C and the baseline benchmark takes N cycles, then the scaled cycles is CN and the time taken to run is CN/C = N. The normalized time to run would be N/C (C being unitless). If we set the baseline normalized time B = N_0/C_0, the ratio of the normalized times is B/(N/C) = BC/N = N_0 C/(N C_0). The ratio of the real times is N_0/N = scorepermhz. Even if the baseline clock is set to 1 MHz, the normalized time still gives scorepermhz * C, where C is the clock speed of the chip being tested.
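A quick numerical check of this argument (with made-up values) bears it out:

```python
# N0, C0: baseline cycles and clock (MHz); N, C: chip under test
N0, C0 = 4000, 1.0    # baseline treated as a nominal 1 MHz reference
N,  C  = 1000, 48.0   # hypothetical chip under test

score_per_mhz = N0 / N               # ratio of real (unnormalised) times
norm_ratio = (N0 / C0) / (N / C)     # ratio of normalised times

print(score_per_mhz, norm_ratio)     # 4.0 and 192.0: norm_ratio == score_per_mhz * C
```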
Is there a better way of finding papers that use Embench other than searching Google Scholar for embench?
I am not experienced enough in benchmarking lore to worry about V2 yet.
Thanks for the reply!
LaurelinTheGold,
"... I will disagree on point 5 that normalized time ratios yields the same results as unnormalized time ratios.
If the chip has a clock of C and the baseline benchmark takes N cycles, then the scaled cycles is CN and the time taken to run is CN/C=N. The normalized time to run would be N/C (C being unitless). If we set the baseline normalized time B=N_0/C_0, the ratio of the normalized times is B/(N/C)=BC/N=N_0C/(NC_0). The ratio of the real times is N/N_0=scorepermhz. Even if the baseline clock is set to 1MHz, the normalized time still gives scorepermhz*C where C is the clock speed of the chip being tested.
You are right.
Quickly thinking about my responses to 6, 7, and 8: we are in agreement about 6, and I think we are correct. I still need to work through the others. The problem is that the baseline (reference platform) is 16 MHz.
You being right about 5 means I don't understand how the reporting is working! Line 549 of `embench-iot/doc/README.md` says "These computations are carried out by the benchmark scripts". I can't see that the "benchmarking scripts" in `embench/embench-iot` do this, and from a quick look at `embench/embench-iot-results` I can't see a solution there. In particular, I can't see how the normalised reference results are produced. If the reference platform is run using `--baseline-output`, the non-normalised results are output. There is a comment (line 243 in `embench-iot-results/embres/data.py`) which says "# Speed data in file is per MHz" (i.e. normalised), but the only way I can see that being true is if the results have been edited - the results produced by `benchmark_speed.py` aren't normalised.
"Is there a better way of finding papers that use Embench other than searching Google Scholar for embench?"
Not that I know of.
@LaurelinTheGold and @Roger-Shepherd,
I was also confused by the description in the "Computing a benchmark value for speed" section. Your discussion above helped me understand it.
Here is my summary.
CPIter[bench] := cycles per iteration of each benchmark
LOCAL_SCALE_FACTOR[bench] := 4,000,000 / (CPIter[bench] of the Cortex M4)

raw_data[bench]
  == (total number of cycles of a benchmark) / CPU_MHZ / 1000
  == (CPIter[bench] * LOCAL_SCALE_FACTOR[bench] * CPU_MHZ) / CPU_MHZ / 1000
  == CPIter[bench] * LOCAL_SCALE_FACTOR[bench] / 1000
  (=~ 4000 for the Cortex M4 at -O2)

rel_data[bench] = baseline[bench] / raw_data[bench]

(reported speed) = (relative score) * CPU_MHZ
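If this summary is right, plugging in the reference board's own numbers gives, for example:

```python
# Illustrative check: the 16 MHz reference Cortex-M4 measured against itself
baseline = 4000.0            # ms, reference time (approx., per the summary above)
raw_data = 4000.0            # ms, the same board measured again
CPU_MHZ = 16

rel = baseline / raw_data    # 1.0 -- the relative score, effectively "per MHz"
speed = rel * CPU_MHZ        # 16.0 -- the kind of figure seen in the results repository
print(rel, speed)
```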
The current last two paragraphs are:
The benchmark value is the geometric mean of the relative speeds. A larger value means a faster platform. The range gives an indication of how much variability there is in this performance.
In addition the geometric mean may then be divided by the value used for CPU_MHZ, to yield an Embench score per MHz. This is an indication of the efficiency of the platform in carrying out computation.
How about changing this as follows?
The geometric mean yields an Embench speed score per MHz. This is an indication of the efficiency of the platform in carrying out computation. A larger value means a more efficient platform. The range gives an indication of how much variability there is in this performance.
In addition, the geometric mean is then multiplied by the value used for CPU_MHZ. A larger value means a faster platform.
If you agree with me, shall I send a PR?
I am a little confused, but this is my understanding of the speed benchmark from the readme, the results repository, and past issues.
Point 8 seems to be how the speed results are being recorded in the results repository.
Assuming this is correct, it would be helpful to update the readme to