embench / embench-iot

The main Embench repository
https://www.embench.org/

Understanding Benchmark Speed score #137

LaurelinTheGold opened this issue 3 years ago (Open)

LaurelinTheGold commented 3 years ago

I am a little confused, but this is my understanding of the speed benchmark, based on the readme, the results, and past issues.

  1. Each benchmark is linearly scaled to the processor clock. If crc32 takes n cycles on a board at 1 MHz, it will take 2n cycles on a board at 2 MHz.
  2. The real time taken by a benchmark is the cycle count divided by the clock speed. The goal is for each benchmark to take ~4 s to run.
  3. We normalize the real time by dividing by the clock in MHz.
  4. The baseline times are not normalized.
  5. The Embench speed score is the normalized time of the baseline divided by the normalized time of the chip being tested.
  6. The speed benchmark program, as is, outputs unnormalized times in absolute mode and, in relative mode, the ratio of the unnormalized baseline time to the unnormalized test-chip time.
  7. The Embenches per MHz is the speed score computed from normalized times, divided by MHz. This quantifies how much larger the Embench speed score would be if the board were clocked 1 MHz higher.
  8. Because the clock-speed factors cancel, the unnormalized time ratios the speed benchmark outputs actually give the per-MHz figure directly (assuming the baseline is at 1 MHz), so the real speed score can be obtained by multiplying the output speed by the clock of the board being used.

Point 8 seems to be how the speed results are being recorded on the results repository.
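
To make the cancellation in point 8 concrete, here is a small Python sketch. Every number in it is made up; the only assumption is that the iteration count scales with CPU_MHZ, as in point 1:

```python
BASE_CPI = 100.0  # cycles per iteration on the reference platform (made up)
TEST_CPI = 50.0   # cycles per iteration on the platform under test (made up)
LSF = 1000        # LOCAL_SCALE_FACTOR for some benchmark (made up)

def raw_time_s(cpi, cpu_mhz):
    """Wall-clock time: the iteration count scales with CPU_MHZ, so the clock cancels."""
    iterations = LSF * cpu_mhz
    return cpi * iterations / (cpu_mhz * 1e6)  # cycles / Hz -> seconds

base = raw_time_s(BASE_CPI, cpu_mhz=1)    # baseline taken at 1 MHz
test = raw_time_s(TEST_CPI, cpu_mhz=100)  # any clock gives the same raw time
per_mhz = base / test  # unnormalized time ratio == per-MHz score (2.0 here)
score = per_mhz * 100  # multiply by the test clock, as point 8 says
```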

Assuming this is correct, it would be helpful to update the readme to:

  1. emphasize the difference between the real time and the normalized time;
  2. clarify that the workload is scaled by cpu clock;
  3. clarify that 4 implies the baseline speed score will be 16;
  4. clarify that 8 implies 6, so while 5 is true, the speed score is not what benchmark_speed.py outputs;
  5. clarify that 8 implies we do not need to worry about normalizing the time at all, since we can just multiply the speed output by the cpu clock.

Roger-Shepherd commented 3 years ago

Well LaurelinTheGold, you've raised a can of worms here. I thought the situation was unclear but explicable; however, it looks like we have ended up with a discrepancy between theory (documentation) and practice. Anyhow, this is what I think is going on; I may be wrong. But whether I'm right or wrong, your point that we have a documentation problem remains valid.

When I got Embench working on an Apple Mac I also found the documentation about CPU_MHZ and normalised times confusing. The documentation is not helped by reporting speed in ms: ms is a unit of time, and speed is 1/time. Fortunately the code says what actually happens and, eventually, I understood the code (I think).

Before building on your comments, I want to address the number of iterations performed by each program in the suite. This is determined by the product of two numbers, LOCAL_SCALE_FACTOR and CPU_MHZ. LOCAL_SCALE_FACTOR is defined for each program; its purpose is to make the execution time of the program on a nominal 1 MHz processor around 4 s. (4 s is chosen because it is long enough to make time measurements reliable, and short enough that the whole benchmark suite can be run in a reasonable time.) CPU_MHZ is defined per target (processor); its purpose is to scale the number of iterations so that a fast processor still takes around 4 s to run the program. [Detail: the number of iterations is a compile-time constant. This avoids problems on embedded systems, where it would be difficult to pass a parameter at run time.]
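
As a sanity check of that description, the arithmetic in Python (the cycles-per-iteration figure is invented; LOCAL_SCALE_FACTOR is a per-benchmark constant and CPU_MHZ is fixed per target at build time, as above):

```python
CYCLES_PER_ITERATION = 4000              # hypothetical figure for one benchmark
LSF = 4_000_000 // CYCLES_PER_ITERATION  # tuned so a 1 MHz core runs ~4 s

for cpu_mhz in (1, 16, 100):
    iterations = LSF * cpu_mhz           # a compile-time constant in the suite
    seconds = CYCLES_PER_ITERATION * iterations / (cpu_mhz * 1e6)
    print(f"{cpu_mhz:>3} MHz: {iterations:>6} iterations, ~{seconds:.1f} s")
```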

Moving to your comments:

  1. Yes. The assumption is that, once the caches are warmed, execution time is linear in the number of iterations. There is a proposal on the table for Embench v2.0 that we run each program N times and 2N times, and use the difference as the (nominal) time for N iterations, avoiding a separate warming phase.
  2. Strictly, the time is determined by the routines start_trigger and stop_trigger, interpreted by decode_results in the Python file indicated by `--target-module`. For the Mac I use the o/s function clock_gettime, which gives me real time independent of frequency. For systems which count cycles, the relationship time = cycles/frequency is used.
  3. The documentation says this but it never happens.
  4. True
  5. In fact the non-normalized times are used (which yields the same result): `rel_data[bench] = baseline[bench] / raw_data[bench]` (line 296 of benchmark_speed.py). The resulting score says how many times faster the platform being measured is than the reference. (See the sketch after this list.)
  6. Yes.
  7. This is where I get confused... The user guide in doc/README.md says:

    \"The reference CPU is an Arm Cortex M4 processor .... The reference platform is a ST Microelectronics STM32F4 Discovery Board ... using its default clock speed of 16MHz\".

"The benchmark value is the geometric mean of the relative speeds. A larger value means a faster platform. The range gives an indication of how much variability there is in this performance."

"In addition the geometric mean may then be divided by the value used for CPU_MHZ, to yield an Embench score per MHz.""

  7. (continued) So by definition the reference platform's benchmark value must be 1.0, and the Embench score per MHz must be 1.0/16, which is 0.0625. However, the reported results in embench/embench-iot-results include a couple of 16 MHz M4s which report speeds of 14.4 and 16.0, and speeds per MHz of 0.9 and 0.93 respectively. These are wrong by the definition above, and look like they are using a 1 MHz processor as the reference, i.e. as you describe.
  8. I think the definition in the documentation says something different (above), but you are describing what seems to be done.
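
For anyone following along, a minimal sketch of the score computation as I read it. The rel_data line is the one quoted above from benchmark_speed.py and the geometric mean is the documentation's definition; the scaffolding around them is mine:

```python
from math import prod

def speed_score(raw_data, baseline):
    """Geometric mean of the per-benchmark relative speeds (baseline / measured)."""
    rel_data = {b: baseline[b] / raw_data[b] for b in raw_data}
    return prod(rel_data.values()) ** (1.0 / len(rel_data))

# Made-up times (ms): twice as fast on crc32, equal on aha-mont64 -> sqrt(2) ~= 1.41.
print(speed_score({"crc32": 2000.0, "aha-mont64": 4000.0},
                  {"crc32": 4000.0, "aha-mont64": 4000.0}))
```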

Clearly there is a problem with the documentation. Someone (maybe me, maybe other people) is misinterpreting it, to the extent that I think people are publishing wrong results. Regarding your suggestions:

  1. "emphasize the difference between the real time and the normalized time" I agree. The current state of affairs is not clear. (see my final comment also)
  2. "clarify that the workload is scaled by cpu clock" I agree.
  3. "clarify that 4 implies the baseline speed score will be 16" I think this is a change, not a clarification (but maybe I've misunderstood the wording)
  4. "clarify that 8 implies 6 so while 5 is true, the speed score is not what benchmark_speed.py outputs" I disagree. The score output is (I think) what the documentation defines.
  5. "clarify that 8 implies we do not need to worry about normalizing the time at all since we can just multiply the speed output by the cpu clock". Except I think this is wrong; it is what people are doing, but it seems to contradict the documentation.

I suspect we have to change the documentation to match usage, although from a quick look at a couple of papers which use Embench, it seems people quote figures relative to their own baseline, so perhaps we can keep the definitions we have and correct our published results. (Or maybe I've got things wrong here.)

Personally, I think we should define an Embench MIP, which is the speed of a nominal 1 MIP processor running Embench (anyone want to port Embench to a vintage VAX 11/780?). This can be just a fudge factor on the reference score: if people think a 16 MHz M4 is a worthy 1 MIP processor, then the reference platform would be a 16 Embench MIP platform.

Finally, for V2, I think we should have a scheme where there is a per-benchmark normalisation factor NF which works in place of CPU_MHZ to scale the execution time to about 4 s; that is, the number of iterations would be LOCAL_SCALE_FACTOR * NF and the reported time would be actual time / NF. This allows differences in the performance characteristics of processors to be accommodated.
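
A sketch of how that might look; the function name and figures here are mine, purely illustrative:

```python
def v2_reported_time(actual_time_s, nf):
    """Hypothetical V2 reporting: NF stretches the run to ~4 s, then is divided out."""
    return actual_time_s / nf

# A fast platform might need NF = 8 to reach ~4 s of wall time; the reported
# time is then the ~0.5 s that one NF unit of work would have taken.
print(v2_reported_time(4.0, 8))  # 0.5
```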

LaurelinTheGold commented 3 years ago

Hi Roger, most of what you are saying makes sense, but I will disagree on point 5, that normalized time ratios yield the same results as unnormalized time ratios.

If the chip has a clock of C and the baseline benchmark takes N cycles, then the scaled cycle count is CN and the time taken to run is CN/C = N. The normalized time to run would be N/C (C being unitless). If we set the baseline normalized time B = N_0/C_0, the ratio of the normalized times is B/(N/C) = BC/N = N_0C/(NC_0). The ratio of the real times is N_0/N = score-per-MHz. So even if the baseline clock is set to 1 MHz, the normalized-time ratio still gives score-per-MHz * C, where C is the clock speed of the chip being tested.
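
A quick numeric check of that algebra, with arbitrary values:

```python
N0, C0 = 4000.0, 1.0   # baseline: raw time and clock (arbitrary values)
N,  C  = 1000.0, 20.0  # test chip: raw time and clock (arbitrary values)

score_per_mhz = N0 / N                       # ratio of raw (unnormalized) times
norm_ratio = (N0 / C0) / (N / C)             # ratio of normalized times
assert norm_ratio == score_per_mhz * C / C0  # == score_per_mhz * C when C0 == 1
print(score_per_mhz, norm_ratio)             # 4.0 80.0
```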

Is there a better way of finding papers that use embench other than google scholar searching embench?

I am not experienced enough in benchmarking lore to worry about V2 yet.

Thanks for the reply!

Roger-Shepherd commented 3 years ago

LaurelinTheGold,

"... I will disagree on point 5 that normalized time ratios yields the same results as unnormalized time ratios.

If the chip has a clock of C and the baseline benchmark takes N cycles, then the scaled cycles is CN and the time taken to run is CN/C=N. The normalized time to run would be N/C (C being unitless). If we set the baseline normalized time B=N_0/C_0, the ratio of the normalized times is B/(N/C)=BC/N=N_0C/(NC_0). The ratio of the real times is N/N_0=scorepermhz. Even if the baseline clock is set to 1MHz, the normalized time still gives scorepermhz*C where C is the clock speed of the chip being tested.

You are right.

Quickly thinking about my responses to 6, 7, and 8:

  6. We are in agreement, and I think we are correct.

  7. I need to work this through.

  8. The problem is that the baseline (reference platform) is 16 MHz.

You being right about 5 means I don't understand how the reporting is working! Line 549 of embench-iot/doc/README.md says "These computations are carried out by the benchmark scripts". I can't see that the "benchmarking scripts" in embench/embench-iot do this, and from a quick look at embench/embench-iot-results I can't see a solution there either. In particular, I can't see how the normalised reference results are produced. If the reference platform is run using --baseline-output, the non-normalised results are output. There is a comment (line 243 in embench-iot-results/embres/data.py) which says "# Speed data in file is per MHz" (i.e. normalised), but the only way I can see that being true is if the results have been edited: the results produced by benchmark_speed.py aren't normalised.

"Is there a better way of finding papers that use embench other than google scholar searching embench?"

Not that I know of.

hirooih commented 2 years ago

@LaurelinTheGold, and @Roger-Shepherd,

I was also confused by the description in the "Computing a benchmark value for speed" section. Your discussion above helped me to understand it.

Here is my summary.


size

speed

CPIter[bench] := cycles per iteration of each benchmark
LOCAL_SCALE_FACTOR[bench] := 4,000,000 / (CPIter[bench] of the Cortex M4)

execution time [ms]
  == (the total number of cycles of a benchmark) / CPU_MHZ / 1000
  == (CPIter[bench] * LOCAL_SCALE_FACTOR[bench] * CPU_MHZ) / CPU_MHZ / 1000
  == CPIter[bench] * LOCAL_SCALE_FACTOR[bench] / 1000
  == raw_data[bench]   (=~ 4000 for Cortex M4 at -O2)

rel_data[bench] = baseline[bench] / raw_data[bench]
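
The same chain as runnable Python, with a made-up CPIter figure, showing that CPU_MHZ cancels out of raw_data:

```python
CPITER_M4 = 4000              # hypothetical cycles per iteration on the M4
LSF = 4_000_000 // CPITER_M4  # LOCAL_SCALE_FACTOR per the definition above

def raw_data_ms(cpiter, cpu_mhz):
    total_cycles = cpiter * LSF * cpu_mhz
    return total_cycles / cpu_mhz / 1000  # CPU_MHZ cancels: ~4000 ms on the M4

print(raw_data_ms(CPITER_M4, 16))  # 4000.0, independent of the 16
```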

The current last two paragraphs:

The benchmark value is the geometric mean of the relative speeds. A larger value means a faster platform. The range gives an indication of how much variability there is in this performance.

In addition the geometric mean may then be divided by the value used for CPU_MHZ, to yield an Embench score per MHz. This is an indication of the efficiency of the platform in carrying out computation.

How about changing this as follows?

The geometric mean yields an Embench speed score per MHz. This is an indication of the efficiency of the platform in carrying out computation. A larger value means a more efficient platform. The range gives an indication of how much variability there is in this performance.

In addition, the geometric mean is then multiplied by the value used for CPU_MHZ. A larger value means a faster platform.

If you agree with me, shall I send a PR?