Closed matsbror closed 1 year ago
Do you see similar cycle numbers using the -m perf-counters
option? That uses the perf_event_open
API, while our default cycles
measure uses some crate with architecture-specific inline asm that I don't necessarily trust as much.
Unfortunately, I have not yet been able to run with this interface. It's on my list to do.
I could not get the performance counters in sightglass-cli
to work so I ran it with the perf stats
command line tool. This is what I got:
$ perf stat -e cycles target/release/sightglass-cli benchmark --engine engines/wasmtime/libengine.so --processes 1 --pin -- benchmarks/spidermonkey/benchmark.wasm
compilation
benchmarks/spidermonkey/benchmark.wasm
cycles
[16715434572 16823312253.40 16847510374] engines/wasmtime/libengine.so
instantiation
benchmarks/spidermonkey/benchmark.wasm
cycles
[392988 508229.80 749048] engines/wasmtime/libengine.so
execution
benchmarks/spidermonkey/benchmark.wasm
cycles
[824692334 828807012.60 834214406] engines/wasmtime/libengine.so
Performance counter stats for 'target/release/sightglass-cli benchmark --engine engines/wasmtime/libengine.so --processes 1 --pin -- benchmarks/spidermonkey/benchmark.wasm':
16,261,475 cpu_core/cycles/ (21.24%)
242,257,066,898 cpu_atom/cycles/ (99.99%)
60.716844846 seconds time elapsed
60.455452000 seconds user
0.255980000 seconds sys
I need help interpreting this as I do not see any correleation between the cycles reported by sightglass and the cycles reported by perf stat
. It seems as if the atom cpu was used instead of the more powerful core. Is there a way to direct which cpu to use?
Recall that Sightglass attempts to precisely measure each WebAssembly phase: compilation, instantiation, execution. So the results from Sightglass above are averages for each phase. You set --processes 1
but by default --iterations-per-process
will run through the phases 10
times. So I would expect that if you multiplied each phase by 10
and added them together, you might see close results to perf
, minus any minimal Sightglass infrastructure cycles that run between the phases.
As you probably observed, --pin
only really keeps the workload on the core it was assigned to. You might want to try something like taskset --cpu-list <?>
to select which core to use. Also, @jameysharp wrote a good guide for truly isolating a single core which you can find here.
So I have some new data here. I managed to run sightglass with performance coutners on all three architectures (x86_64, aaarc64 and riscv64). It was not always easy. In particular for risdv64 I had to find un-merged pull-requests of the precision
and perf-event
crates, but in the end I managed to run most benchmarks.
Here is the graph of cycle coutns using performance counters instead of the default cycle measurement.
To me it makes much more sense than the one I included above. Given that the x86_processor I am using (i7-13800H) is a highly efficient processor in terms of instruction level parallelism, it is reasonable it has the lowest cycle count followed by the ARM64 processor in the Tegra TX2 board I am using and finally the StarFive RISCV processor in the VisionFive boards which is on the low-end of instruction level parallelism.
I am not ready to close this issue yet because, why is the default measurement of cycles in sightglass showing such weird numbers? Did anyone verify this?
The story continues: there is definitely something strange with the standard cycle counting on x86_64. I followed the instructions here to isolate execution on one core (although I had to research a bit to find out the cpu numbering on my Alder Lake processor).
The result is consistent with the graph above.
Below is an excerpt of the raw csv data:
mats@UNI12GPKS3:~/wasm/sightglass-results$ cat counters-2.csv | grep "blake3.scalar"|grep Execution|grep x86_64|grep cpu-cycles
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,0,Execution,cpu-cycles,304130
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,1,Execution,cpu-cycles,302927
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,2,Execution,cpu-cycles,309163
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,3,Execution,cpu-cycles,301744
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,4,Execution,cpu-cycles,307216
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,5,Execution,cpu-cycles,302786
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,6,Execution,cpu-cycles,295956
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,7,Execution,cpu-cycles,300327
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,8,Execution,cpu-cycles,308237
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,17206,9,Execution,cpu-cycles,297220
mats@UNI12GPKS3:~/wasm/sightglass-results$ cat cycles-2.csv | grep "blake3.scalar"|grep Execution|grep x86_64
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,0,Execution,cycles,2341224
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,1,Execution,cycles,2190532
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,2,Execution,cycles,2329172
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,3,Execution,cycles,2183376
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,4,Execution,cycles,2193892
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,5,Execution,cycles,2229460
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,6,Execution,cycles,2475904
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,7,Execution,cycles,2198780
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,8,Execution,cycles,2209824
x86_64,engines/wasmtime/libengine.so,benchmarks/blake3-scalar/benchmark.wasm,15157,9,Execution,cycles,2245676
As you can see, the cycle counts are 10x compared to when using coutners. For aarch64 and riscv64 the difference is of similar magnitude but in the other direction. Since the counter results are more logical, I tend to believe them more than the ones from using the precision crate.
I suspect this issue is not directly related to Sightglass — the measurement mechanisms we use are all external (see crates/recorder/src/measure
). But it affects Sightglass so it makes sense to get to the bottom of this. Perhaps the thing to do here is to eliminate all of the Sightglass infrastructure altogether and validate the precision
and perf-event
crates on your systems. E.g., if we could create a workload that we know exactly executes N instructions, we could measure that workload with both precision
and perf-event
and see if the results still differ widely there.
As you are probably aware, cycle-based measurement is problematic (e.g., frequency scaling!). And these cycle measurements you are looking at are being collected in different ways:
precision
crate is wrapping a native "look at the clock" instruction, like RDTSC
on x86perf-event
crate is using perf
's CPU_CYCLES
event, which may be doing something else (maybe instead we should be using REF_CPU_CYCLES
?)So it may be that there is some irreconcilable difference here (which would be nice to discover!) but that Sightglass may not be able to solve. The good thing is that Sightglass accepts multiple measurement types so if the ones you mention are not cutting it for this cross-architecture benchmarking, you could add another. Others in the past have mentioned that cycles
is not the best measurement to have as a default; perhaps for what you're doing here you want to look at adding perf
's task-clock
event or even adding a new wall-clock
measurement (cc: @jlb6740, @fitzgen).
In any case, thanks for looking into this!
In fact, for my purpose, cycle count is excellent as it makes it possible to compare different architectures without being too concerned with clock frequency. A cycle is a cycle.
My next step would be, as you point out, to validate precision and perf_event crates against each other. It is indeed not related to sightglass per se.
FWIW, I trust perf_event_open
much more than the crate to read rdtsc
. Unfortunately, it is Linux-only, or else I would advocate for making it the default (and even remove support for the other crate).
FWIW, I would be fine with adding a std::time
-based Measure
impl as well.
Finally, I'm not sure what your end goal here is @matsbror, but FWIW sightglass has mainly been curated for Wasmtime/Cranelift hackers to see whether certain PRs/optimization passes/etc improve or regress performance. It probably is not as well suited to other use cases, if only because it hasn't ever been used or evaluated for other use cases as much.
I've been reading up a little about what the Timestamp counter on x86_64 processors really does. As far as I can tell it's a processor-wide counter running on a steady-clock rate. What that rate is, I have not been able to find out.
So I did an experiment setting the clock frequency to 5GHz and 500 MHz, respectively. If my conjecture is right, the cpu_cycles should stay the same, but the TSC reading should be 10 times higher with the slower core clock cycle.
5 GHz (running blake3-scalar benchmark, Execution phase)
execution with perf_event cycles
benchmarks/blake3-scalar/benchmark.wasm
cpu-cycles
[295336 299386.80 318583] engines/wasmtime/libengine.so
execution with TSC-reading
benchmarks/blake3-scalar/benchmark.wasm
cycles
[173548 175773.80 186944] engines/wasmtime/libengine.so
500 MHz (running blake3-scalar benchmark, Execution phase)
execution perf_event reading
benchmarks/blake3-scalar/benchmark.wasm
cpu-cycles
[301793 302848.10 303616] engines/wasmtime/libengine.so
execution with TSC reading
benchmarks/blake3-scalar/benchmark.wasm
cycles
[1748532 1826605.60 1999720] engines/wasmtime/libengine.so
Indeed my conjecture seems right. The cpu cycles are about 300 k in both cases but 175 k in the fast clock speed and 1.8 M in the slow clock speed.
That resolves the issue for my part because now I understand that the precision
crate cannot be used to compare architectures as what is measured is different on each architecture. It's, however, perfect for sightglass' purpose because it tracks time accurately.
Thanks for doing that experiment! I should probably refer to this somewhere in some documentation to preempt questions about this type of thing in the future... but I don't yet know where to put it. So you're going to end up using perf
's cycles then?
I have run most sightglass benchmarks on x86_64 (on laptop), on ARM65 (Nvidia Jetson board) and on RISCV64 (VIsionFive64 board) and find that the cycle-times are so different. The number of cycles for all phases are 1-3 orders of magnitude larger for x86_64 than for the other architectures. See figure below (note log scale on y-axis).
I wonder if there is a difference in how cycles are measured for the three different architectures. I will delve into the source code, but if someone has a ready made answer it would be great to take part of it.