tao-stones opened 3 weeks ago
Looking for the most efficient and reliable way of gathering builtin instructions' CU / performance data. I'm aware of a few options:
I'll start with benches first. I initially tried constructing a simple transaction containing only the data needed to target an instruction, for example `StakeInstruction::Initialize`, then pushing the transaction through a bank for a full-scale `load_execute_and_commit`. That could work, but it adds too much overhead to the bench results. I have since settled on using `invoke_context::mock_process_instruction()` in the bencher.
@tao-stones could we write benches for each instruction, on some typical x86 machines, and count instructions via something like criterion-perf-events?
Then just use that number as approximate CUs? That seems like a much more consistent way to do it than benchmarking time and converting to CUs via some previously derived cu/us constant.
wdyt?
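The tradeoff being proposed (hardware instruction counts vs. time-based conversion) can be sketched roughly as below. The ratio constant and function names are hypothetical placeholders for illustration, not the cost model's real values:

```rust
// Sketch of the two costing approaches discussed above. The conversion
// ratio is a made-up placeholder, not the cost model's real constant.
const ASSUMED_CU_PER_US: u64 = 30; // hypothetical CU-per-microsecond ratio

/// Status quo: benchmark wall time, then convert microseconds to CUs
/// via a previously derived cu/us constant.
fn cus_from_bench_time(elapsed_us: u64) -> u64 {
    elapsed_us * ASSUMED_CU_PER_US
}

/// Proposal: read a hardware counter and treat one retired x86
/// instruction as approximately one CU (an overestimate, as noted in
/// the thread, since native execution is faster than emulated BPF).
fn cus_from_instruction_count(retired_instructions: u64) -> u64 {
    retired_instructions
}

fn main() {
    // A run that took 12us under the time-based scheme:
    println!("time-based: {} CUs", cus_from_bench_time(12));
    // A run that retired ~125k instructions under the counter-based scheme:
    println!("count-based: {} CUs", cus_from_instruction_count(125_000));
}
```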
Benching is the way to go; already trying it out on Stake. Counting the number of instructions is a great way to get consistent measurements (assuming an execution path is chosen); @ksolana brought up similar ideas as well. I think it's worth trying. Also, TIL `criterion-perf-events`.
However, doing so will put builtin costs in a different realm, while the rest are still in us/CU conversion land. So instead of builtin costs being inaccurate but stable relative to other instructions, they will be more accurate but not on the same scale as the rest. Would that cause concern? (I need to think about it.)
> However, doing so will put builtin costs in a different realm, while the rest are still in us/CU conversion land.
CUs used to just be BPF instructions. I think it's somewhat reasonable to translate one "typical x86 instruction" to 1 CU. It's an overestimate, since there are obviously native-vs-emulated perf implications.
> @tao-stones could we write benches for each instruction, on some typical x86 machines, and count instructions via something like criterion-perf-events?
> Then just use that number as approximate CUs? That seems like a very consistent way to do it instead of benchmarking time -> convert to CU via some previous constant of cu/us.
> wdyt?
Yeah, this is exactly what I was referring to during yesterday's conversation. We can hook up valgrind to get a 'trace' of instruction counts. There might be other ways to get it, but since this is a one-off thing, using valgrind isn't a bad idea IMO. A bit tricky, but once implemented it is scalable to all the instructions.
Tried `criterion-perf-events` on `StakeInstruction::Initialize`, setting the measurement to `Instructions`; it prints:
```
Gnuplot not found, using plotters backend
stakes/initialize       time:   [125501.2873 cycles 125501.9226 cycles 125502.5530 cycles]
                        change: [-0.0002% +0.0011% +0.0024%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
```
Not sure how to read this: are the `cycles` the "count of instructions"?
Hmm, I think I linked the wrong library, looking at that output.
this is what i was thinking of: https://github.com/bheisler/iai
https://bheisler.github.io/criterion.rs/book/iai/getting_started.html
Ah nice! It already uses cachegrind (which is part of valgrind). Simplifies our work 🔥
IIRC `iai` cannot exclude setup code; if true, that is a deal breaker for this use case.
Can we compute the setup cost and subtract it from every measurement?
That's doable, though then the project is more of a one-time effort. Still worth trying. Maybe we can have both criterion benchmarks and `iai` benches; then we'd need something to automatically subtract the setup cost from the `iai` results.
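If both numbers are available, the subtraction could be a small post-processing pass like the sketch below. The function name and values are hypothetical; `saturating_sub` guards against a noisy baseline exceeding a small measurement:

```rust
/// Subtract a measured setup-only baseline from each bench's total
/// instruction count, yielding the (approximate) net cost per bench.
fn net_instruction_counts(totals: &[(&str, u64)], setup_baseline: u64) -> Vec<(String, u64)> {
    totals
        .iter()
        .map(|(name, total)| (name.to_string(), total.saturating_sub(setup_baseline)))
        .collect()
}

fn main() {
    // Made-up iai-style totals, each including the same setup cost.
    let raw = [("stake_initialize", 180_000u64), ("stake_authorize", 150_000u64)];
    let setup_only = 50_000; // instructions measured for a do-nothing bench
    for (name, net) in net_instruction_counts(&raw, setup_only) {
        println!("{name}: ~{net} net instructions");
    }
}
```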
Also, do you guys know if `iai` and `valgrind` are stable on macOS?
Got `iai` working on Linux; a result looks like this:
Note: simply subtracting these two `Instructions` counts doesn't give the correct answer, because there are more "inner" pre-execution steps (e.g. invoke-context and account priming) and post-execution steps.
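The pitfall can be illustrated with a toy decomposition (all numbers below are made up): even after subtracting a setup-only run, the pre- and post-execution work stays folded into the result:

```rust
fn main() {
    // total = setup + pre_execution + target + post_execution
    let setup = 50_000u64;  // harness / bench setup
    let pre = 20_000u64;    // "inner" pre-execution, e.g. invoke-context and account priming
    let target = 9_000u64;  // the instruction cost we actually want to price
    let post = 6_000u64;    // post-execution bookkeeping

    let full_bench = setup + pre + target + post;
    let setup_only_bench = setup;

    // Naive subtraction still overcounts by pre + post:
    assert_eq!(full_bench - setup_only_bench, pre + target + post);
    assert_ne!(full_bench - setup_only_bench, target);
    println!("naive net = {}", full_bench - setup_only_bench);
}
```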
> Also, do you guys know if iai and valgrind are stable on macOS?
It's a Linux thing; macOS stopped supporting valgrind in a reasonable way. And for these computations we want to measure x86 instruction counts anyway.
@KirillLykov @ksolana just FYI, separated re-pricing builtin instructions into SIMD-198, and updated this issue's description.
SIMD-198 calls for assigning a static execution cost (in CUs) to each builtin instruction, vs. the current per-builtin-program cost. It'd help improve CU allocation, and therefore block-packing quality.
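A rough sketch of the shape of that change is below. The enum variants and all CU values are placeholders for illustration, not SIMD-198's actual numbers:

```rust
#[derive(Debug, Clone, Copy)]
enum StakeInstructionKind {
    Initialize,
    Authorize,
    DelegateStake,
}

/// Per-instruction static cost: the SIMD-198 direction.
fn static_cost_cus(ix: StakeInstructionKind) -> u64 {
    match ix {
        StakeInstructionKind::Initialize => 1_500,    // placeholder value
        StakeInstructionKind::Authorize => 1_200,     // placeholder value
        StakeInstructionKind::DelegateStake => 2_100, // placeholder value
    }
}

/// Status quo: one flat cost for the whole builtin program,
/// regardless of which instruction variant is invoked.
const PER_PROGRAM_COST_CUS: u64 = 750; // placeholder value

fn main() {
    for ix in [StakeInstructionKind::Initialize, StakeInstructionKind::DelegateStake] {
        println!(
            "{:?}: per-instruction {} CUs vs flat {} CUs",
            ix,
            static_cost_cus(ix),
            PER_PROGRAM_COST_CUS
        );
    }
}
```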