anza-xyz / agave

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://www.anza.xyz/
Apache License 2.0

Measure builtin instruction performance #3364

Open tao-stones opened 3 weeks ago

tao-stones commented 3 weeks ago

SIMD-198 calls for assigning a static execution cost (in CUs) to each builtin instruction, instead of the current per-builtin-program costs. This would improve CU allocation and therefore block packing quality.

| Instruction | Issue/PR | Status |
| --- | --- | --- |
| System | https://github.com/anza-xyz/agave/pull/3567 | merged |
| Stake | https://github.com/anza-xyz/agave/pull/3396 | open |
| Loader | https://github.com/anza-xyz/agave/pull/3659 | open |
| Vote | | |
| ALT | | |
| ComputeBudget | https://github.com/anza-xyz/agave/pull/3614 | merged |
tao-stones commented 3 weeks ago

Looking for the most efficient and reliable way of gathering builtin instruction CU / performance data. I'm aware of a few options.

I'll start with a bench first. I initially tried constructing a simple transaction that contains only the data needed for the target instruction, for example StakeInstruction::Initialize, then pushing the transaction through a bank for a full-scale load_execute_and_commit. This could work, but it adds too much overhead to the bench results. I have since settled on using invoke_context::mock_process_instruction() in the bencher.
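For context, a rough sketch of what a criterion bench built around mock_process_instruction() could look like; `run_stake_initialize` is a hypothetical helper for illustration, not code from the linked PRs:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical helper: builds the minimal accounts and instruction data for
// StakeInstruction::Initialize and runs it through
// invoke_context::mock_process_instruction(). The body is omitted here; the
// real setup lives in the benches behind the PRs listed above.
fn run_stake_initialize() {
    // ... construct transaction accounts, instruction data, expected result,
    // then call mock_process_instruction(...)
}

fn bench_stake_initialize(c: &mut Criterion) {
    c.bench_function("stakes/initialize", |b| b.iter(run_stake_initialize));
}

criterion_group!(benches, bench_stake_initialize);
criterion_main!(benches);
```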

apfitzge commented 3 weeks ago

@tao-stones could we write benches for each instruction, on some typical x86 machines, and count instructions via something like criterion-perf-events?

Then just use that number as approximate CUs? That seems like a very consistent way to do it instead of benchmarking time -> convert to CU via some previous constant of cu/us.

wdyt?
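For illustration, a minimal sketch of wiring criterion-perf-events into a criterion bench so it reports retired instruction counts instead of wall time (Linux only, requires perf_event access); this assumes the criterion-perf-events / perfcnt API as documented in the crate's README:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use criterion_perf_events::Perf;
use perfcnt::linux::{HardwareEventType, PerfCounterBuilderLinux};

fn bench_builtin(c: &mut Criterion<Perf>) {
    c.bench_function("stakes/initialize", |b| {
        // run the target builtin here (e.g. via the hypothetical
        // run_stake_initialize helper sketched earlier)
        b.iter(|| {})
    });
}

criterion_group!(
    name = benches;
    // count retired instructions via the perf_event subsystem
    config = Criterion::default().with_measurement(Perf::new(
        PerfCounterBuilderLinux::from_hardware_event(HardwareEventType::Instructions)
    ));
    targets = bench_builtin
);
criterion_main!(benches);
```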

tao-stones commented 3 weeks ago

Benching is the way to go; I'm already trying it out on Stake. Counting the number of instructions is a great way to get consistent measurements (assuming an execution path is chosen). @ksolana brought up similar ideas as well; I think it's worth trying. Also, TIL criterion-perf-events.

However, doing so will put builtin costs in a different realm while the rest remain in us/CU conversion land. So instead of builtin costs being inaccurate but stable relative to other instructions, they will be more accurate but not on the same scale as the rest. Would that cause concern? (I need to think about it.)

apfitzge commented 3 weeks ago

> However, doing so will put builtin costs in a different realm while the rest remain in us/CU conversion land.

CUs used to just be BPF instructions. I think it's somewhat reasonable to translate one "typical x86 instruction" to 1 CU. It's an overestimate, since there are obvious performance differences between native and emulated execution.
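To make the scale concern concrete, an illustrative comparison of the two conversions being discussed; the cu/us ratio below is a placeholder, not the actual constant used by the cost model:

```rust
// Placeholder for whatever cu/us ratio the cost model currently uses.
const CU_PER_US: u64 = 30;

// Today's approach: bench wall time, then convert microseconds to CUs.
fn cu_from_bench_time(us: u64) -> u64 {
    us * CU_PER_US
}

// Proposed approach: count retired x86 instructions and treat each as 1 CU,
// which keeps builtins on an instruction-count scale (like BPF programs),
// but a different scale than the time-derived costs above.
fn cu_from_instruction_count(x86_instructions: u64) -> u64 {
    x86_instructions
}
```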

ksolana commented 3 weeks ago

> @tao-stones could we write benches for each instruction, on some typical x86 machines, and count instructions via something like criterion-perf-events?
>
> Then just use that number as approximate CUs? That seems like a very consistent way to do it instead of benchmarking time -> convert to CU via some previous constant of cu/us.
>
> wdyt?

Yeah, this is exactly what I was referring to during yesterday's conversation. We can hook up valgrind to get a 'trace' of instruction counts. There might be other ways to get it, but since this is a one-off thing, using valgrind isn't a bad idea IMO. A bit tricky, but once implemented it scales to all the instructions.

tao-stones commented 3 weeks ago

Tried criterion-perf-events on StakeInstruction::Initialize, with the measurement set to Instructions; it prints:

```
Gnuplot not found, using plotters backend
stakes/initialize       time:   [125501.2873 cycles 125501.9226 cycles 125502.5530 cycles]
                        change: [-0.0002% +0.0011% +0.0024%] (p = 0.12 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
```

Not sure how to read this; are the "cycles" actually the count of instructions?

apfitzge commented 3 weeks ago

Hmm, looking at that output, I think I linked the wrong library.

This is what I was thinking of: https://github.com/bheisler/iai

https://bheisler.github.io/criterion.rs/book/iai/getting_started.html

ksolana commented 3 weeks ago

Ah nice! It already uses cachegrind (which is part of valgrind). Simplifies our work 🔥

tao-stones commented 3 weeks ago

IIRC iai cannot exclude setup code; if true, that would be a deal breaker for this use case.

ksolana commented 3 weeks ago

Can we compute the setup cost and subtract it from every calculation?

tao-stones commented 3 weeks ago

That's doable, though then the project becomes more of a one-time effort. Still worth trying. Maybe we can have both criterion benchmarks and iai benches, plus something that automatically subtracts the setup cost from the iai results.
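A minimal sketch of that two-bench approach, assuming iai's standard macro-based API: one bench measures setup only, the other setup plus execution, and the subtraction is done afterwards on the reported instruction counts. The bench bodies are illustrative stand-ins:

```rust
use iai::black_box;

// Setup only: build the accounts and instruction data for the target builtin.
fn bench_setup_initialize() {
    // hypothetical setup; pass the result through black_box so it isn't optimized away
    black_box(());
}

// Setup plus actually processing the instruction (e.g. via mock_process_instruction).
fn bench_initialize() {
    black_box(());
}

// iai runs each function under cachegrind and reports instruction counts;
// the setup cost would be subtracted from the full run in post-processing.
iai::main!(bench_setup_initialize, bench_initialize);
```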

tao-stones commented 3 weeks ago

Also, do you guys know if iai and valgrind are stable on MacOS?

tao-stones commented 3 weeks ago

Got iai working on Linux; a result looks like this:

iai bench StakeInitialize:

```
Running benches/stake_iai.rs (target/release/deps/stake_iai-680ef0d7fd289c61)
bench_setup_initialize
  Instructions:              159403
  L1 Accesses:               202150
  L2 Accesses:                  255
  RAM Accesses:                 929
  Estimated Cycles:          235940

bench_initialize
  Instructions:              290080
  L1 Accesses:               369241
  L2 Accesses:                  630
  RAM Accesses:                1991
  Estimated Cycles:          442076
```

Note: simply subtracting these two Instructions counts doesn't give the correct answer, because there are more "inner" pre-execution steps (e.g. invoke context and accounts priming) and post-execution steps.

ksolana commented 3 weeks ago

> Also, do you guys know if iai and valgrind are stable on MacOS?

It's a Linux thing. macOS stopped supporting valgrind in a reasonable way. And for these computations we want to measure x86 instruction counts anyway.

tao-stones commented 5 days ago

@KirillLykov @ksolana just FYI, separated re-pricing builtin instructions into SIMD-198 and updated this issue's description.