Some discussion from Slack:

@topolarity: It is beneficial to be able to inline the LinuxPerf calls, since they can be implemented in just 1-2 instructions (although they are a syscall)
@topolarity: The current implementation should already be inlined by the compiler (at least in my proposed PR - I'm not 100% sure whether the same is true for the generalized version)
@willow-ahrens: I see, so the idea is to inline the LinuxPerf call to toggle instruction counting on and off, without introducing any additional overhead from function calls, stack manipulation, etc.
@topolarity: Yeah, exactly
@willow-ahrens: Does Julia have any performance-counting infrastructure beyond LinuxPerf that we should be aware of?
@topolarity: LinuxPerf also needs a setup and teardown, so that you guarantee you don't leak any PerfGroup objects (the PMU has limited resources, so we only want to ask it to schedule the specific measurements we need, or else it will start dropping samples)
@topolarity: Which is also why PMU-derived measurements generally need some kind of cooperation from the kernel (perf in the Linux case, or a custom driver in the Windows case - VTune is probably the most common example)
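For concreteness, here's a rough sketch of what one sample could look like with that setup/enable/disable/teardown handled by LinuxPerf.jl's `@pstats` macro (which, as I understand it, opens the requested event groups, toggles the counters on and off around the block, and releases them afterwards so nothing leaks). The event list and the loop count are just illustrative:

```julia
using LinuxPerf  # Linux-only

f() = sum(rand(1000))          # stand-in for the benchmark kernel

# One "sample": the eval loop wrapped in perf counting. `@pstats` handles the
# counter setup before the block, enables/disables the counters around it, and
# closes the event group afterwards, so no PMU resources are leaked.
stats = @pstats "cpu-cycles,instructions,branch-misses" begin
    for _ in 1:1000            # eval loop, as a BenchmarkTools sampler would run it
        f()
    end
end
```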
@gbaraldi I remember you being interested in this in the past.
A few questions remain for me; I've collected them at the end of this issue.
@vchuravy says: It's late on a Friday here so I won't follow the discussion until Monday. One of the questions is how platform-specific we want to be, and how willing we are to make big changes. Another is what to measure: cycles vs. time, and cpu-time vs. wall-time (see https://github.com/JuliaCI/BenchmarkTools.jl/pull/94 and https://github.com/JuliaCI/BenchmarkTools.jl/pull/92). Right now BenchmarkTools measures wall-time, which is unreliable but interpretable. Something like cycles (either through LinuxPerf or simply the TSC) is more reliable, but harder to interpret (darn chips clocking down under heat, IPC...). Other tools measure FLOP/s or bytes/s (e.g. LIKWID, LinuxPerf). So maybe BenchmarkTools ought to provide a "specification" (e.g. @benchmarkable) and then different tools could provide executors that measure different things.
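For reference, the "simply TSC" option could look roughly like this (x86_64-only; `rdtsc` and `tsc_ticks` are illustrative names, not an existing API). One caveat: on most modern chips the TSC ticks at a constant reference rate, so it doesn't see the core clocking down; true cycle counts need the PMU:

```julia
# Read the x86 time-stamp counter via the LLVM intrinsic (x86_64 only).
rdtsc() = ccall("llvm.x86.rdtsc", llvmcall, UInt64, ())

# TSC ticks elapsed while running f once.
function tsc_ticks(f)
    start = rdtsc()
    f()
    return rdtsc() - start
end

# For comparison: the wall-time measurement BenchmarkTools uses today.
wall_ns(f) = (t0 = time_ns(); f(); time_ns() - t0)

tsc_ticks(() -> sum(rand(1000))), wall_ns(() -> sum(rand(1000)))
```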
It's starting to seem to me that BenchmarkTools really ought to define separate "samplers" which can measure different metrics using different tools and experiment loops, and provide infrastructure to run different samplers across suites of benchmarks.
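To make that concrete, here is a hypothetical sketch of such a sampler abstraction; none of these names exist in BenchmarkTools today, and a real design would need proper setup/teardown, parameters, and result types:

```julia
abstract type AbstractSampler end

# Wall-clock sampler: roughly what BenchmarkTools does today, stripped way down.
struct WallTimeSampler <: AbstractSampler
    evals::Int
end

function sample(s::WallTimeSampler, f)
    t0 = time_ns()
    for _ in 1:s.evals
        f()
    end
    return (; wall_ns = (time_ns() - t0) / s.evals)
end

# A LinuxPerf-based sampler would add its own `sample` method with its own
# setup/enable/disable/teardown, without touching the suite infrastructure below.

# Running a suite of benchmarks under any sampler is then generic.
run_suite(sampler::AbstractSampler, suite) =
    Dict(name => sample(sampler, f) for (name, f) in suite)

run_suite(WallTimeSampler(1000), Dict("sum" => () -> sum(rand(100))))
```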
@vchuravy I think we should probably move forward with a short-term, straightforward LinuxPerf PR like https://github.com/JuliaCI/BenchmarkTools.jl/pull/375 (assuming we can get a few reviews on it). We would mark the feature as experimental so we can make breaking changes to it. Later, we can work towards a BenchmarkTools interface which allows for more ergonomic custom benchmarking extensions (with @benchmark defining the function to be measured, and a separate "executor" or "sampler" interface which runs an experiment on the function). The redesign would be a good opportunity to fix https://github.com/JuliaCI/BenchmarkTools.jl/issues/339, and perhaps allow choosing whether or not to measure GC time, etc.
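From the user's side, that split might look something like this; only `@benchmarkable`, `tune!`, and `run` exist today, and the perf/profile executors are purely hypothetical:

```julia
using BenchmarkTools

# The specification: @benchmarkable builds a Benchmark object without running it.
b = @benchmarkable sort(x) setup=(x=rand(1000))

# Today's executor: the built-in wall-time experiment loop.
tune!(b)
wall_results = run(b)

# Hypothetical future executors (not a real API): the same specification handed
# to samplers that run different experiments.
# perf_results    = run(PerfSampler("cpu-cycles,instructions"), b)
# profile_results = run(ProfileSampler(), b)
```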
This issue is to document various PRs surrounding LinuxPerf and other extensible benchmarking in BenchmarkTools. I've seen many great approaches, with various differences in semantics and interfaces. It seems that https://github.com/JuliaCI/BenchmarkTools.jl/pull/375 profiles each eval loop (toggling on and off with a boolean), https://github.com/JuliaCI/BenchmarkTools.jl/pull/347 is a generalized version of the same (unclear whether it can generalize to more than one extension at a time, such as profiling and perfing), and https://github.com/JuliaCI/BenchmarkTools.jl/pull/325 only perfs a single execution.

I recognize that different experiments require different setups. A sampling profiler requires a warmup and a minimum runtime, but probably doesn't need fancy tuning. A wall-clock time benchmark requires a warmup and a fancy eval loop where the evaluations are tuned, and maybe a GC scrub (see the sketch at the end of this post for one way those requirements could be expressed). What does LinuxPerf actually need? Are there any other experiments we want to run (other than LinuxPerf)? Do we need to use metaprogramming to inline the LinuxPerf calls, or are function calls wrapping the samplefunc sufficient here?

cc @vchuravy @DilumAluthge @topolarity @Zentrik
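As promised above, here's a tiny standalone sketch (all names made up, independent of the sampler sketch in the earlier comment) of how per-experiment requirements could be expressed so a common driver knows which setup steps each sampler needs:

```julia
abstract type Sampler end
struct ProfileSampler  <: Sampler end   # sampling profiler
struct WallTimeSampler <: Sampler end   # wall-clock timing
struct PerfSampler     <: Sampler end   # LinuxPerf counters

needs_warmup(::Sampler)              = true    # everything wants a warmup
needs_eval_tuning(::Sampler)         = false
needs_eval_tuning(::WallTimeSampler) = true    # tuned eval loop for timer resolution
needs_gc_scrub(::Sampler)            = false
needs_gc_scrub(::WallTimeSampler)    = true
min_runtime(::Sampler)               = 0.0
min_runtime(::ProfileSampler)        = 1.0     # seconds, so enough profile samples accumulate
```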