dart-lang / benchmark_harness

The official benchmark harness for Dart
https://pub.dev/packages/benchmark_harness
BSD 3-Clause "New" or "Revised" License

Fix harness to not call timer repeatedly in the measured loop. #38

Closed leafpetersen closed 2 years ago

leafpetersen commented 5 years ago

The timer is probably fast enough not to matter on most platforms that you'll encounter, but why take chances with your health?

cc @mraleph @munificent

leafpetersen commented 5 years ago

cc @lrhn

mraleph commented 5 years ago

In general we assume that exercise is more expensive than accessing the time. This can be tuned on a per-benchmark basis. I agree this is suboptimal.
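
One straightforward way to do that per-benchmark tuning is to override exercise so that a single Stopwatch read is amortized over many run() calls. A minimal sketch, assuming the standard BenchmarkBase API; the benchmark class, workload, and repeat count below are illustrative, not code from this PR:

    import 'package:benchmark_harness/benchmark_harness.dart';

    // Illustrative benchmark: the repeat count in exercise() is chosen so that
    // one exercise() call dwarfs a single Stopwatch read.
    class ListCopyBenchmark extends BenchmarkBase {
      ListCopyBenchmark() : super('ListCopy');

      final List<int> source = List<int>.generate(1000, (i) => i);

      // The unit of work being measured.
      @override
      void run() {
        final copy = List<int>.of(source);
        if (copy.length != source.length) throw StateError('unreachable');
      }

      // Per-benchmark tuning: repeating run() hides the timer cost, but note
      // that it also scales the reported number, which is per exercise() call.
      @override
      void exercise() {
        for (var i = 0; i < 1000; i++) {
          run();
        }
      }
    }

    void main() {
      ListCopyBenchmark().report();
    }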

Note that the change made here makes benchmark_harness incompatible with the same code living inside the Golem repo. We might want to keep them aligned if possible, but changing the code in the Golem repo would invalidate benchmark results that have already been accumulated.

lrhn commented 5 years ago

I'll admit that reading elapsedMilliseconds in every loop iteration is exactly what I do all the time. If exercise is so fast that it makes a difference, I'd add a repeat loop inside exercise. Maybe let exercise return an int which is the "number of operations", instead of counting one call as one operation.

You could use the warm-up phase to figure out how fast exercise is, and then repeat calls to exercise a number of times between checks of the elapsed milliseconds, to ensure that it takes at least ~50 ms per round.
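
A rough sketch of that calibration idea (illustrative only, assuming an exercise that returns its operation count; measureCalibrated and the 50 ms target are made up for the example, not code from this package):

    // Calibrate during warm-up, then read the clock only once per batch.
    double measureCalibrated(int Function() exercise,
        {int minimumBatchMillis = 50}) {
      final watch = Stopwatch()..start();

      // Calibration: double the batch size until one batch takes >= ~50 ms.
      var batchSize = 1;
      while (true) {
        watch.reset();
        for (var i = 0; i < batchSize; i++) {
          exercise();
        }
        if (watch.elapsedMilliseconds >= minimumBatchMillis) break;
        batchSize *= 2;
      }

      // Measurement: time a few batches, consulting the clock once per batch.
      var totalOps = 0;
      var totalMicros = 0;
      for (var round = 0; round < 10; round++) {
        watch.reset();
        for (var i = 0; i < batchSize; i++) {
          totalOps += exercise();
        }
        totalMicros += watch.elapsedMicroseconds;
      }
      return totalMicros / totalOps; // microseconds per operation
    }

    void main() {
      var counter = 0;
      final usPerOp = measureCalibrated(() {
        for (var i = 0; i < 100; i++) {
          counter++;
        }
        return 100; // this exercise performed 100 operations
      });
      print('$usPerOp us/op ($counter operations in total)');
    }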

(I would also never use a package just to write a benchmark, but that's probably because I'm not writing a package to begin with, so I'd have to create one first.)

leafpetersen commented 5 years ago

Note that the change made here makes benchmark_harness incompatible with the same code living inside the Golem repo. We might want to keep them aligned if possible, but changing the code in the Golem repo would invalidate benchmark results that have already been accumulated.

Then fix Golem!!!! Seriously, that's the worst reason I've ever heard. "The data set is bad, but if we fix it, it becomes inconsistent, so we're going to keep collecting bad data....". WAT?

This package is being advanced as best practices for benchmarking, and it's broken. Yes, it's true that on most platforms the underlying hardware provides fast (O(cycles)) access to timer data, and the compiler will inline the virtual instance call to the stopwatch methods. But why ship a package that relies on both of those if you don't have to? What if someone wants to do something crazy like benchmark on an interpreter? We're picking up new platforms left and right; are you 100% sure that none of those will ever require software assistance (maybe even a syscall) to get timing data? Why ship something that's just going to silently give bogus data for those cases when there are numerous straightforward ways to do better?

mraleph commented 5 years ago

Then fix Golem!!!! Seriously, that's the worst reason I've ever heard. "The data set is bad, but if we fix it, it becomes inconsistent, so we're going to keep collecting bad data....". WAT?

If the benchmark is written the right way, the collected data is not bad. As I mentioned before, one just needs to make sure that exercise takes long enough to hide the cost of consulting the timer.

I do agree that it is suboptimal, error-prone, and not documented anywhere, and yes, it would be nice to have this fixed.

However, we also don't have the capacity to backfill years of accumulated benchmarking data, so we need to make some sort of decision:

That's the perspective I was trying to communicate. Sorry if it came across as unappreciative of your effort to fix the harness.

leafpetersen commented 5 years ago

If the benchmark is written the right way, the collected data is not bad. As I mentioned before, one just needs to make sure that exercise takes long enough to hide the cost of consulting the timer.

Totally agree, for handcrafted benchmarks written by someone who understands all the issues, and is aware of the limitations of the platforms they are running on. But that's not what this package is for, right?

However, we also don't have the capacity to backfill years of accumulated benchmarking data, so we need to make some sort of decision:

Surely you must already have to deal with discontinuities in the data? What happens when we upgrade V8? Or buy new machines? Or install a new version of the OS, or...? I guess I don't really see how you can expect to both keep everything frozen so you get comparable data, and get good data.

That's the perspective I was trying to communicate. Sorry if it came across as unappreciative of your effort to fix the harness.

Oh, no problem. I mean, I don't have to use the package, so... whatever :) And sorry if I came across strongly. But I do think this should be fixed.... :)

leafpetersen commented 5 years ago

Ok, just to make this point concretely, I took the benchmark that @munificent wrote and added a second call to elapsedMilliseconds to every iteration (changing it from doing one call to two). Running this on my Mac, using the AOT compiler, shows a 30% slowdown for adding the second call on the smallest size of his benchmark, decreasing slowly to essentially zero overhead at the largest size. This suggests that the existing call to elapsedMilliseconds adds approximately 30% to the measured time at the small sizes. It is also possible that adding the second call to elapsedMilliseconds disables inlining of the first, but that is also consistent with my point: you cannot rely on this call being inlined and fast. I would claim that if two different Dart team members can get bitten by this, then this is a problem.
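
For context, the measurement pattern being modified looks roughly like this (a paraphrase of the loop shape, not the exact benchmark or harness source; the function name is illustrative):

    // The clock is read on every iteration of the measured loop, so the cost
    // of that read is folded into the reported rate.
    double spreadsPerMillisecond(void Function() run, int minimumMillis) {
      final watch = Stopwatch()..start();
      var iterations = 0;
      while (watch.elapsedMilliseconds < minimumMillis) {
        run();
        // The experiment adds a second elapsed-time read here (another
        // watch.elapsedMilliseconds), going from one read per iteration to
        // two, which produced the overhead numbers below.
        iterations++;
      }
      return iterations / watch.elapsedMilliseconds;
    }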

I've uploaded my patch as a branch in case anyone wants to check my work (worth doing, this was a really quick hack).

Numbers are below. I hacked the output to include a % overhead, computed as the difference between the two rates as a percentage of the rate of the version with one call.

   0 iterate            9036.14 spreads/ms      32.01% overhead                 --------------------
   0 List for          10225.36 spreads/ms      32.71% overhead  1.13x baseline ======================
   0 resize and set     9415.40 spreads/ms      31.92% overhead  1.04x baseline ====================
   0 addAll()           7779.46 spreads/ms      28.66% overhead  0.86x baseline =================
   0 forEach()          8729.16 spreads/ms      28.69% overhead  0.97x baseline ===================

   1 iterate            7170.16 spreads/ms      26.77% overhead                 --------------------
   1 List for           8993.50 spreads/ms      30.33% overhead  1.25x baseline =========================
   1 resize and set     7153.62 spreads/ms      26.19% overhead  1.00x baseline ===================
   1 addAll()           6439.56 spreads/ms      24.14% overhead  0.90x baseline =================
   1 forEach()          7604.76 spreads/ms      28.01% overhead  1.06x baseline =====================

   2 iterate            6095.72 spreads/ms      23.46% overhead                 --------------------
   2 List for           8115.98 spreads/ms      28.47% overhead  1.33x baseline ==========================
   2 resize and set     6211.76 spreads/ms      23.31% overhead  1.02x baseline ====================
   2 addAll()           5769.72 spreads/ms      22.14% overhead  0.95x baseline ==================
   2 forEach()          6699.96 spreads/ms      24.81% overhead  1.10x baseline =====================

   5 iterate            3394.40 spreads/ms      14.44% overhead                 --------------------
   5 List for           4977.94 spreads/ms      19.88% overhead  1.47x baseline =============================
   5 resize and set     4969.28 spreads/ms      19.31% overhead  1.46x baseline =============================
   5 addAll()           3648.06 spreads/ms      16.29% overhead  1.07x baseline =====================
   5 forEach()          4121.58 spreads/ms      16.48% overhead  1.21x baseline ========================

  10 iterate            2253.42 spreads/ms      10.12% overhead                 --------------------
  10 List for           3805.74 spreads/ms      15.99% overhead  1.69x baseline =================================
  10 resize and set     3197.16 spreads/ms      13.54% overhead  1.42x baseline ============================
  10 addAll()           3183.38 spreads/ms      14.12% overhead  1.41x baseline ============================
  10 forEach()          3009.68 spreads/ms      14.79% overhead  1.34x baseline ==========================

  20 iterate            1188.58 spreads/ms       4.32% overhead                 --------------------
  20 List for           1992.42 spreads/ms       8.94% overhead  1.68x baseline =================================
  20 resize and set     2099.58 spreads/ms       9.68% overhead  1.77x baseline ===================================
  20 addAll()           2051.76 spreads/ms      11.14% overhead  1.73x baseline ==================================
  20 forEach()          1604.06 spreads/ms       7.30% overhead  1.35x baseline ==========================

  50 iterate             555.70 spreads/ms       2.27% overhead                 --------------------
  50 List for            925.52 spreads/ms       3.12% overhead  1.67x baseline =================================
  50 resize and set     1040.44 spreads/ms       4.57% overhead  1.87x baseline =====================================
  50 addAll()            982.38 spreads/ms       3.98% overhead  1.77x baseline ===================================
  50 forEach()           766.18 spreads/ms       5.14% overhead  1.38x baseline ===========================

 100 iterate             291.92 spreads/ms       1.55% overhead                 --------------------
 100 List for            494.50 spreads/ms       2.31% overhead  1.69x baseline =================================
 100 resize and set      555.18 spreads/ms       5.23% overhead  1.90x baseline ======================================
 100 addAll()            535.92 spreads/ms       3.69% overhead  1.84x baseline ====================================
 100 forEach()           400.38 spreads/ms       1.50% overhead  1.37x baseline ===========================

1000 iterate              32.18 spreads/ms      -0.87% overhead                 --------------------
1000 List for             57.10 spreads/ms       0.35% overhead  1.77x baseline ===================================
1000 resize and set       61.70 spreads/ms       0.39% overhead  1.92x baseline ======================================
1000 addAll()             57.50 spreads/ms      -0.94% overhead  1.79x baseline ===================================
1000 forEach()            45.38 spreads/ms      -0.13% overhead  1.41x baseline ============================

lrhn commented 5 years ago

There is a fairly simple thing you can do to recognize fast-running run methods. I'd probably prefer a more thorough rewrite, but this seems to work: https://github.com/lrhn/benchmark_harness/compare/master...lrhn:short-run

leafpetersen commented 5 years ago

If the benchmark is written the right way, the collected data is not bad

@mraleph let me try this again, see if I can be clearer. If the benchmark is written the right way, the collected data is not bad and this change should not change the data. If the benchmark is not written the right way, then the collected data is bad, the change will change the data... and that's good. So I'm not seeing the argument for not fixing.

mraleph commented 5 years ago

I don't really disagree that this needs to be fixed.

kevmoo commented 5 years ago

@leafpetersen – want to run dartfmt -w .? Then we should be good.

mateusfccp commented 2 years ago

Has this PR been abandoned?

kevmoo commented 2 years ago

Has this PR been abandoned?

I'd guess....yes.

sortie commented 2 years ago

I appreciate the problem and the solutions proposed in this pull request. It is a genuine problem for our benchmarks that the run() method is not appropriately tuned, which is difficult to do in a way that is expensive enough for fast developer machines yet not too expensive for our slower embedded devices. Improperly tuned benchmarks are the root cause of jumpy and noisy results.

The solution requires a data-driven approach. Right now benchmarks warm up for 100 ms and run for 2000 ms, and the exercise method calls the run method 10 times to avoid timer overhead, which causes the benchmarks to report a number that's off by 10x. These parameters have not been tuned in many years and were not picked on the basis of real data.
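
For reference, those defaults correspond to roughly the following shape (a paraphrase for illustration, not the exact harness source):

    // Warm up for ~100 ms, measure for ~2000 ms; by default exercise() calls
    // run() 10 times, so the reported time per "iteration" is the time for
    // 10 calls to run().
    class HarnessDefaultsSketch {
      void run() {}

      void warmup() => run();

      void exercise() {
        for (var i = 0; i < 10; i++) {
          run();
        }
      }

      static double measureFor(void Function() f, int minimumMillis) {
        final minimumMicros = minimumMillis * 1000;
        final watch = Stopwatch()..start();
        var iterations = 0;
        var elapsed = 0;
        while (elapsed < minimumMicros) {
          f();
          elapsed = watch.elapsedMicroseconds; // the per-iteration timer read
          iterations++;
        }
        return elapsed / iterations;
      }

      double measure() {
        measureFor(warmup, 100); // warm-up phase: 100 ms
        return measureFor(exercise, 2000); // measured phase: 2000 ms
      }
    }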

A couple of years ago, I did some experiments where I measured the time to run each iteration and saw many benchmarks that were not properly warmed up or had jumpy performance. Now that we've unified on package:benchmark_harness, the next step is to add this profiling support so we can identify benchmarks that aren't properly tuned and fix them. We'll likely also get valuable data to help tune the virtual machine when performance does not reach a steady state.

Once we have that data, we'll be able to put together a proper solution to the root problem this PR aims to improve. I'm not opposed to this PR, but I would very much prefer an experiment on the actual production benchmarking system that confirms whether it gives more accurate benchmark data. For example, the proposed code has a cutoff where, if a benchmark is just slow enough, it might run for twice as many iterations, which could result in a very jumpy benchmark if it sits close to that threshold. This is already a real problem in practice for benchmarks where the run method is massively too expensive on certain platforms and the exercise method runs only a couple of times, or even once.

leafpetersen commented 2 years ago

I don't have strong opinions about how this gets fixed. I do continue to be of the opinion (supported, I believe, by the data I presented above) that calling timers repeatedly inside of the benchmark loop is very bad practice, and should not be our recommended solution.

osa1 commented 2 years ago

I did some experiments where I measured the time to run each iteration and saw many benchmarks that were not properly warmed up or had jumpy performance. Now that we've unified on package:benchmark_harness, the next step is to add this profiling support so we can identify benchmarks that aren't properly tuned and fix them. We'll likely also get valuable data to help tune the virtual machine when performance does not reach a steady state.

Once we have that data, we'll be able to make a proper solution to the root problem this PR aims to improve

This PR fixes an obvious problem with the benchmark runner, which is that it includes timer calls in the thing being benchmarked. Fixing this should only improve things, because (1) we no longer include unrelated computation in the reported results, and (2) it's unclear what kind of impact system calls like clock_gettime on Linux and mach_timebase_info on macOS have on benchmark results, so it's a good idea to avoid them. For example, will the context switch for these calls require another warm-up round?

So the problem of outliers or unstable benchmark results will potentially be improved by this PR. If we still want the noise that elapsedMicroseconds (clock_gettime, mach_timebase_info, ...) adds to the benchmarks, we can always have benchmarks specifically for that, with elapsedMicroseconds calls added to the code being benchmarked. It doesn't make sense to report invalid results to users just because we can somehow use noisy results to improve VM performance, Golem, etc.
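
One common way to keep the timer out of the measured region, shown as a sketch (not necessarily the exact change in this PR; microsPerCall and the batch size are illustrative):

    // Pick a batch size first, then read the clock only around the whole
    // batch, so no timer call sits inside the measured work.
    double microsPerCall(void Function() run, int batchSize) {
      final watch = Stopwatch()..start();
      for (var i = 0; i < batchSize; i++) {
        run();
      }
      watch.stop();
      return watch.elapsedMicroseconds / batchSize;
    }

    void main() {
      // Hypothetical workload; with a large batch the two Stopwatch calls
      // above are negligible compared to the measured region.
      var sink = 0;
      final perCall = microsPerCall(() => sink++, 100000);
      print('$perCall us per call (sink: $sink)');
    }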

mraleph commented 2 years ago

I think we should go ahead and land this and roll it into Golem.

sortie commented 2 years ago

This CL in its current state looks like it can run the benchmark for almost twice as long as desired, which is an expensive increase in load.

I'm generally not happy about making ad-hoc changes to the harness without doing a qualitative analysis of the huge data set we have available. We can prioritize doing this work; Bill and I have a lot of ideas, but other work has taken priority.

mraleph commented 2 years ago

@sortie As a consumer of benchmark results, I would prefer to get accurate data even if I have to wait longer for results. I think we should be pragmatic here: land this, roll it into Golem, and observe the metrics. If we discover that benchmark result latency has become unacceptably high, then we can easily roll the change back and figure out alternative approaches.

mraleph commented 2 years ago

@mkustermann could you take another quick look? I am going to merge after that.

mraleph commented 2 years ago

I have checked a few simple benchmarks and runtime is not significantly affected by the change (e.g. I have not seen anything like a 2x slowdown), so overall the change looks safe to land.