Shopify / yjit-metrics

"Tasks for benchmarking, building and collecting stats for YJIT"
MIT License
14 stars 9 forks source link

Double-check harness and warm up heuristics #316

Open maximecb opened 1 month ago

maximecb commented 1 month ago

Kokubun's compilation time investigation made me want to check if we're including compilation time during benchmarking iterations on yjit-bench. The answer is that for the lobster benchmark, which is the largest of our headline benchmarks, yes we are. We're using a fixed number of warm-up iterations (15), and this is enough for smaller benchmarks such as railsbench, but not for lobsters.

Just by printing the delta in compiled_block_count after each iteration for railsbench, we can see that it warms up in just 3 iterations:

itr #1: 4830ms
compiled blocks delta: 32303
itr #2: 1265ms
compiled blocks delta: 61
itr #3: 1376ms
compiled blocks delta: 392
itr #4: 1248ms
compiled blocks delta: 0
itr #5: 1287ms
compiled blocks delta: 0
itr #6: 1249ms
compiled blocks delta: 0
itr #7: 1243ms
compiled blocks delta: 0
itr #8: 1302ms
compiled blocks delta: 0
itr #9: 1253ms
compiled blocks delta: 0
itr #10: 1318ms

The situation is the same for many small benchmarks.

However, for lobsters, this value doesn't reach a stable zero for a long while. It slowly trends towards zero, but there's a looong tail with a few blocks being compiled even after 300 iterations. My guess is that there is a long tail of low-probability branch stubs that keep slowly being hit.

Lobsters blocks compiled per iteration ``` itr #1: 8372ms compiled blocks delta: 51708 itr #2: 3024ms compiled blocks delta: 8638 itr #3: 1042ms compiled blocks delta: 1165 itr #4: 961ms compiled blocks delta: 1208 itr #5: 733ms compiled blocks delta: 357 itr #6: 931ms compiled blocks delta: 865 itr #7: 887ms compiled blocks delta: 900 itr #8: 1068ms compiled blocks delta: 1590 itr #9: 863ms compiled blocks delta: 683 itr #10: 757ms compiled blocks delta: 424 itr #11: 689ms compiled blocks delta: 143 itr #12: 789ms compiled blocks delta: 360 itr #13: 698ms compiled blocks delta: 199 itr #14: 684ms compiled blocks delta: 106 itr #15: 767ms compiled blocks delta: 403 itr #16: 737ms compiled blocks delta: 115 itr #17: 638ms compiled blocks delta: 5 itr #18: 739ms compiled blocks delta: 327 itr #19: 657ms compiled blocks delta: 17 itr #20: 879ms compiled blocks delta: 673 itr #21: 655ms compiled blocks delta: 60 itr #22: 641ms compiled blocks delta: 8 itr #23: 645ms compiled blocks delta: 0 itr #24: 859ms compiled blocks delta: 501 itr #25: 648ms compiled blocks delta: 45 itr #26: 654ms compiled blocks delta: 15 itr #27: 641ms compiled blocks delta: 18 itr #28: 669ms compiled blocks delta: 32 itr #29: 694ms compiled blocks delta: 1 itr #30: 792ms compiled blocks delta: 475 itr #31: 668ms compiled blocks delta: 56 itr #32: 641ms compiled blocks delta: 9 itr #33: 703ms compiled blocks delta: 32 itr #34: 660ms compiled blocks delta: 28 itr #35: 654ms compiled blocks delta: 26 itr #36: 646ms compiled blocks delta: 4 itr #37: 694ms compiled blocks delta: 18 itr #38: 637ms compiled blocks delta: 7 itr #39: 647ms compiled blocks delta: 22 itr #40: 1074ms compiled blocks delta: 1414 itr #41: 721ms compiled blocks delta: 6 itr #42: 667ms compiled blocks delta: 23 itr #43: 703ms compiled blocks delta: 2 itr #44: 691ms compiled blocks delta: 13 itr #45: 680ms compiled blocks delta: 3 itr #46: 646ms compiled blocks delta: 11 itr #47: 634ms compiled blocks delta: 0 itr #48: 633ms compiled blocks delta: 0 itr #49: 701ms compiled blocks delta: 6 itr #50: 649ms compiled blocks delta: 3 itr #51: 641ms compiled blocks delta: 14 itr #52: 636ms compiled blocks delta: 3 itr #53: 672ms compiled blocks delta: 1 itr #54: 641ms compiled blocks delta: 1 itr #55: 633ms compiled blocks delta: 2 itr #56: 626ms compiled blocks delta: 0 itr #57: 640ms compiled blocks delta: 8 itr #58: 695ms compiled blocks delta: 1 itr #59: 651ms compiled blocks delta: 2 itr #60: 719ms compiled blocks delta: 234 itr #61: 651ms compiled blocks delta: 7 itr #62: 695ms compiled blocks delta: 10 itr #63: 649ms compiled blocks delta: 14 itr #64: 645ms compiled blocks delta: 1 itr #65: 628ms compiled blocks delta: 1 itr #66: 689ms compiled blocks delta: 2 itr #67: 638ms compiled blocks delta: 2 itr #68: 634ms compiled blocks delta: 0 itr #69: 633ms compiled blocks delta: 1 itr #70: 651ms compiled blocks delta: 0 itr #71: 681ms compiled blocks delta: 0 itr #72: 650ms compiled blocks delta: 3 itr #73: 641ms compiled blocks delta: 8 itr #74: 651ms compiled blocks delta: 0 itr #75: 693ms compiled blocks delta: 0 itr #76: 631ms compiled blocks delta: 0 itr #77: 626ms compiled blocks delta: 0 itr #78: 638ms compiled blocks delta: 0 itr #79: 629ms compiled blocks delta: 0 itr #80: 675ms compiled blocks delta: 0 itr #81: 622ms compiled blocks delta: 0 itr #82: 636ms compiled blocks delta: 1 itr #83: 634ms compiled blocks delta: 4 itr #84: 666ms compiled blocks delta: 2 itr #85: 622ms compiled blocks delta: 0 itr #86: 648ms compiled blocks delta: 0 itr #87: 634ms compiled blocks delta: 6 itr #88: 676ms compiled blocks delta: 7 itr #89: 633ms compiled blocks delta: 0 itr #90: 638ms compiled blocks delta: 0 itr #91: 640ms compiled blocks delta: 0 itr #92: 633ms compiled blocks delta: 0 itr #93: 702ms compiled blocks delta: 0 itr #94: 652ms compiled blocks delta: 0 itr #95: 644ms compiled blocks delta: 0 itr #96: 638ms compiled blocks delta: 0 itr #97: 685ms compiled blocks delta: 0 itr #98: 650ms compiled blocks delta: 0 itr #99: 641ms compiled blocks delta: 0 itr #100: 643ms compiled blocks delta: 12 ```

@rwstauner My first question would be: what are we currently doing for warm-up in yjit-metrics? Are we using the same fixed 15 warm-up iterations?

@k0kubun The second question is what do we want to do about this? We may want to implement the same solution in both yjit-metrics and yjit-bench, even though they don't use the same harness.

In theory we can just bump the number of iterations for warm-up to a bigger value, but this doesn't seem great, because it will slow down smaller benchmarks. Another solution is to have some kind of adaptive mechanism. We could decide that a benchmark is warmed up when it doesn't compile any new blocks for 5 iterations, for example, and also make sure that benchmarks run for at least 8-10 warm up iterations minimum. That may actually speed up things for smaller benchmarks. I guess as a first step I will check with all of the headline benchmarks that this heuristic would actually work.

@nirvdrum a third question is what are we actually compiling in lobsters after 300+ iterations? Am I correct in assuming that it's branch stubs being hit? This is a question to be answered using the compilation log, and once we have that answer, we may want to think of heuristics to stop compiling new code.

maximecb commented 1 month ago

Update: doing some quick empirical testing on our headline benchmarks, it looks like an adaptive heuristic which goes something like "we are done warming up after a minimum of 8 iterations, and after we've compiled zero blocks for 4 iterations" would probably work decently well in practice. Though it wouldn't be 100% perfect.

The hexapdf benchmark exhibits a similar long-tail behavior as lobsters. It has a weird spike in compiled blocks at iteration #31, and this spike occurs after 8 iterations with zero blocks compiled.

I'm actually seeing that a number of benchmarks have a spike in compiled blocks after 30 iterations, which makes sense given our call threshold heuristic. An alternative way to do adaptive warm-up could be to benchmark for a minimum of 30 iterations, and then we can check in-hindsight if we stopped compiling after the 10th iteration. If not, count the first 30 iterations as warm-up and then run for another 30 iterations, then check how much we compiled during those 30 new iterations, keep going until we get a minimum number of benchmarking iterations with zero or near-zero new blocks compiled.

A third way to go would be to just increase the number of warm-up iterations to 40 and call it good enough (though that could obviously slow down some benchmarks).

k0kubun commented 1 month ago

Just by printing the delta in compiled_block_count after each iteration for railsbench

off topic: I often need to manually modify yjit-bench locally to achieve things like this. Now that you're doing it too, I filed a PR to add that feature https://github.com/Shopify/yjit-bench/pull/330. This helped the following investigation:

The second question is what do we want to do about this?

At first, I thought the two-step call threshold might be causing this. However, forcing it to --yjit-call-threshold=10 did not fix the problem; it just kept compiling encode_basic@(eval at /opt/rubies/ruby/lib/ruby/gems/3.4.0+0/gems/htmlentities-4.3.4/lib/htmlentities/encoder.rb:81) even with an increased number of benchmark iterations.

Given that it's coming from an eval, I suspect that the warmup of lobsters may never end (until it hits the --yjit-exec-mem-size limit). If that's the case, we can't fix this from yjit-bench's side. We should see if we can fix how they use htmlentities.gem.

Hexapdf does stop compiling after a reasonably small number of iterations, so it makes sense to do something to increase the number of warmup iterations for hexapdf if necessary.

maximecb commented 1 month ago

I guess I'd prefer not defining the number of warmup iterations by hand for each benchmark as it seems failure-prone. We could change how YJIT decides to compile code and break our manually set iteration counts without noticing. I could try to experiment with different heuristics for adaptable warm-up.

Given that it's coming from an eval, I suspect that the warmup of lobsters may never end (until it hits the --yjit-exec-mem-size limit). If that's the case, we can't fix this from yjit-bench's side. We should see if we can fix how they use htmlentities.gem.

Hmm yeah. If there was a way to get rid of that eval it would be ideal. If not, then we would have to settle for "still compiling but not compiling much proportionally" to be good enough.

maximecb commented 1 month ago

@rwstauner can you provide some context on how warm-up is handled in yjit-metrics?

rwstauner commented 1 month ago

it just kept compiling encode_basic@(eval at /opt/rubies/ruby/lib/ruby/gems/3.4.0+0/gems/htmlentities-4.3.4/lib/htmlentities/encoder.rb:81)

I think this is what Jean fixed: https://github.com/lobsters/lobsters/pull/1320 That PR has merged, but we haven't synchronized that over to yjit-bench yet.

rwstauner commented 1 month ago

The current default that yjit-metrics is using is a static 10 warmups: https://github.com/Shopify/yjit-metrics/blob/7938da4d7214d631052ca784f505bf5dda5a4eda/continuous_reporting/benchmark_and_update.rb#L72

maximecb commented 1 month ago

Thank you Randy. Still not sure what to do here. I feel like ideally we should figure out a better criteria and make things more uniform between yjit-bench and yjit-metrics (use the same warm up criteria). I may experiment with some other warm-up approaches tomorrow.

rwstauner commented 1 month ago

Back when we were using the "variable warmup report" (prior to May 2024) it would default to 30 warmups, which made the whole thing take several more hours

maximecb commented 1 month ago

Ok maybe there is a way that we can do something like 30 warm-up iterations or maximum 1 minute or maximum 30 seconds or something like that.

rwstauner commented 1 month ago

I think it's worth looking at the differences between the two repos. I'd love to make the yjit-bench harness modular enough to handle whatever differences we need here, but I haven't compared them yet (and I know we want to keep the harness slim to avoid altering memory usage, etc).

rwstauner commented 1 month ago

Looking at the 2024-09-18 results for arm, these are the results where warmup (10 iterations) is already taking more than 30s:

                     max        avg     total
       30k_methods: 5.46        3.85    38.51
     fannkuchredux: 6.46        4.30    43.01
        psych-load: 4.62        4.56    45.60
           fluentd: 5.76        4.63    46.30
           hexapdf: 6.97        5.42    54.21
        railsbench: 7.50        5.99    59.88
         optcarrot: 8.18        7.98    79.83
         ruby-json: 8.23        8.22    82.15
           graphql: 9.67        9.30    93.01
           rubykon: 18.87       18.60   185.97

Only 4 of those are currently taking more than 1 minute to do 10 warmup itrs.

Most warmup itrs currently look to be under 1s, so doing up to 30 warmups or 60 seconds would probably get us pretty far.

maximecb commented 1 month ago

With respect to Rubykon, that benchmark seems to take especially long to run with CRuby, but it is 3x faster with YJIT. Maybe one simple thing that we can do is cap the warmup iterations for CRuby to 4 or 5 iterations, and use a variable warm-up for YJIT?

I'm also going to go look if I can edit Rubykon to have a faster per-iteration time, because it is super duper slow on CRuby.