dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Investigate improving JIT heuristics with machine learning #92915

Closed AndyAyersMS closed 1 month ago

AndyAyersMS commented 11 months ago

Overview

There is now fairly substantial literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.

Here's a closely relevant paper and GitHub repo: MLGO: a Machine Learning Guided Compiler Optimizations Framework. Like our efforts below, it leverages a Policy Gradient algorithm (reinforcement learning).

Several years ago (see https://github.com/AndyAyersMS/PerformanceExplorer/blob/master/notes/notes-aug-2016.md) we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict whether an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area and see if PGO was indeed the key missing ingredient.

Proposed Work

During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.

We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to try to refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):

- [ ] Study the current CSE heuristic and try to understand what it is modelling. Identify apparent shortcomings and limitations.
- [ ] Develop tooling for varying which CSEs the JIT can perform, either in concert with the current heuristic or independent of it. Measure the impact of this variation on key JIT metrics (performance, code size, JIT time, etc).
- [ ] See how well PerfScores can be used as a proxy for measuring actual performance. Ideally, we can leverage PerfScores to avoid needing to run benchmarks repeatedly. But if not, perhaps we can still rely on PerfScores to limit or prioritize the runs needed.
- [ ] Try a number of different ML techniques for improving CSE heuristics, both "small step" (evaluating one CSE at a time) and "big step" (evaluating an entire set of CSEs). Identify the key predictive observations. Try both interpretable and black box models.
- [ ] See how feasible it is to have a single architecture-agnostic heuristic (perhaps parameterized like the current heuristic with register info) or whether different heuristics are warranted for, say, arm64 and x64.
- [ ] Develop a common ML "infrastructure" that we can apply to other similar heuristics in the JIT.

Update May 2024.

Given the work above we have been able to produce heuristics that improve the aggregate perf score for methods via CSE, with about a 0.4% geomean improvement across all methods with CSE candidates. So far those results haven't translated into widespread improvements in our perf lab data. Exactly why is unclear; the data below suggests one or more contributing factors.

The training evaluations show that the best ML heuristic is obtaining about half of what could be gotten from an optimal heuristic. Some recent results:

Best/base: 0.9913 [optimizing for  score]
vs Base    0.9957 Better 440 Same 1476 Worse 84
vs Best    1.0044 Better 0 Same 1607 Worse 393

Params     0.6093, 0.7750,-0.5701,-0.6827, 0.5060,-0.0514,-1.3563, 0.4515,-2.0999, 0.0000,-1.0623,-1.3723, 0.0000,-0.6143,-0.8286,-0.2956,-1.1433, 0.3418, 1.3964,-0.0043,-0.4237, 0.6097,-1.9773,-0.3684, 0.7246

Collecting greedy policy data via SPMI... done (27583 ms)
Greedy/Base (score): 34965 methods, 8375 better, 24792 same, 1797 worse,  0.9960 geomean
Best:   76679 @  0.7731
Worst:  82000 @  1.4512

Greedy/Base (size): 34965 methods, 4628 better, 24694 same, 5642 worse,  1.0005 geomean
Best:   15489 @  0.7000
Worst:  82000 @  1.4205

We don't have any more active work planned on machine learning of heuristics for .NET 9. However, we expect to revisit this area as .NET 9 work winds down, so stay tuned for further developments in the coming months.

ghost commented 11 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

Issue Details
Author: AndyAyersMS
Assignees: AndyAyersMS
Labels: `area-CodeGen-coreclr`
Milestone: 9.0.0
AndyAyersMS commented 11 months ago

Initial Notes on the Current JIT CSE Heuristic

  1. Does the initial frame size computation hold up well? Can we justify the 3/2/1 formula in CSEHeuristic::Initialize?
     a. Disable CSE and see how well the number of tracked locals in registers matches up with the prediction. If not, see if there is some way to make a better prediction.
     b. Then slowly enable CSE, and see if this continues to hold up well.
     c. See how well frame size matches the estimates; if not good, see if we can improve this somehow.

  2. Do the aggressive/moderate thresholds make sense? That is, should they be higher or lower? Based on something else?
     a. Trickier to see how to model this; perhaps keep processing CSEs and see when spilling starts (that is, do a local sensitivity analysis)? This might be counterbalanced by the promotion code.
     b. How about the overall approach of setting these thresholds based on IR observations? Seems sensible, but maybe there are other ideas to try.
     c. At the end of CSEHeuristic::Initialize we increase the thresholds if they seem too low. Why?
     d. Also, doesn't that increase mix up weighted and unweighted cases? How does it make sense to compare unweighted ref counts with BB_UNITY_WEIGHT?
     e. Should we try some "true" register pressure estimate, or is that hopeless? We also don't know how the pressure points intersect with potential CSE temp liveness.

  3. Is the assumption here that FP CSEs are rare true?

  4. How reliable are CostSz and CostEx?
     a. Probably a can of worms... how would we even go about validating these?

  5. Should candidates be ranked by local benefit, by global benefit, or by some ratio of benefit / cost?
     a. That is, why doesn't Compiler::optCSEcostCmpEx consider csdUseWtCnt more prominently? The right figure of merit (when optimizing for speed) seems to be csdUseWtCnt * tree->CostEx(), so a relatively "cheap" CSE that is heavily used is preferable to a "costly" CSE that is not as heavily used.
     b. Should this factor in code size even in speed modes? Assuming CSEs generally decrease size a bit, we might want to bias the above towards the heavier ones slightly.
     c. Do the threshold increases in PerformCSE make sense? This gives a budget-like aspect to the ranking, so the order does matter. What if we ignored this?

  6. Are we missing candidates?
     a. Review bail-outs in Compiler::optIsCSEcandidate.
     b. Review limitations in Compiler::optValnumCSE_Locate.

  7. In ConsiderCandidates, is the computation sensible? We have 3 models x 2 modes ({aggressive/moderate/conservative} x {size, speed}), plus some stress modes (including random...).
     a. Seems like we mix weighted/unweighted costs up here at times, eg extra_no_cost is unweighted (size) based, but we add it to no_cse_cost, which can be weighted (speed) based.
     b. Various magic numbers and twisty decision logic.
     c. Various per-arch costs, some of them look dated.

  8. Is the "shared constant" CSE code sensible? Here we disregard the low bits of constants and at use points we rematerialize the missing bits via an add. a. So use cost is a bit higher... but this does not seem to be modelled by the heuristics ? b. Is the "best base" value computation reasonable? Seems like we're trying to use small immediates here?

AndyAyersMS commented 11 months ago

Perf vs Perf Score

Here's a chart of about 3000 measurements across 50 or so benchmarks, varying the set of CSEs we do in the hottest method. Each data point plots the ratio of measured perf to the default JIT's perf (y-axis) against the ratio of perf score to the default JIT's perf score (x-axis). Thus values above 1.0 on the y-axis are benchmarks that ran more slowly, and values to the right of 1.0 on the x-axis are benchmarks that were predicted to run more slowly.

We see most data in the "slower and predicted slower" (upper right) quadrant because CSEs generally improve perf, so doing fewer of them than we'd normally do will likely slow down the benchmark.

Linear fit has a correlation coefficient of about 0.3, so there is a fair amount of unexplained variation in perf. Also I have excluded one benchmark's data from this as it seems anomalous.

image

Ideally, we'd be able to get a better correlation out of perf scores, since these can be produced very quickly as we vary JIT behavior, on the order of 1000 experiments per second. If we actually run benchmarks then it takes around 30 seconds for one experiment, so there is a massive benefit to leveraging perf scores.
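
For concreteness, the kind of processing behind the chart above looks roughly like the sketch below (Python; the row and field names are illustrative, not the actual harness):

```python
# Illustrative sketch: given per-experiment rows of (benchmark, cse_mask,
# perf_us, perf_score) plus a baseline (default JIT) row per benchmark,
# compute the two ratios charted above and a linear fit / correlation.
import numpy as np

def ratios(rows, baselines):
    """rows: list of dicts; baselines: dict benchmark -> (perf_us, perf_score)."""
    xs, ys = [], []
    for r in rows:
        base_perf, base_score = baselines[r["benchmark"]]
        xs.append(r["perf_score"] / base_score)  # predicted slowdown (x-axis)
        ys.append(r["perf_us"] / base_perf)      # measured slowdown (y-axis)
    return np.array(xs), np.array(ys)

def fit(xs, ys):
    slope, intercept = np.polyfit(xs, ys, 1)
    r = np.corrcoef(xs, ys)[0, 1]                # ~0.3 in the data described above
    return slope, intercept, r
```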

The main contributors to poor perf predictability are thought to be (in no particular order):

So the task here is to dig into these issues and see if and how perf scores can be improved.

Also I need to look into why MaximizeSchwarzCriterion has anomalous results; including data from this benchmark drops R^2 to essentially zero.

image

ShreyasJejurkar commented 11 months ago

This is a great initiative; just wondering, do other popular (JIT-based) programming languages (e.g., Java) have something like this in place? Would love to read their outcome. :)

AndyAyersMS commented 11 months ago

This is a great initiative; just wondering, do other popular (JIT-based) programming languages (e.g., Java) have something like this in place? Would love to read their outcome. :)

I think I may have read some papers by the Graal folks where they leveraged ML, but I don't have the references handy. Presumably they'd be somewhere in the big list of papers I referred to above.

AndyAyersMS commented 11 months ago

Have been digging into why the PerfScores have such poor correlation on MaximizeSchwarzCriterion and it seems like it is a combination of a few effects:

Thus the block weights (in particular the large weights for innermost loop blocks) can vary quite a bit from one run to the next, and that seems to be the major factor. You can see this by just running the benchmark multiple times with the same jit configuration:

Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,649125874.85,1.000,59311.6714,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,80505239.17,0.124,61848.4732,1.043
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1532,421090857.67,0.649,65608.5883,1.106
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,1307172767.70,2.014,59236.0308,0.999
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1584,108418141.52,0.167,65119.4767,1.098
Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,100478799.14,1.000,61020.7827,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,69888793.36,0.696,63267.6950,1.037
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1530,69621450.93,0.693,66241.6050,1.086
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,108183513.60,1.077,58789.5833,0.963
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1582,70447641.30,0.701,64906.9033,1.064

In particular the first line of each set above shows two "identical" runs with default JIT behavior: the first took 59,311us with PerfScore 649,125,874.85 and the second 61,020us with PerfScore 100,478,799.14. So perf varied by about 1.03x and PerfScore by about 6x.

For the purposes of this investigation, we can just exclude methods that have OSR methods.

This variation in weights is possibly less of a problem in practice though it may be a contributor to the kind of fluctuating performance we noted in #87324. We should think about better strategies to counteract it:

MichalPetryka commented 11 months ago
  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2R'd code?

AndyAyersMS commented 11 months ago
  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2R'd code?

Yes -- the idea there was to avoid needing to go from R2R (optimized) -> Tier0 (unoptimized) to gather PGO data. Here the idea would be to go from Tier0+instr -> (OSR) -> Tier1+instr -> Tier1, for methods that are called infrequently (but more than once) and contain hot loops. The second instrumented run is there to fill out the missing bits of profile in the parts of the method that currently only run under OSR.

AndyAyersMS commented 8 months ago

Some more thoughts on applying ML to the CSE heuristic.

My current thinking is to approach this as a model-based reinforcement learning application. We think we know which CSEs are good ones (our initial model), we don't have any repository of known good results (aka labels), and we may incur significant costs in obtaining data. Our goal is to improve on the current model or find a better one.

For the initial cut we can just rely on perf scores (flawed as they may be) as the overall approach won't be that impacted; if and when we have it working well, we can think about mixing in actual measurements. Perf scores (if viable) offer a lot of advantages including the ability to assess results on methods that are not prominent in any benchmarks.

As a prelude to all that I want to elaborate the current model and figure out the right way to parameterize it. Currently CSE candidates are ranked by a sort function, and then attempted in rank order. As CSEs are performed the acceptance thresholds ratchet upwards so there is an order dependence. So we might hope that the reinforcement-learned model could be expressed as some numeric score with say higher values indicating higher preferences and lower values lower ones, and perhaps zero or negative values indicating CSEs that should not be done.

The ratcheting is an attempt to model register pressure. The thresholds are loosely set based upon properties of the weighted local var counts (here I am ignoring for now the "optimize for size" mode), eg the aggressive cutoff is the weight of the 13th highest ranked local var (pre-CSE). More broadly one can imagine that the model would have access to general properties of the lcl var weight distribution, expressed as a sort of CDF, eg we would like to know roughly how many lcl vars do we have to consider to reach say 90% of all (weighted) local var occurrences.
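
As a sketch of the kind of distribution summary meant here (illustrative only; the function and parameter names are made up, and the real cutoff logic lives in the JIT's CSE heuristic):

```python
# Sketch: summarize the weighted local-var ref count distribution for a method.
# 'weights' would be the weighted ref counts of tracked locals, pre-CSE.
def summarize_lclvar_weights(weights, aggressive_rank=13, coverage=0.90):
    w = sorted(weights, reverse=True)
    total = sum(w)
    # Cutoff in the spirit of the current heuristic: weight of the Nth-ranked local.
    aggressive_cutoff = w[aggressive_rank - 1] if len(w) >= aggressive_rank else 0.0
    # CDF-style question: how many locals does it take to cover 90% of all
    # weighted occurrences?
    running, n_to_cover = 0.0, len(w)
    for i, x in enumerate(w, 1):
        running += x
        if total > 0 and running >= coverage * total:
            n_to_cover = i
            break
    return aggressive_cutoff, n_to_cover
```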

Broadly speaking, a particular CSE will add a new local var whose weight we can compute (from the def and use weights) and remove weight from some of the existing locals. The current heuristic roughly estimates the first (via the ratchet) but does not take into account the second. Thus, for instance, if we CSEd a large computation tree to effect a hoist within a loop, we might actually expect the overall register pressure to decrease. Thinking of this all as a sort of Markov Decision Process, as we do one CSE the viability of the others will shift.

If we want the model to handle this sort of order dependence, then after each CSE we need to re-evaluate the scores of the remaining candidates.
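
The shape of that re-evaluation loop would be something like the sketch below (the candidate representation and scoring function are placeholders, not the JIT's actual data structures):

```python
# Sketch: apply CSEs one at a time, re-scoring the survivors after each step,
# and stop when nothing scores above zero. 'score' is the learned preference;
# 'apply_cse' mutates the method state / features that 'score' observes.
def apply_cses_greedily(candidates, score, apply_cse):
    done = []
    remaining = list(candidates)
    while remaining:
        scored = [(score(c, done), c) for c in remaining]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= 0:          # zero/negative preference: stop
            break
        apply_cse(best)
        done.append(best)
        remaining.remove(best)       # remaining candidates get re-scored next pass
    return done
```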

I will note along these lines that the current CSE heuristic has some quirks that seem interesting. For example, in a simple case with just two candidates A & B, if we order them A, B we do both CSEs; if we order them B, A, we do just B, because B is call-crossing and has a higher ratchet and makes A look unattractive. This struck me as odd.

So, factors that might feed into the model:

Some preliminary sub problems to examine:

AndyAyersMS commented 8 months ago

Here is some data on the current heuristic, from the asp.net SPMI collection [full table here]

115838 methods, 34802 methods with cses; 193807 cse candidates, 56973 cses
NumCandiates [count, odds]: ... details ...
 1 [n: 10259 o: 0.53]:  0: 4836;  1: 5423
 2 [n:  4913 o: 0.44]:  0: 1760;  1: 2028;  2: 1125
 3 [n:  4058 o: 0.36]:  0: 1351;  1: 1329;  2: 1031;  3:  347
 4 [n:  3454 o: 0.37]:  0:  788;  1: 1100;  2:  860;  3:  570;  4:  136
 5 [n:  2576 o: 0.35]:  0:  586;  1:  497;  2:  792;  3:  470;  4:  175;  5:   56
 6 [n:  1627 o: 0.31]:  0:  264;  1:  452;  2:  474;  3:  182;  4:  227;  5:   21;  6:    7
 ...

I was surprised to see such low odds for accepting candidates, on average only about 0.25. So there is certainly headroom to be more aggressive.

For these methods with small numbers of candidates it should be feasible to exhaustively enumerate all the possible sets of CSEs and look at the resulting spectrum of perf scores, and chart out how well the existing heuristic performs (as measured by perf score) vs the best score. For larger sets we can at least attempt to gather a decent sample of data.

Working on that now.

AndyAyersMS commented 8 months ago

Here is some initial data on the performance of the current heuristic.

Using the asp.net SPMI collection I selected 100 methods from each of the categories above, and then ran $2^n + 1$ experiments: the baseline run, plus jitting once for each possible subset of CSEs (say there were 5 CSE candidates; this is 33 runs). I then looked at how often the default heuristic got the best possible perf score, and at the geomean ratio of the default perf score to the best across those 100 runs.
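
Per method, the evaluation amounts to something like the following sketch, where `perf_score_for_mask` stands in for an SPMI replay of the method with a given CSE mask:

```python
from math import exp, log

# Sketch: for a method with n CSE candidates, try every mask and compare the
# default heuristic's perf score against the best achievable one.
def default_vs_best(n, perf_score_for_mask, default_score):
    best = min(perf_score_for_mask(mask) for mask in range(1 << n))  # 2^n runs
    return default_score / best   # 1.0 means the default picked an optimal set

def geomean(ratios):
    # Aggregate the per-method ratios into the geomean figures reported below.
    return exp(sum(log(r) for r in ratios) / len(ratios))
```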

The preliminary data looks like this:

  --- 1 candidate cases: default heuristic was optimal in 90 of 100 cases; geomean 1.003
  --- 2 candidate cases: default heuristic was optimal in 81 of 100 cases; geomean 1.006
  --- 3 candidate cases: default heuristic was optimal in 65 of 100 cases; geomean 1.008
  --- 4 candidate cases: default heuristic was optimal in 62 of 100 cases; geomean 1.006
  --- 5 candidate cases: default heuristic was optimal in 50 of 100 cases; geomean 1.012
  --- 6 candidate cases: default heuristic was optimal in 49 of 100 cases; geomean 1.006

So the rough trend is that as the number of candidates increases, the likelihood of the default heuristic getting the best possible perf score decreases, and in aggregate the default heuristic falls short of the best perf score by around 0.6% or so.

This data took a few minutes to gather; I'll allow it to run on more cases overnight.

I have some rough edges to polish here, and to explore some of the even larger candidate cases I'll need some kind of Monte Carlo strategy, as the exponential number of runs starts becoming a factor.

AndyAyersMS commented 8 months ago

Here is the "complete" set of results from asp.net, up through cases with 11 cse candidates. My current approach is going to take to long to get data for higher numbers of candidates, eg for an 11 candidate case we need 2^11 runs of SPMI over the method context varying the CSE mask, so the entire set of 526 case required over 1 million runs.

  ---  1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005
  ---  2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007
  ---  3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006
  ---  4 candidate cases: baseline heuristic was optimal in 2087 of  3454 cases 60.42%; geomean 1.005
  ---  5 candidate cases: baseline heuristic was optimal in 1382 of  2576 cases 53.65%; geomean 1.010
  ---  6 candidate cases: baseline heuristic was optimal in  825 of  1627 cases 50.71%; geomean 1.006
  ---  7 candidate cases: baseline heuristic was optimal in  492 of   999 cases 49.25%; geomean 1.006
  ---  8 candidate cases: baseline heuristic was optimal in  522 of   979 cases 53.32%; geomean 1.004
  ---  9 candidate cases: baseline heuristic was optimal in  357 of   711 cases 50.21%; geomean 1.004
  --- 10 candidate cases: baseline heuristic was optimal in  278 of   637 cases 43.64%; geomean 1.004
  --- 11 candidate cases: baseline heuristic was optimal in  236 of   526 cases 44.87%; geomean 1.004

Next steps from here:

  1. Consider making the above faster. We should be able to use a release jit (if we make perf score and other metrics available) and run the various cases concurrently. That would speed up the data gathering by maybe 20x. We could also fan out to multiple machines...
  2. In some of the runs above the default heuristic is able to get a better perf score than any of the masked variants. Figure out why.
  3. In some of the runs above the SPMI replay fails with some CSE subsets. The ones I looked at were missing helper call infos for write barrier helpers. See if we can perhaps extend the SPMI collection to make replay more robust.
  4. I have been assuming there is no order dependence in the masked CSE replays. Say there are 3 candidates: the above process will do 9 runs: default, {}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}. But if there is order dependence, we should also be running the various permutations: {2, 1}, {3, 1}, {3, 2}, {1, 3, 2}, {2, 1, 3}, {2, 3, 1}, {3, 1, 2}, {3, 2, 1}. We might need to do this anyways since we will likely be estimating the reward (benefit) of each CSE via some sort of "temporal difference" scheme. At any rate we should validate the order dependence or independence (order dependence may explain point 2 above).
  5. Even if there is no order dependence in results there might be one in modelling. Consider the case of a nested CSE. If 2 nests within 1, then we get slightly different IR if we do {1, 2} vs {2, 1} -- abstractly this should not matter for codegen but in practice it might. But when looking at rewards, the reward for 2 will be diminished if we do 1 first, and the reward for 1 will be diminished if we do 2 first (and it's possible that 2 becomes non-viable after 1, if all its uses are subsumed by uses of 1).
  6. Assuming we're happy with the data gathered above, the next step is to figure out how to make use of it to craft a policy. The rough thinking is to come up with a set of observations that we think are relevant (we already have some) and build some sort of parameterized evaluation function. We then redo the runs above, record the metrics at each step, and use a reinforcement learning algorithm to gradually modify the evaluation function so that it is increasingly able to pick the optimal sequences of CSEs.
  7. We should employ good hygiene here and not over-fit, so doing some kind of 5- or 10-fold cross validation seems sensible. That means running the above 10 times.
  8. We should also try other collections / architectures / OSes to see how well that aspect generalizes.
  9. All of this is "perf score optimal" and may not reflect reality, so we still need to sanity check with actual results somehow.
AndyAyersMS commented 8 months ago

I figured out what was leading to (2) above -- I'd forgotten that the config settings get parsed as hex, so for the 4+ candidate cases the tool was not properly exploring the full subset of possibilities. With that fixed I don't see any cases where the default heuristic can do better than a subset.

Updated data shows that for the 4+ candidate cases the heuristic is now a bit worse (not surprising, as some of those missing subsets were going to be good ones) and the upside opportunity is a bit larger (also not surprising). I've also added numbers on how many SPMI runs are required; note that as the candidate counts increase the evaluation takes exponentially more runs, but the number of such cases falls off, so the overall growth is a bit milder...

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)
AndyAyersMS commented 8 months ago

Fixed some other issues; latest full run data

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)
  --- 8 candidate cases: baseline heuristic was optimal in  454 of   979 cases 46.37%; geomean 1.009 (251603 runs)
AndyAyersMS commented 8 months ago

I have a version of MCMC running for larger instances, and have verified that it matches up pretty well to exhaustive enumeration in the medium-sized cases. The trend noted above continues:
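
The sampling side of this can be as simple as the sketch below (illustrative only, not the actual driver):

```python
import random

# Sketch: when 2^n runs is too many, sample random CSE masks instead of
# enumerating them all; always include "no CSEs" and "all CSEs" so those
# endpoints are covered.
def sample_masks(n, budget, seed=0):
    if budget >= (1 << n):
        return list(range(1 << n))          # small case: just enumerate
    rng = random.Random(seed)
    masks = {0, (1 << n) - 1}
    while len(masks) < budget:
        masks.add(rng.getrandbits(n))
    return sorted(masks)
```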

  ---  6 candidate cases: baseline heuristic was optimal in 734 of 1627 cases 45.11%; geomean 1.009 baseline win from CSE 0.960 max win 0.951 (105755 runs in    775,494ms)
  ---  7 candidate cases: baseline heuristic was optimal in 415 of  999 cases 41.54%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (128871 runs in  1,742,522ms)
  ---  8 candidate cases: baseline heuristic was optimal in 451 of  979 cases 46.07%; geomean 1.009 baseline win from CSE 0.958 max win 0.949 (251603 runs in  3,634,848ms)
  ---  9 candidate cases: baseline heuristic was optimal in 278 of  711 cases 39.10%; geomean 1.008 baseline win from CSE 0.956 max win 0.949 (182727 runs in  5,070,407ms)
  --- 10 candidate cases: baseline heuristic was optimal in 183 of  637 cases 28.73%; geomean 1.008 baseline win from CSE 0.959 max win 0.951 (163709 runs in  6,406,325ms)
  --- 11 candidate cases: baseline heuristic was optimal in 142 of  526 cases 27.00%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (135182 runs in  7,543,211ms)
  --- 12 candidate cases: baseline heuristic was optimal in 231 of  535 cases 43.18%; geomean 1.009 baseline win from CSE 0.966 max win 0.957 (137495 runs in  8,664,782ms)
  --- 13 candidate cases: baseline heuristic was optimal in  86 of  342 cases 25.15%; geomean 1.009 baseline win from CSE 0.962 max win 0.953 (87894  runs in  9,376,476ms)
  --- 14 candidate cases: baseline heuristic was optimal in  63 of  265 cases 23.77%; geomean 1.009 baseline win from CSE 0.943 max win 0.935 (68105  runs in  9,937,161ms)
  --- 15 candidate cases: baseline heuristic was optimal in  51 of  237 cases 21.52%; geomean 1.009 baseline win from CSE 0.959 max win 0.951 (60909  runs in 10,473,161ms)
  --- 16 candidate cases: baseline heuristic was optimal in  75 of  262 cases 28.63%; geomean 1.007 baseline win from CSE 0.967 max win 0.960 (67334  runs in 11,120,183ms)
  --- 17 candidate cases: baseline heuristic was optimal in  72 of  246 cases 29.27%; geomean 1.007 baseline win from CSE 0.976 max win 0.969 (63222  runs in 11,702,941ms)
  --- 18 candidate cases: baseline heuristic was optimal in  34 of  196 cases 17.35%; geomean 1.007 baseline win from CSE 0.959 max win 0.952 (50372  runs in 12,185,341ms)
  --- 19 candidate cases: baseline heuristic was optimal in  29 of  198 cases 14.65%; geomean 1.010 baseline win from CSE 0.968 max win 0.959 (50886  runs in 12,697,428ms)
  --- 20 candidate cases: baseline heuristic was optimal in  24 of  172 cases 13.95%; geomean 1.010 baseline win from CSE 0.961 max win 0.951 (44204  runs in 13,134,515ms)
  --- 21 candidate cases: baseline heuristic was optimal in   6 of  110 cases  5.45%; geomean 1.010 baseline win from CSE 0.953 max win 0.944 (28270  runs in 13,445,214ms)
  --- 22 candidate cases: baseline heuristic was optimal in  13 of   97 cases 13.40%; geomean 1.006 baseline win from CSE 0.955 max win 0.949 (24929  runs in 13,706,788ms)
  --- 23 candidate cases: baseline heuristic was optimal in   4 of   92 cases  4.35%; geomean 1.011 baseline win from CSE 0.966 max win 0.955 (23644  runs in 14,004,816ms)
  --- 24 candidate cases: baseline heuristic was optimal in   5 of  119 cases  4.20%; geomean 1.013 baseline win from CSE 0.967 max win 0.955 (30583  runs in 14,316,467ms)
  --- 25 candidate cases: baseline heuristic was optimal in   7 of   63 cases 11.11%; geomean 1.011 baseline win from CSE 0.967 max win 0.957 (16191  runs in 14,510,376ms)
  --- 26 candidate cases: baseline heuristic was optimal in   2 of   97 cases  2.06%; geomean 1.010 baseline win from CSE 0.974 max win 0.964 (24929  runs in 14,789,976ms)
  --- 27 candidate cases: baseline heuristic was optimal in   1 of   74 cases  1.35%; geomean 1.022 baseline win from CSE 0.954 max win 0.934 (19018  runs in 15,023,453ms)
  --- 28 candidate cases: baseline heuristic was optimal in   7 of   73 cases  9.59%; geomean 1.015 baseline win from CSE 0.974 max win 0.960 (18761  runs in 15,263,888ms)
  --- 29 candidate cases: baseline heuristic was optimal in   1 of   69 cases  1.45%; geomean 1.009 baseline win from CSE 0.975 max win 0.966 (17733  runs in 15,469,362ms)
  --- 30 candidate cases: baseline heuristic was optimal in  17 of   66 cases 25.76%; geomean 1.011 baseline win from CSE 0.945 max win 0.934 (16962  runs in 15,683,842ms)

Using the above I can now generate "perf score"-optimal results for cases with small numbers of CSE candidates, and can elaborate the entire value function $V(s, a)$ either explicitly or graphically... eg

image

Here green shows the optimal policy, dog-eared nodes show the current policy, and square boxes are terminal states.

As noted in (4) above there doesn't seem to be order dependence in the values; if so, that greatly cuts down the size of the state space. If there is order dependence then the state space grows faster than $n!$ -- the actual growth is https://oeis.org/A000522. If there is no order dependence then it's just $2^n$. We still need MCMC for larger cases, but can cover a much greater fraction.
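
To make the growth concrete, here's a small sketch computing the ordered-sequence count (OEIS A000522) next to the unordered subset count $2^n$:

```python
from math import perm

def ordered_states(n):
    # Number of distinct ordered CSE sequences (of any length) drawn from n
    # candidates: sum over k of P(n, k) -- this is OEIS A000522.
    return sum(perm(n, k) for k in range(n + 1))

for n in range(1, 8):
    print(n, ordered_states(n), 2 ** n)
# e.g. n=3 -> 16 ordered sequences vs 8 unordered subsets;
#      n=7 -> 13700 vs 128, so order-independence is a big win.
```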

So I plan to screen all this data to see if I ever see a case where the order of CSEs matters for final perf score. If it never happens or is suitably rare, we'll just ignore it.


The data above can't directly be used to craft a policy, since the state is a specific method instance plus a sequence of CSEs. In reality the jit won't have already seen a given method... so we need to generalize the state and try building a policy that way.

My initial plan is to use a policy gradient method. We know some of the factors that should influence CSEs; these are already considered by the current policy, though perhaps not in optimal form:

So we can start with these. Since there is a large dynamic range for weighted costs, I am tempted to use a log to keep the data a bit more reasonable. We also need to assign some preference to stopping the process, so there is one extra action to model. The initial model will likely be linear, so the preference for an action (CSE) will be the dot product of the observation vector based on that CSE and the ambient state, and a parameter vector that is all numeric and provides weights for each observation. This can be generalized to include nonlinear functions of observations (or combinations of observations) if necessary.

For the reward we can use absolute perf scores, but these don't generalize well across methods, so I am tempted to use the perf score relative to doing no CSEs (or it could also be the score relative to the best known score).
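
Putting those pieces together, the policy and reward would look roughly like the sketch below (feature extraction and the update schedule are placeholders, not the actual implementation):

```python
import numpy as np

# Sketch: linear preferences over candidate features plus a synthetic "stop"
# action, turned into action probabilities with a softmax; reward is measured
# relative to the perf score with no CSEs.
def action_probabilities(theta, candidate_features, stop_features):
    feats = np.vstack(candidate_features + [stop_features])
    prefs = feats @ theta                 # one preference per action
    z = prefs - prefs.max()               # stabilize the softmax
    e = np.exp(z)
    return e / e.sum()

def reward(perf_score, perf_score_no_cse):
    # Positive when the chosen CSEs beat doing no CSEs at all.
    return (perf_score_no_cse - perf_score) / perf_score_no_cse
```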

AndyAyersMS commented 7 months ago

For reasons I find hard to justify I decided my $V(s)$ and $Q(s,a)$ graph needed colormapping. This is from the exploration done by Policy Gradient. Here (yellow) doing all 3 CSEs is best, in any order.

Top number in each state is min perf score, bottom is the average. As the heuristic evolves one hopes the average approaches the min, as the fumbling initial steps are more or less forgotten (image below is after 100 runs).

Double-circle indicates what the current jit CSE heuristic does.

image

AndyAyersMS commented 7 months ago

Status update...

The basic Policy Gradient mechanism seems to be working correctly, though it is finicky enough that I am not 100% sure. The step size seems to be a fairly tricky aspect, and there are some novel (or at least novel-seeming) aspects to the problem space we're exploring that may complicate an assessment.

With the latest driver program (https://github.com/dotnet/jitutils/pull/391) we can take an SPMI collection, find the subset of the methods with CSE candidates, and then randomly or deterministically run over a subset of that subset, doing one or more of:

The screenshot below shows a sample run.

Here the driver is looking at a 5 method subset out of 35557 possible methods. It first runs MCMC on these 5; since their candidate numbers are small MCMC exhaustively tries all combinations, and for two of the methods finds better results than the current JIT heuristic.

The driver then tries to create a better policy via iterative updates of the policy parameters. Each row shows the average minibatch result for a method: blue means the last run of the minibatch matches the current JIT heuristic, green means it improves on it, and white or yellow means it is worse. After 25 round-robin minibatches it does a summary vs the current JIT and the best known result on the selected subset (best known from stochastic policy explorations), and then shows the result of using the current parameter set in a greedy policy across all 35557 methods. Here the driver thinks it has found, in aggregate, a roughly 0.2% better heuristic with just this small amount of work, but there are a substantial number of regressions too.

It turns out to be not too hard to beat the current heuristic (with similar numbers of better/worse) by just doing all CSEs, so this result is perhaps not all that impressive -- what would be impressive here is getting closer to the best results (which we only know for the 5 methods) with minimal regressions. Hence my focus on finding better ways to encourage it to stop early.

image

Some ideas on next steps:

AndyAyersMS commented 7 months ago

Have been studying more examples where Policy Gradient struggles. In one such case we have a CSE that is not live across a call in the flowgraph, but is live across a call in the LSRA ordering, so it leads to a spill.

It could be that the various "pressure" estimates are inherently flawed in this way, as there is no easy way to anticipate the ordering LSRA will choose and so no clear way to figure out the possible max conflict (pressure) points.

I have a feature now which roughly tries to assess the vulnerability of a CSE to shenanigans; currently this is the maximum reverse postorder number diff between any two appearances (def or use), with the idea that LSRA order might be something like this. I suppose I could add a check there for whether that range contains calls?

Currently we purge the DFS after RBO, but we could keep it viable and query it in CSE with the idea that it is still a decent estimate of what the DFS would be like. Or just run a new DFS (which we might do anyways, if we follow up on the ideas in https://github.com/dotnet/runtime/issues/97243#issuecomment-1928691033).

AndyAyersMS commented 7 months ago

Another interesting case. Here we hoist a NOT that is under an AND, not realizing that andn is possible. Not sure we can mark all similar kinds of subtrees as GTF_NO_CSE, and I wonder if we should be removing these and letting ML figure this out instead.

I suppose I can try and check if a CSE candidate is an ADD underneath an indir and mark those as needing extra scrutiny.

Also, we CSE and hoist rbx+0x18 but don't use it in the mov; presumably that is marked as an address mode. But even in the no-CSE case we redundantly compute this and decouple it from the cmpxchg. From what I can tell cmpxchg supports at least an offset, so this (in the before code) is a wasted register.

It might make sense for CSE to accumulate all these GTF_DONT_CSE uses, and if we decide to do the CSE for other reasons, perhaps do the no-CSE cases as well; otherwise we may just be creating more pressure, as whatever operands these address modes need have to stay live along with the CSE.


;; NO CSEs
G_M39425_IG03:  ;; offset=0x0015
       mov      eax, dword ptr [rbx+0x18]
       mov      dword ptr [rsp+0x24], eax
       lea      rcx, bword ptr [rbx+0x18]
       andn     edx, edi, eax
       or       edx, esi
       lock     
       cmpxchg  dword ptr [rcx], edx
       cmp      eax, dword ptr [rsp+0x24]
       jne      SHORT G_M39425_IG05

;; CSEs

       lea      rdi, bword ptr [rbx+0x18]
       mov      ebp, r8d
       not      ebp

G_M39425_IG03:  ;; offset=0x001C
       mov      eax, dword ptr [rbx+0x18]
       mov      dword ptr [rsp+0x2C], eax
       mov      ecx, ebp
       and      ecx, eax
       or       ecx, esi
       lock     
       cmpxchg  dword ptr [rdi], ecx
       cmp      eax, dword ptr [rsp+0x2C]
       jne      SHORT G_M39425_IG05
AndyAyersMS commented 7 months ago

At this point I'm mainly watching how the PG evolves and looking at methods that it doesn't handle well, trying to understand what about the candidates in those cases is different/unusual, to see if it's a mistaken feature value, a missing feature value, or whatnot.

An alternative idea is to use MCMC (or whatever) to broadly classify candidates as good or bad (since for the most part order does not seem to matter, so goodness/badness is not contextual, and features don't evolve much over the course of an episode). Good if they are part of the best CSE sequence, and bad if not. Then grab all these candidate features, and compute some sort of nearness/clustering analysis in the feature space (or build a classifier), to see if we can distinguish good or bad from the features; if we can't, then how can we expect a policy to distinguish them?

If we find a pair of candidates with similar features where one is good and the other is bad, study those cases to figure out what is going on.
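
A cheap way to run that check is a nearest-neighbor label-agreement measure over the candidate features, sketched below (illustrative only):

```python
import numpy as np

# Sketch: label each candidate good (appears in some best-scoring CSE set) or
# bad, then check whether nearest neighbors in feature space share labels.
# If they mostly don't, no policy over these features is likely to separate them.
def neighbor_agreement(features, labels):
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    agree = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude self
        agree += y[d.argmin()] == y[i]
    return agree / len(X)                  # fraction of candidates whose nearest
                                           # neighbor has the same good/bad label
```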

AndyAyersMS commented 7 months ago

Also intrigued by the ideas in "Simple random search provides a competitive approach to reinforcement learning", which operates in the parameter space directly. Might be an interesting alternative to implement, and it is similar to things I've tried in the past.
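
The basic version of that idea is easy to sketch: evaluate symmetric perturbations of the parameter vector and step along the reward differences. This is illustrative only; `evaluate` stands in for running the greedy policy over a method sample and returning an aggregate reward (eg a geomean perf-score ratio):

```python
import numpy as np

# Sketch of basic random search over the policy parameters: no gradients,
# just symmetric perturbations and a step in the direction that helped.
def random_search(evaluate, theta, step=0.02, noise=0.05, directions=8, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        deltas = rng.standard_normal((directions, theta.size))
        diffs = np.array([evaluate(theta + noise * d) - evaluate(theta - noise * d)
                          for d in deltas])
        theta = theta + step / directions * (diffs @ deltas)
    return theta
```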

AndyAyersMS commented 7 months ago

Also, for features: once we have decent/stable results we can try dropping features from the set to see how much the lack of a feature impairs overall results.

Might also search for correlated features and try to decorrelate them (eg Gram-Schmidt orthogonalization or similar). Say the value of feature A predicts the value of feature B pretty well; then it might be worth subtracting the predicted B (via A) from the actual B, so the reported features are A and B - (A's prediction of B).

The two ideas above are similar: if A and B are correlated, then having them both provides little additional benefit over having just one of them.
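
A lightweight way to do the decorrelation is to replace B with the residual after regressing B on A, sketched below (a stand-in for a full Gram-Schmidt pass):

```python
import numpy as np

# Sketch: report feature B as the part of B not explained by A, i.e.
# B - (A's least-squares prediction of B). With standardized features this is
# essentially one Gram-Schmidt step.
def decorrelate(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    A = np.column_stack([np.ones_like(a), a])       # intercept + A
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return b - A @ coef                             # residual feature
```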

AndyAyersMS commented 6 months ago

Have been looking into improving efficiency. The PolicyGradient driver has two phases -- a one-off phase where it runs method by method (using Parallel.For to launch multiple SPMI instances), and a batch phase where it processes them all (using SPMI's built-in parallelism).

;; method by method
 6231 methods in 45716ms
;; batch
35482 methods in 45099ms

So the method by method has ~6x overhead...

I have tried building all the stuff needed into a release JIT; it is (perhaps) nearly there, but that won't explain the above, as both versions are running checked JITs.

The batch mode is about what I'm used to on this box, roughly 1K methods/sec.

AndyAyersMS commented 6 months ago

Here's a trial that enables the new heuristic: https://github.com/dotnet/runtime/pull/98776

Have added some notes there on the behaviors.

JulieLeeMSFT commented 1 month ago

Closing, as the .NET 9 activities are done. We will revisit the work to do during .NET 10 planning.