dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Investigate improving JIT heuristics with machine learning #92915

Closed AndyAyersMS closed 1 month ago

AndyAyersMS commented 11 months ago

Overview

There is now fairly substantial literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.

Here's a closely relevant paper and GitHub repo: MLGO: a Machine Learning Guided Compiler Optimizations Framework. Like our efforts below, it leverages a Policy Gradient algorithm (reinforcement learning).

Several years ago (see https://github.com/AndyAyersMS/PerformanceExplorer/blob/master/notes/notes-aug-2016.md) we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict whether an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area and see if PGO was indeed the key missing ingredient.

Proposed Work

During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.

We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to try to refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):

- [ ] Study the current CSE heuristic and try to understand what it is modelling. Identify apparent shortcomings and limitations.
- [ ] Develop tooling for varying which CSEs the JIT can perform, either in concert with the current heuristic or independent of it. Measure the impact of this variation on key JIT metrics (performance, code size, JIT time, etc).
- [ ] See how well PerfScores can be used as a proxy for measuring actual performance. Ideally, we can leverage PerfScores to avoid needing to run benchmarks repeatedly. But if not, perhaps we can still rely on PerfScores to limit or prioritize the runs needed.
- [ ] Try a number of different ML techniques for improving CSE heuristics, both "small step" (evaluating one CSE at a time) and "big step" (evaluating an entire set of CSEs). Identify the key predictive observations. Try both interpretable and black box models.
- [ ] See how feasible it is to have a single architecture-agnostic heuristic (perhaps parameterized like the current heuristic with register info) or whether different heuristics are warranted for, say, arm64 and x64.
- [ ] Develop a common ML "infrastructure" that we can apply to other similar heuristics in the JIT.

Update May 2024.

Given the work above we have been able to produce heuristics that improve the aggregate perf score for methods via CSE, with about a 0.4% geomean improvement across all methods with CSE candidates. So far those results haven't translated into widespread improvements in our perf lab data. Exactly why is unclear; the data below suggests one or more contributing factors.

The training evaluations show that the best ML heuristic is obtaining about half of what could be gotten from an optimal heuristic. Some recent results:

Best/base: 0.9913 [optimizing for  score]
vs Base    0.9957 Better 440 Same 1476 Worse 84
vs Best    1.0044 Better 0 Same 1607 Worse 393

Params     0.6093, 0.7750,-0.5701,-0.6827, 0.5060,-0.0514,-1.3563, 0.4515,-2.0999, 0.0000,-1.0623,-1.3723, 0.0000,-0.6143,-0.8286,-0.2956,-1.1433, 0.3418, 1.3964,-0.0043,-0.4237, 0.6097,-1.9773,-0.3684, 0.7246

Collecting greedy policy data via SPMI... done (27583 ms)
Greedy/Base (score): 34965 methods, 8375 better, 24792 same, 1797 worse,  0.9960 geomean
Best:   76679 @  0.7731
Worst:  82000 @  1.4512

Greedy/Base (size): 34965 methods, 4628 better, 24694 same, 5642 worse,  1.0005 geomean
Best:   15489 @  0.7000
Worst:  82000 @  1.4205

We don't have any more active work planned on machine learning of heuristics for .NET 9. However, we expect to revisit this area as .NET 9 work winds down, so stay tuned for further developments in the coming months.

ghost commented 11 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.

Issue Details
Author: AndyAyersMS
Assignees: AndyAyersMS
Labels: `area-CodeGen-coreclr`
Milestone: 9.0.0
AndyAyersMS commented 11 months ago

Initial Notes on the Current JIT CSE Heuristic

  1. Does the initial frame size computation hold up well? Can we justify the 3/2/1 formula in CSEHeuristic::Initialize?
     a. Disable CSE and see how well the number of tracked locals in registers matches up with the prediction. If not, see if there is some way to make a better prediction.
     b. Then slowly enable CSE, and see if this continues to hold up well.
     c. See how well frame size matches the estimates; if not good, see if we can improve this somehow.

  2. Do the aggressive/moderate thresholds make sense? That is, should they be higher or lower? Based on something else?
     a. Trickier to see how to model this; perhaps keep processing CSEs and see when spilling starts (that is, do a local sensitivity analysis)? This might be counterbalanced by the promotion code.
     b. How about the overall approach of setting these thresholds based on IR observations? Seems sensible, but maybe there are other ideas to try.
     c. At the end of CSEHeuristic::Initialize we increase the thresholds if they seem too low. Why?
     d. Also, doesn't that increase mix up weighted and unweighted cases? How does it make sense to compare unweighted ref counts with BB_UNITY_WEIGHT?
     e. Should we try some "true" register pressure estimate, or is that hopeless? We also don't know how the pressure points intersect with potential CSE temp liveness.

  3. Is the assumption here that FP CSEs are rare true?

  4. How reliable are CostSz and CostEx?
     a. Probably a can of worms... how would we even go about validating these?

  5. Should candidates be ranked by local benefit, by global benefit, or by some ratio of benefit / cost?
     a. That is, why doesn't Compiler::optCSEcostCmpEx consider csdUseWtCnt more prominently? The right figure of merit (when optimizing for speed) seems to be csdUseWtCnt * tree->CostEx(), so a relatively "cheap" CSE that is heavily used is preferable to a "costly" CSE that is not as heavily used.
     b. Should this factor in code size even in speed modes? Assuming CSEs generally decrease size a bit, we might want to bias the above towards the heavier ones slightly.
     c. Do the threshold increases in PerformCSE make sense? This gives a budget-like aspect to the ranking, so the order does matter. What if we ignored this?

  6. Are we missing candidates?
     a. Review bail-outs in Compiler::optIsCSEcandidate.
     b. Review limitations in Compiler::optValnumCSE_Locate.

  7. In ConsiderCandidates, is the computation sensible? We have 3 models x 2 modes ({aggressive/moderate/conservative} x {size, speed}), plus some stress modes (including random...).
     a. Seems like we mix weighted/unweighted costs up here at times, eg extra_no_cost is unweighted (size) based, but we add it to no_cse_cost, which can be weighted (speed) based.
     b. Various magic numbers and twisty decision logic.
     c. Various per-arch costs, some of them look dated.

  8. Is the "shared constant" CSE code sensible? Here we disregard the low bits of constants and at use points we rematerialize the missing bits via an add. a. So use cost is a bit higher... but this does not seem to be modelled by the heuristics ? b. Is the "best base" value computation reasonable? Seems like we're trying to use small immediates here?

AndyAyersMS commented 11 months ago

Perf vs Perf Score

Here's a chart of about 3000 measurements across 50 or so benchmarks, varying the set of CSEs we do in the hottest method. Each data point plots the ratio of measured perf to the default JIT's perf (y-axis) against the ratio of perf score to the default JIT's perf score (x-axis). Thus values above 1.0 on the y-axis are benchmarks that ran more slowly, and values to the right of 1.0 on the x-axis are benchmarks that were predicted to run more slowly.

We see most data in the "slower and predicted slower" (upper right) quadrant because CSEs generally improve perf, so doing fewer of them than we'd normally do will likely slow down the benchmark.

Linear fit has a correlation coefficient of about 0.3, so there is a fair amount of unexplained variation in perf. Also I have excluded one benchmark's data from this as it seems anomalous.

image

Ideally, we'd be able to get a better correlation out of perf scores, since these can be produced very quickly as we vary JIT behavior, on the order of 1000 experiments per second. If we actually run benchmarks then it takes around 30 seconds for one experiment, so there is a massive benefit to leveraging perf scores.
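
For concreteness, the kind of processing behind the chart above looks roughly like the sketch below (Python; the row and field names are illustrative, not the actual harness):

```python
# Illustrative sketch: given per-experiment rows of (benchmark, cse_mask,
# perf_us, perf_score) plus a baseline (default JIT) row per benchmark,
# compute the two ratios charted above and a linear fit / correlation.
import numpy as np

def ratios(rows, baselines):
    """rows: list of dicts; baselines: dict benchmark -> (perf_us, perf_score)."""
    xs, ys = [], []
    for r in rows:
        base_perf, base_score = baselines[r["benchmark"]]
        xs.append(r["perf_score"] / base_score)  # predicted slowdown (x-axis)
        ys.append(r["perf_us"] / base_perf)      # measured slowdown (y-axis)
    return np.array(xs), np.array(ys)

def fit(xs, ys):
    slope, intercept = np.polyfit(xs, ys, 1)
    r = np.corrcoef(xs, ys)[0, 1]                # ~0.3 in the data described above
    return slope, intercept, r
```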

The main contributors to poor perf predictability are thought to be (in no particular order):

So the task here is to dig into these issues and see if and how perf scores can be improved.

Also I need to look into why MaximizeSchwarzCriterion has anomalous results; including data from this benchmark drops R^2 to essentially zero.

image

ShreyasJejurkar commented 11 months ago

This is a great initiative; just wondering, do other popular (JIT-based) programming languages (e.g., Java) have something like this in place? Would love to read their outcome. :)

AndyAyersMS commented 11 months ago

This is a great initiative; just wondering, do other popular (JIT-based) programming languages (e.g., Java) have something like this in place? Would love to read their outcome. :)

I think I may have read some papers by the Graal folks where they leveraged ML, but I don't have the references handy. Presumably they'd be somewhere in the big list of papers I referred to above.

AndyAyersMS commented 11 months ago

Have been digging into why the PerfScores have such poor correlation on MaximizeSchwarzCriterion and it seems like it is a combination of a few effects:

Thus the block weights (in particular the large weights for innermost loop blocks) can vary quite a bit from one run to the next, and that seems to be the major factor. You can see this by just running the benchmark multiple times with the same jit configuration:

Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,649125874.85,1.000,59311.6714,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,80505239.17,0.124,61848.4732,1.043
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1532,421090857.67,0.649,65608.5883,1.106
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,1307172767.70,2.014,59236.0308,0.999
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1584,108418141.52,0.167,65119.4767,1.098
Benchmark,Index,Mask,NumCse,CodeSize,PerfScore,PerfScoreRatio,Perf,PerfRatio
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,0,ffffffff,17,1643,100478799.14,1.000,61020.7827,1.000
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,1,00000000,0,1533,69888793.36,0.696,63267.6950,1.037
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,2,00000001,1,1530,69621450.93,0.693,66241.6050,1.086
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,3,00000002,1,1528,108183513.60,1.077,58789.5833,0.963
JetStream.TimeSeriesSegmentation.MaximizeSchwarzCriterion,4,00000004,1,1582,70447641.30,0.701,64906.9033,1.064

In particular the first line of each set above shows two "identical" runs with default JIT behavior: the first took 59,311us with PerfScore 649,125,874.85 and the second 61,020us with PerfScore 100,478,799.14. So perf varied by about 1.03x and PerfScore by about 6x.

For the purposes of this investigation, we can just exclude methods that have OSR methods.

This variation in weights is possibly less of a problem in practice though it may be a contributor to the kind of fluctuating performance we noted in #87324. We should think about better strategies to counteract it:

MichalPetryka commented 11 months ago
  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2R'd code?

AndyAyersMS commented 11 months ago
  • Running a Tier-1 instr stage before we go to full on Tier-1, for methods that OSR'd, so that the root level blocks get a more accurate weight?

Isn't this already the case for R2R'd code?

Yes -- the idea there was to avoid needing to go from R2R (optimized) -> Tier0 (unoptimized) to gather PGO data. Here the idea would be to go from Tier0+instr -> (OSR) -> Tier1+instr -> Tier1, for methods that are called infrequently (but more than once) and contain hot loops. The second instrumented run is there to fill out the missing bits of profile in the parts of the method that currently only run under OSR.

AndyAyersMS commented 8 months ago

Some more thoughts on applying ML to the CSE heuristic.

My current thinking is to approach this as a model-based reinforcement learning application. We think we know which CSEs are good ones (our initial model), we don't have any repository of known good results (aka labels), and we may incur significant costs in obtaining data. Our goal is to improve on the current model or find a better one.

For the initial cut we can just rely on perf scores (flawed as they may be) as the overall approach won't be that impacted; if and when we have it working well, we can think about mixing in actual measurements. Perf scores (if viable) offer a lot of advantages including the ability to assess results on methods that are not prominent in any benchmarks.

As a prelude to all that I want to elaborate the current model and figure out the right way to parameterize it. Currently CSE candidates are ranked by a sort function, and then attempted in rank order. As CSEs are performed the acceptance thresholds ratchet upwards so there is an order dependence. So we might hope that the reinforcement-learned model could be expressed as some numeric score with say higher values indicating higher preferences and lower values lower ones, and perhaps zero or negative values indicating CSEs that should not be done.

The ratcheting is an attempt to model register pressure. The thresholds are loosely set based upon properties of the weighted local var counts (here I am ignoring for now the "optimize for size" mode), eg the aggressive cutoff is the weight of the 13th highest ranked local var (pre-CSE). More broadly one can imagine that the model would have access to general properties of the lcl var weight distribution, expressed as a sort of CDF, eg we would like to know roughly how many lcl vars do we have to consider to reach say 90% of all (weighted) local var occurrences.
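
As a sketch of the kind of distribution summary meant here (illustrative only; the function and parameter names are made up, and the real cutoff logic lives in the JIT's CSE heuristic):

```python
# Sketch: summarize the weighted local-var ref count distribution for a method.
# 'weights' would be the weighted ref counts of tracked locals, pre-CSE.
def summarize_lclvar_weights(weights, aggressive_rank=13, coverage=0.90):
    w = sorted(weights, reverse=True)
    total = sum(w)
    # Cutoff in the spirit of the current heuristic: weight of the Nth-ranked local.
    aggressive_cutoff = w[aggressive_rank - 1] if len(w) >= aggressive_rank else 0.0
    # CDF-style question: how many locals does it take to cover 90% of all
    # weighted occurrences?
    running, n_to_cover = 0.0, len(w)
    for i, x in enumerate(w, 1):
        running += x
        if total > 0 and running >= coverage * total:
            n_to_cover = i
            break
    return aggressive_cutoff, n_to_cover
```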

Broadly speaking, a particular CSE will add a new local var whose weight we can compute (from the def and use weights) and remove weight from some of the existing locals. The current heuristic roughly estimates the first (via the ratchet) but does not take into account the second. Thus, for instance, if we CSEd a large computation tree to effect a hoist within a loop, we might actually expect the overall register pressure to decrease. Thinking of this all as a sort of Markov Decision Process, as we do one CSE the viability of the others will shift.

If we want the model to handle this sort of order dependence, then after each CSE we need to re-evaluate the scores of the remaining candidates.
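
The shape of that re-evaluation loop would be something like the sketch below (the candidate representation and scoring function are placeholders, not the JIT's actual data structures):

```python
# Sketch: apply CSEs one at a time, re-scoring the survivors after each step,
# and stop when nothing scores above zero. 'score' is the learned preference;
# 'apply_cse' mutates the method state / features that 'score' observes.
def apply_cses_greedily(candidates, score, apply_cse):
    done = []
    remaining = list(candidates)
    while remaining:
        scored = [(score(c, done), c) for c in remaining]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= 0:          # zero/negative preference: stop
            break
        apply_cse(best)
        done.append(best)
        remaining.remove(best)       # remaining candidates get re-scored next pass
    return done
```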

I will note along these lines that the current CSE heuristic has some quirks that seem interesting. For example, in a simple case with just two candidates A & B, if we order them A, B we do both CSEs; if we order them B, A, we do just B, because B is call-crossing and has a higher ratchet and makes A look unattractive. This struck me as odd.

So, factors that might feed into the model:

Some preliminary sub problems to examine:

AndyAyersMS commented 8 months ago

Here is some data on the current heuristic, from the asp.net SPMI collection [full table here]

115838 methods, 34802 methods with cses; 193807 cse candidates, 56973 cses
NumCandiates [count, odds]: ... details ...
 1 [n: 10259 o: 0.53]:  0: 4836;  1: 5423
 2 [n:  4913 o: 0.44]:  0: 1760;  1: 2028;  2: 1125
 3 [n:  4058 o: 0.36]:  0: 1351;  1: 1329;  2: 1031;  3:  347
 4 [n:  3454 o: 0.37]:  0:  788;  1: 1100;  2:  860;  3:  570;  4:  136
 5 [n:  2576 o: 0.35]:  0:  586;  1:  497;  2:  792;  3:  470;  4:  175;  5:   56
 6 [n:  1627 o: 0.31]:  0:  264;  1:  452;  2:  474;  3:  182;  4:  227;  5:   21;  6:    7
 ...

I was surprised to see such low odds for accepting candidates, on average only about 0.25. So there is certainly headroom to be more aggressive.

For these methods with small numbers of candidates it should be feasible to exhaustively enumerate all the possible sets of CSEs and look at the resulting spectrum of perf scores, and chart out how well the existing heuristic performs (as measured by perf score) vs the best score. For larger sets we can at least attempt to gather a decent sample of data.

Working on that now.

AndyAyersMS commented 8 months ago

Here is some initial data on the performance of the current heuristic.

Using the asp.net SPMI collection I selected 100 methods from each of the categories above, and then ran $2^n + 1$ experiments: the baseline run, plus jitting once for each possible subset of CSEs (say there were 5 CSE candidates; this is 33 runs). I then looked at how often the default heuristic got the best possible perf score, and at the geomean ratio of the default perf score to the best across those 100 runs.
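
Per method, the evaluation amounts to something like the following sketch, where `perf_score_for_mask` stands in for an SPMI replay of the method with a given CSE mask:

```python
from math import exp, log

# Sketch: for a method with n CSE candidates, try every mask and compare the
# default heuristic's perf score against the best achievable one.
def default_vs_best(n, perf_score_for_mask, default_score):
    best = min(perf_score_for_mask(mask) for mask in range(1 << n))  # 2^n runs
    return default_score / best   # 1.0 means the default picked an optimal set

def geomean(ratios):
    # Aggregate the per-method ratios into the geomean figures reported below.
    return exp(sum(log(r) for r in ratios) / len(ratios))
```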

The preliminary data looks like this:

  --- 1 candidate cases: default heuristic was optimal in 90 of 100 cases; geomean 1.003
  --- 2 candidate cases: default heuristic was optimal in 81 of 100 cases; geomean 1.006
  --- 3 candidate cases: default heuristic was optimal in 65 of 100 cases; geomean 1.008
  --- 4 candidate cases: default heuristic was optimal in 62 of 100 cases; geomean 1.006
  --- 5 candidate cases: default heuristic was optimal in 50 of 100 cases; geomean 1.012
  --- 6 candidate cases: default heuristic was optimal in 49 of 100 cases; geomean 1.006

So the rough trend is that as the number of candidates increases, the likelihood of the default heuristic getting the best possible perf score decreases, and in aggregate the default heuristic falls short of the best perf score by around 0.6% or so.

This data took a few minutes to gather; I'll allow it to run on more cases overnight.

I have some rough edges to polish here, and to explore some of the even larger candidate cases I'll need some kind of Monte Carlo strategy, as the exponential number of runs starts becoming a factor.

AndyAyersMS commented 8 months ago

Here is the "complete" set of results from asp.net, up through cases with 11 cse candidates. My current approach is going to take to long to get data for higher numbers of candidates, eg for an 11 candidate case we need 2^11 runs of SPMI over the method context varying the CSE mask, so the entire set of 526 case required over 1 million runs.

  ---  1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005
  ---  2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007
  ---  3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006
  ---  4 candidate cases: baseline heuristic was optimal in 2087 of  3454 cases 60.42%; geomean 1.005
  ---  5 candidate cases: baseline heuristic was optimal in 1382 of  2576 cases 53.65%; geomean 1.010
  ---  6 candidate cases: baseline heuristic was optimal in  825 of  1627 cases 50.71%; geomean 1.006
  ---  7 candidate cases: baseline heuristic was optimal in  492 of   999 cases 49.25%; geomean 1.006
  ---  8 candidate cases: baseline heuristic was optimal in  522 of   979 cases 53.32%; geomean 1.004
  ---  9 candidate cases: baseline heuristic was optimal in  357 of   711 cases 50.21%; geomean 1.004
  --- 10 candidate cases: baseline heuristic was optimal in  278 of   637 cases 43.64%; geomean 1.004
  --- 11 candidate cases: baseline heuristic was optimal in  236 of   526 cases 44.87%; geomean 1.004

Next steps from here:

  1. Consider making the above faster. We should be able to use a release jit (if we make perf score and other metrics available) and run the various cases concurrently. That would speed up the data gathering by maybe 20x. We could also fan out to multiple machines...
  2. In some of the runs above the default heuristic is able to get a better perf score than any of the masked variants. Figure out why.
  3. In some of the runs above the SPMI replay fails with some CSE subsets. The ones I looked at were missing helper call infos for write barrier helpers. See if we can perhaps extend the SPMI collection to make replay more robust.
  4. I have been assuming there is no order dependence in the masked CSE replays. Say there are 3 candidates: the above process will do 9 runs: default, {}, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}. But if there is order dependence, we should also be running the various permutations: {2, 1}, {3, 1}, {3, 2}, {1, 3, 2}, {2, 1, 3}, {2, 3, 1}, {3, 1, 2}, {3, 2, 1}. We might need to do this anyways since we will likely be estimating the reward (benefit) of each CSE via some sort of "temporal difference" scheme. At any rate we should validate the order dependence or independence (order dependence may explain point 2 above).
  5. Even if there is no order dependence in results there might be one in modelling. Consider the case of a nested CSE. If 2 nests within 1, then we get slightly different IR if we do {1, 2} vs {2, 1} -- abstractly this should not matter for codegen but in practice it might. But when looking at rewards, the reward for 2 will be diminished if we do 1 first, and the reward for 1 will be diminished if we do 2 first (and it's possible that 2 becomes non-viable after 1, if all its uses are subsumed by uses of 1).
  6. Assuming we're happy with the data gathered above, the next step is to figure out how to make use of it to craft a policy. The rough thinking is to come up with a set of observations that we think are relevant (we already have some) and build some sort of parameterized evaluation function. We then redo the runs above, record the metrics at each step, and use a reinforcement learning algorithm to gradually modify the evaluation function so that it is increasingly able to pick the optimal sequences of CSEs.
  7. We should employ good hygiene here and not over-fit, so doing some kind of 5- or 10-fold cross validation seems sensible. That means running the above 10 times.
  8. We should also try other collections / architectures / OSes to see how well that aspect generalizes.
  9. All of this is "perf score optimal" and may not reflect reality, so we still need to sanity check with actual results somehow.
AndyAyersMS commented 8 months ago

I figured out what was leading to (2) above -- I'd forgotten that the config settings get parsed as hex, so for the 4+ candidate cases the tool was not properly exploring the full subset of possibilities. With that fixed I don't see any cases where the default heuristic can do better than a subset.

Updated data shows that for the 4+ candidate cases the heuristic is now a bit worse (not surprising, as some of those missing subsets were going to be good ones) and the upside opportunity is a bit larger (also not surprising). I've also added numbers on how many SPMI runs are required; note that as the candidate counts increase the evaluation takes exponentially more runs, but the number of such cases falls off, so the overall growth is a bit milder...

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)
AndyAyersMS commented 8 months ago

Fixed some other issues; latest full run data

  --- 1 candidate cases: baseline heuristic was optimal in 8899 of 10259 cases 86.74%; geomean 1.005 (30777 runs)
  --- 2 candidate cases: baseline heuristic was optimal in 3782 of  4913 cases 76.98%; geomean 1.007 (24565 runs)
  --- 3 candidate cases: baseline heuristic was optimal in 2757 of  4058 cases 67.94%; geomean 1.006 (36522 runs)
  --- 4 candidate cases: baseline heuristic was optimal in 2030 of  3454 cases 58.77%; geomean 1.006 (58718 runs)
  --- 5 candidate cases: baseline heuristic was optimal in 1311 of  2576 cases 50.89%; geomean 1.012 (85008 runs)
  --- 6 candidate cases: baseline heuristic was optimal in  734 of  1627 cases 45.11%; geomean 1.009 (105755 runs)
  --- 7 candidate cases: baseline heuristic was optimal in  415 of   999 cases 41.54%; geomean 1.008 (128871 runs)
  --- 8 candidate cases: baseline heuristic was optimal in  454 of   979 cases 46.37%; geomean 1.009 (251603 runs)
AndyAyersMS commented 8 months ago

I have a version of MCMC running for larger instances, and have verified that it matches up pretty well to exhaustive enumeration in the medium-sized cases. The trend noted above continues:
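
The sampling side of this can be as simple as the sketch below (illustrative only, not the actual driver):

```python
import random

# Sketch: when 2^n runs is too many, sample random CSE masks instead of
# enumerating them all; always include "no CSEs" and "all CSEs" so those
# endpoints are covered.
def sample_masks(n, budget, seed=0):
    if budget >= (1 << n):
        return list(range(1 << n))          # small case: just enumerate
    rng = random.Random(seed)
    masks = {0, (1 << n) - 1}
    while len(masks) < budget:
        masks.add(rng.getrandbits(n))
    return sorted(masks)
```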

  ---  6 candidate cases: baseline heuristic was optimal in 734 of 1627 cases 45.11%; geomean 1.009 baseline win from CSE 0.960 max win 0.951 (105755 runs in    775,494ms)
  ---  7 candidate cases: baseline heuristic was optimal in 415 of  999 cases 41.54%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (128871 runs in  1,742,522ms)
  ---  8 candidate cases: baseline heuristic was optimal in 451 of  979 cases 46.07%; geomean 1.009 baseline win from CSE 0.958 max win 0.949 (251603 runs in  3,634,848ms)
  ---  9 candidate cases: baseline heuristic was optimal in 278 of  711 cases 39.10%; geomean 1.008 baseline win from CSE 0.956 max win 0.949 (182727 runs in  5,070,407ms)
  --- 10 candidate cases: baseline heuristic was optimal in 183 of  637 cases 28.73%; geomean 1.008 baseline win from CSE 0.959 max win 0.951 (163709 runs in  6,406,325ms)
  --- 11 candidate cases: baseline heuristic was optimal in 142 of  526 cases 27.00%; geomean 1.008 baseline win from CSE 0.957 max win 0.949 (135182 runs in  7,543,211ms)
  --- 12 candidate cases: baseline heuristic was optimal in 231 of  535 cases 43.18%; geomean 1.009 baseline win from CSE 0.966 max win 0.957 (137495 runs in  8,664,782ms)
  --- 13 candidate cases: baseline heuristic was optimal in  86 of  342 cases 25.15%; geomean 1.009 baseline win from CSE 0.962 max win 0.953 (87894  runs in  9,376,476ms)
  --- 14 candidate cases: baseline heuristic was optimal in  63 of  265 cases 23.77%; geomean 1.009 baseline win from CSE 0.943 max win 0.935 (68105  runs in  9,937,161ms)
  --- 15 candidate cases: baseline heuristic was optimal in  51 of  237 cases 21.52%; geomean 1.009 baseline win from CSE 0.959 max win 0.951 (60909  runs in 10,473,161ms)
  --- 16 candidate cases: baseline heuristic was optimal in  75 of  262 cases 28.63%; geomean 1.007 baseline win from CSE 0.967 max win 0.960 (67334  runs in 11,120,183ms)
  --- 17 candidate cases: baseline heuristic was optimal in  72 of  246 cases 29.27%; geomean 1.007 baseline win from CSE 0.976 max win 0.969 (63222  runs in 11,702,941ms)
  --- 18 candidate cases: baseline heuristic was optimal in  34 of  196 cases 17.35%; geomean 1.007 baseline win from CSE 0.959 max win 0.952 (50372  runs in 12,185,341ms)
  --- 19 candidate cases: baseline heuristic was optimal in  29 of  198 cases 14.65%; geomean 1.010 baseline win from CSE 0.968 max win 0.959 (50886  runs in 12,697,428ms)
  --- 20 candidate cases: baseline heuristic was optimal in  24 of  172 cases 13.95%; geomean 1.010 baseline win from CSE 0.961 max win 0.951 (44204  runs in 13,134,515ms)
  --- 21 candidate cases: baseline heuristic was optimal in   6 of  110 cases  5.45%; geomean 1.010 baseline win from CSE 0.953 max win 0.944 (28270  runs in 13,445,214ms)
  --- 22 candidate cases: baseline heuristic was optimal in  13 of   97 cases 13.40%; geomean 1.006 baseline win from CSE 0.955 max win 0.949 (24929  runs in 13,706,788ms)
  --- 23 candidate cases: baseline heuristic was optimal in   4 of   92 cases  4.35%; geomean 1.011 baseline win from CSE 0.966 max win 0.955 (23644  runs in 14,004,816ms)
  --- 24 candidate cases: baseline heuristic was optimal in   5 of  119 cases  4.20%; geomean 1.013 baseline win from CSE 0.967 max win 0.955 (30583  runs in 14,316,467ms)
  --- 25 candidate cases: baseline heuristic was optimal in   7 of   63 cases 11.11%; geomean 1.011 baseline win from CSE 0.967 max win 0.957 (16191  runs in 14,510,376ms)
  --- 26 candidate cases: baseline heuristic was optimal in   2 of   97 cases  2.06%; geomean 1.010 baseline win from CSE 0.974 max win 0.964 (24929  runs in 14,789,976ms)
  --- 27 candidate cases: baseline heuristic was optimal in   1 of   74 cases  1.35%; geomean 1.022 baseline win from CSE 0.954 max win 0.934 (19018  runs in 15,023,453ms)
  --- 28 candidate cases: baseline heuristic was optimal in   7 of   73 cases  9.59%; geomean 1.015 baseline win from CSE 0.974 max win 0.960 (18761  runs in 15,263,888ms)
  --- 29 candidate cases: baseline heuristic was optimal in   1 of   69 cases  1.45%; geomean 1.009 baseline win from CSE 0.975 max win 0.966 (17733  runs in 15,469,362ms)
  --- 30 candidate cases: baseline heuristic was optimal in  17 of   66 cases 25.76%; geomean 1.011 baseline win from CSE 0.945 max win 0.934 (16962  runs in 15,683,842ms)

Using the above I can now generate "perf score"-optimal results for cases with small numbers of CSE candidates, and can elaborate the entire value function $V(s, a)$ either explicitly or graphically... eg

image

Here green shows the optimal policy, dog-eared nodes show the current policy, and square boxes are terminal states.

As noted in (4) above there doesn't seem to be order dependence in the values; if so, that greatly cuts down the size of the state space. If there is order dependence then the state space grows faster than $n!$ -- the actual growth is https://oeis.org/A000522. If there is no order dependence then it's just $2^n$. We still need MCMC for larger cases, but can cover a much greater fraction.
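
To make the growth concrete, here's a small sketch computing the ordered-sequence count (OEIS A000522) next to the unordered subset count $2^n$:

```python
from math import perm

def ordered_states(n):
    # Number of distinct ordered CSE sequences (of any length) drawn from n
    # candidates: sum over k of P(n, k) -- this is OEIS A000522.
    return sum(perm(n, k) for k in range(n + 1))

for n in range(1, 8):
    print(n, ordered_states(n), 2 ** n)
# e.g. n=3 -> 16 ordered sequences vs 8 unordered subsets;
#      n=7 -> 13700 vs 128, so order-independence is a big win.
```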

So I plan to screen all this data to see if I ever see a case where the order of CSEs matters for final perf score. If it never happens or is suitably rare, we'll just ignore it.


The data above can't directly be used to craft a policy, since the state is a specific method instance plus a sequence of CSEs. In reality the jit won't have already seen a given method... so we need to generalize the state and try building a policy that way.

My initial plan is to use a policy gradient method. We know some of the factors that should influence CSEs; these are already considered by the current policy, though perhaps not in optimal form:

So we can start with these. Since there is a large dynamic range for weighted costs, I am tempted to use a log to keep the data a bit more reasonable. We also need to assign some preference to stopping the process, so there is one extra action to model. The initial model will likely be linear, so the preference for an action (CSE) will be the dot product of the observation vector based on that CSE and the ambient state, and a parameter vector that is all numeric and provides weights for each observation. This can be generalized to include nonlinear functions of observations (or combinations of observations) if necessary.

For the reward we can use absolute perf scores, but these don't generalize well across methods, so I am tempted to use the perf score relative to doing no CSEs (or it could also be the score relative to the best known score).
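
Putting those pieces together, the policy and reward would look roughly like the sketch below (feature extraction and the update schedule are placeholders, not the actual implementation):

```python
import numpy as np

# Sketch: linear preferences over candidate features plus a synthetic "stop"
# action, turned into action probabilities with a softmax; reward is measured
# relative to the perf score with no CSEs.
def action_probabilities(theta, candidate_features, stop_features):
    feats = np.vstack(candidate_features + [stop_features])
    prefs = feats @ theta                 # one preference per action
    z = prefs - prefs.max()               # stabilize the softmax
    e = np.exp(z)
    return e / e.sum()

def reward(perf_score, perf_score_no_cse):
    # Positive when the chosen CSEs beat doing no CSEs at all.
    return (perf_score_no_cse - perf_score) / perf_score_no_cse
```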

AndyAyersMS commented 7 months ago

For reasons I find hard to justify I decided my $V(s)$ and $Q(s,a)$ graph needed colormapping. This is from the exploration done by Policy Gradient. Here (yellow) doing all 3 CSEs is best, in any order.

Top number in each state is min perf score, bottom is the average. As the heuristic evolves one hopes the average approaches the min, as the fumbling initial steps are more or less forgotten (image below is after 100 runs).

Double-circle indicates what the current jit CSE heuristic does.

image

AndyAyersMS commented 7 months ago

Status update...

The basic Policy Gradient mechanism seems to be working correctly, though it is finicky enough that I am not 100% sure. The step size seems to be a fairly tricky aspect, and there are some novel (or at least novel-seeming) aspects to the problem space we're exploring that may complicate an assessment.

With the latest driver program (https://github.com/dotnet/jitutils/pull/391) we can take an SPMI collection, find the subset of the methods with CSE candidates, and then randomly or deterministically run over a subset of that subset, doing one or more of:

The screenshot below shows a sample run.

Here the driver is looking at a 5 method subset out of 35557 possible methods. It first runs MCMC on these 5; since their candidate numbers are small MCMC exhaustively tries all combinations, and for two of the methods finds better results than the current JIT heuristic.

The driver then tries to create a better policy via iterative updates of the policy parameters. Each row shows the average minibatch result for a method: blue means the last run of the minibatch matches the current JIT heuristic, green means it improves on it, and white or yellow means it is worse. After 25 round-robin minibatches it does a summary vs the current JIT and the best known result on the selected subset (best known from stochastic policy explorations), and then shows the result of using the current parameter set in a greedy policy across all 35557 methods. Here the driver thinks it has found, in aggregate, a roughly 0.2% better heuristic with just this small amount of work, but there are a substantial number of regressions too.

It turns out to be not too hard to beat the current heuristic (with similar numbers of better/worse) by just doing all CSEs, so this result is perhaps not all that impressive -- what would be impressive here is getting closer to the best results (which we only know for the 5 methods) with minimal regressions. Hence my focus on finding better ways to encourage it to stop early.

image

Some ideas on next steps:

AndyAyersMS commented 7 months ago

Have been studying more examples where Policy Gradient struggles. In one such case we have a CSE that is not live across a call in the flowgraph, but is live across a call in the LSRA ordering, so it leads to a spill.

It could be that the various "pressure" estimates are inherently flawed in this way, as there is no easy way to anticipate the ordering LSRA will choose and so no clear way to figure out the possible max conflict (pressure) points.

I have a feature now which roughly tries to assess the vulnerability of a CSE to shenanigans; currently this is the maximum reverse postorder number diff between any two appearances (def or use), with the idea that LSRA order might be something like this. I suppose I could add a check there for whether that range contains calls?

Currently we purge the DFS after RBO, but we could keep it viable and query it in CSE with the idea that it is still a decent estimate of what the DFS would be like. Or just run a new DFS (which we might do anyways, if we follow up on the ideas in https://github.com/dotnet/runtime/issues/97243#issuecomment-1928691033).

AndyAyersMS commented 7 months ago

Another interesting case. Here we hoist a NOT that is under an AND, not realizing that andn is possible. Not sure we can mark all similar kinds of subtrees as GTF_NO_CSE, and I wonder if we should be removing these and letting ML figure this out instead.

I suppose I can try and check if a CSE candidate is an ADD underneath an indir and mark those as needing extra scrutiny.

Also, we CSE and hoist rbx+0x18 but don't use it in the mov; presumably that is marked as an address mode. But even in the no-CSE case we redundantly compute this and decouple it from the cmpxchg. From what I can tell cmpxchg supports at least an offset, so this (in the before code) is a wasted register.

It might make sense for CSE to accumulate all these GTF_DONT_CSE uses, and if we decide to do the CSE for other reasons, perhaps do the no-CSE cases as well; otherwise we may just be creating more pressure, as whatever operands these address modes need have to stay live along with the CSE.


;; NO CSEs
G_M39425_IG03:  ;; offset=0x0015
       mov      eax, dword ptr [rbx+0x18]
       mov      dword ptr [rsp+0x24], eax
       lea      rcx, bword ptr [rbx+0x18]
       andn     edx, edi, eax
       or       edx, esi
       lock     
       cmpxchg  dword ptr [rcx], edx
       cmp      eax, dword ptr [rsp+0x24]
       jne      SHORT G_M39425_IG05

;; CSEs

       lea      rdi, bword ptr [rbx+0x18]
       mov      ebp, r8d
       not      ebp

G_M39425_IG03:  ;; offset=0x001C
       mov      eax, dword ptr [rbx+0x18]
       mov      dword ptr [rsp+0x2C], eax
       mov      ecx, ebp
       and      ecx, eax
       or       ecx, esi
       lock     
       cmpxchg  dword ptr [rdi], ecx
       cmp      eax, dword ptr [rsp+0x2C]
       jne      SHORT G_M39425_IG05
AndyAyersMS commented 7 months ago

At this point I'm mainly watching how the PG evolves and looking at methods that it doesn't handle well, trying to understand what about the candidates in those cases is different/unusual, to see if it's a mistaken feature value, a missing feature value, or whatnot.

An alternative idea is to use MCMC (or whatever) to broadly classify candidates as good or bad (since for the most part order does not seem to matter, so goodness/badness is not contextual, and features don't evolve much over the course of an episode). Good if they are part of the best CSE sequence, and bad if not. Then grab all these candidate features, and compute some sort of nearness/clustering analysis in the feature space (or build a classifier), to see if we can distinguish good or bad from the features; if we can't, then how can we expect a policy to distinguish them?

If we find a pair of candidates with similar features where one is good and the other is bad, study those cases to figure out what is going on.
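
A cheap way to run that check is a nearest-neighbor label-agreement measure over the candidate features, sketched below (illustrative only):

```python
import numpy as np

# Sketch: label each candidate good (appears in some best-scoring CSE set) or
# bad, then check whether nearest neighbors in feature space share labels.
# If they mostly don't, no policy over these features is likely to separate them.
def neighbor_agreement(features, labels):
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    agree = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude self
        agree += y[d.argmin()] == y[i]
    return agree / len(X)                  # fraction of candidates whose nearest
                                           # neighbor has the same good/bad label
```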

AndyAyersMS commented 7 months ago

Also intrigued by the ideas in "Simple random search provides a competitive approach to reinforcement learning", which operates in the parameter space directly. Might be an interesting alternative to implement, and it is similar to things I've tried in the past.
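
The basic version of that idea is easy to sketch: evaluate symmetric perturbations of the parameter vector and step along the reward differences. This is illustrative only; `evaluate` stands in for running the greedy policy over a method sample and returning an aggregate reward (eg a geomean perf-score ratio):

```python
import numpy as np

# Sketch of basic random search over the policy parameters: no gradients,
# just symmetric perturbations and a step in the direction that helped.
def random_search(evaluate, theta, step=0.02, noise=0.05, directions=8, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        deltas = rng.standard_normal((directions, theta.size))
        diffs = np.array([evaluate(theta + noise * d) - evaluate(theta - noise * d)
                          for d in deltas])
        theta = theta + step / directions * (diffs @ deltas)
    return theta
```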

AndyAyersMS commented 7 months ago

Also, for features: once we have decent/stable results we can try dropping features from the set to see how much the lack of a feature impairs overall results.

Might also search for correlated features and try to decorrelate them (eg Gram-Schmidt orthogonalization or similar). Say the value of feature A predicts the value of feature B pretty well; then it might be worth subtracting the predicted B (via A) from the actual B, so the reported features are A and B - (A's prediction of B).

The two ideas above are similar: if A and B are correlated, then having them both provides little additional benefit over having just one of them.
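
A lightweight way to do the decorrelation is to replace B with the residual after regressing B on A, sketched below (a stand-in for a full Gram-Schmidt pass):

```python
import numpy as np

# Sketch: report feature B as the part of B not explained by A, i.e.
# B - (A's least-squares prediction of B). With standardized features this is
# essentially one Gram-Schmidt step.
def decorrelate(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    A = np.column_stack([np.ones_like(a), a])       # intercept + A
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return b - A @ coef                             # residual feature
```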

AndyAyersMS commented 6 months ago

Have been looking into improving efficiency. The PolicyGradient driver has two phases -- a one-off phase where it runs method by method (using Parallel.For to launch multiple SPMI instances), and a batch phase where it processes them all (using SPMI's built-in parallelism).

;; method by method
 6231 methods in 45716ms
;; batch
35482 methods in 45099ms

So the method by method has ~6x overhead...

I have tried building all the stuff needed into a release JIT; it is (perhaps) nearly there, but that won't explain the above, as both versions are running checked JITs.

The batch mode is about what I'm used to on this box, roughly 1K methods/sec.

AndyAyersMS commented 6 months ago

Here's a trial that enables the new heuristic: https://github.com/dotnet/runtime/pull/98776

Have added some notes there on the behaviors.

JulieLeeMSFT commented 1 month ago

Closing, as the .NET 9 activities are done. We will revisit the work to do during .NET 10 planning.