Hmm, I just discovered a critical bug in this that could invalidate all outcomes ... I never started more than one build per event :see_no_evil:
I’m using this pull request to dump some thoughts:
Hah, so just as I thought I had a pretty neat solution, the nerd snipe gets way worse. I was confused about how to update the probabilities because the order in which I applied the updates mattered, so I asked about it on Stack Exchange, and the key insight (thanks to user mef) is: we shouldn’t track probabilities per pull request; instead we should have a probability distribution over all 2^n possible states of the n pull requests.
This makes a lot of things more elegant. Previously, I was struggling a bit with the fact that when a build fails, if it contained more than one pull request, we don’t know which one(s) are bad, so it seems we made zero progress. But that doesn’t feel quite right: the failure did make that set of pull requests more suspect, so we learned something. But how do we quantify that? With a probability distribution over all possible states, this is obvious: we reduced the entropy of the distribution! So beautiful!
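To make the entropy idea concrete, here is a small illustrative sketch (Python, not the actual simulator code; the 0.85 is-good prior matches the simulation setup below, everything else is made up): a distribution over all 2^n good/bad states, and how conditioning on one failed build lowers the entropy even though no individual PR is confirmed good or bad.

```python
from itertools import combinations
from math import log2

def initial_distribution(pr_ids, p_good=0.85):
    """P(state) for every possible set of bad PRs, assuming independent PRs."""
    dist = {}
    for k in range(len(pr_ids) + 1):
        for bad in combinations(pr_ids, k):
            bad = frozenset(bad)
            p = 1.0
            for pr in pr_ids:
                p *= (1.0 - p_good) if pr in bad else p_good
            dist[bad] = p
    return dist

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0.0)

def observe(dist, built, success):
    """Condition on a build outcome: a build succeeds iff it contains no bad PR."""
    consistent = {
        bad: p for bad, p in dist.items()
        if (len(bad & built) == 0) == success
    }
    total = sum(consistent.values())
    return {bad: p / total for bad, p in consistent.items()}

prs = [1, 2, 3, 4]
dist = initial_distribution(prs)
print(f"entropy before: {entropy(dist):.3f} bits")
# A build of PRs {1, 2} fails: we still don't know which of the two is bad,
# but the distribution did become more peaked, so we learned something.
dist = observe(dist, frozenset({1, 2}), success=False)
print(f"entropy after failed build of {{1, 2}}: {entropy(dist):.3f} bits")
```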
Also, previously I had this feeling that the Bayesian approach was a different approach from something based more on set operations. E.g. an alternative strategy would be to bisect failed builds, splitting the set to test in half every time. That seems like a worthwhile approach, but not something that naturally falls out of my previous approach. But if you think of the distribution over all possible PR states, then these two approaches are no longer incompatible. Learning the outcome of a build means learning that all counterfactual states have probability 0, and from that you can conclude that when a PR is good/bad in every remaining state, it must be good/bad. But we learn more than just this binary information. This also guides us on e.g. which subset to pick when we want to bisect a build.
Now the optimization goal is almost: find the subset of pull requests to build that, when we learn its outcome, maximizes the expected reduction in entropy. I suspect just that strategy would do well on its own. But there is also the wait time. Building a particular subset might give us a lot of information, yet not reduce the backlog at all, so we incur more wait time than with an alternative that reduces the entropy less, but also confirms some PRs. I suspect this different goal can be optimized for with only a slight modification of the strategy.
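A sketch of that selection rule, again illustrative rather than the real implementation (it redefines the helpers from the sketch above so it runs on its own, and it ignores the wait-time aspect entirely): score every candidate subset by its expected entropy reduction and pick the best one.

```python
from itertools import combinations
from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0.0)

def condition(dist, built, success):
    consistent = {bad: p for bad, p in dist.items()
                  if (len(bad & built) == 0) == success}
    total = sum(consistent.values())
    return {bad: p / total for bad, p in consistent.items()} if total > 0 else {}

def expected_information_gain(dist, built):
    """Expected entropy reduction over the two possible outcomes of building `built`."""
    p_success = sum(p for bad, p in dist.items() if len(bad & built) == 0)
    h_before = entropy(dist)
    gain = 0.0
    for success, p_outcome in ((True, p_success), (False, 1.0 - p_success)):
        if p_outcome > 0.0:
            gain += p_outcome * (h_before - entropy(condition(dist, built, success)))
    return gain

def most_informative_build(dist, pr_ids):
    candidates = [frozenset(c) for k in range(1, len(pr_ids) + 1)
                  for c in combinations(pr_ids, k)]
    return max(candidates, key=lambda c: expected_information_gain(dist, c))

# Tiny demo with three open PRs and an independent 0.85 is-good prior.
prs, p_good = [1, 2, 3], 0.85
dist = {}
for k in range(len(prs) + 1):
    for bad in combinations(prs, k):
        bad = frozenset(bad)
        dist[bad] = 1.0
        for pr in prs:
            dist[bad] *= (1.0 - p_good) if pr in bad else p_good

# With this prior it picks the full set {1, 2, 3}: pure information gain
# happily builds everything at once, which is exactly where the wait-time
# consideration above has to come in.
print(most_informative_build(dist, prs))
```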
So plenty of food for thought again:
Context: After https://github.com/channable/hoff/issues/77#issuecomment-1179430191 I was a bit nerd-sniped, so I wrote a simulator to better understand how different strategies affect the backlog. I don’t expect anybody to review this, I just want to put this out there to share the results.
Background
Hoff has a backlog of approved pull requests that we want to merge. Pull requests can leave the backlog either by getting merged, or by being confirmed as failing. Pull requests in the backlog are waiting. I believe Hoff’s №1 priority should be to minimize the total time spent waiting.¹ That doesn’t uniquely define the strategy, because making an old or a new pull request wait one more step has the same effect on total wait time, but very different effects on the wait time distribution.
I set up a simulator that simulates 250 pull requests coming in following a Poisson process, with an additional gamma-distributed delay to simulate review. This means pull requests arrive roughly in order of ascending id, but not exactly. There is no seasonality (e.g. only generating pull requests during office hours), but still, this should give us a rough idea of how strategies behave under various loads. I assume that 85% of the pull requests will build successfully, and that there are no flaky builds.
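For reference, the arrival model looks roughly like this (illustrative Python, not the simulator’s actual code; all parameter values except the 85% success probability are made up):

```python
import random

def generate_pull_requests(n=250, arrival_rate=0.1, review_shape=2.0,
                           review_scale=30.0, p_good=0.85, seed=42):
    """Returns (pr_id, approval_time, is_good) tuples, sorted by approval time."""
    rng = random.Random(seed)
    prs = []
    t_open = 0.0
    for pr_id in range(n):
        # Poisson process: exponential inter-arrival times between PRs being opened.
        t_open += rng.expovariate(arrival_rate)
        # An extra gamma-distributed review delay, so approvals come in roughly,
        # but not exactly, in order of ascending id.
        t_approved = t_open + rng.gammavariate(review_shape, review_scale)
        is_good = rng.random() < p_good
        prs.append((pr_id, t_approved, is_good))
    prs.sort(key=lambda pr: pr[1])
    return prs
```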
¹ This was not obvious to me at first, I thought its goal should be to maximize the number of pull requests merged per build. But if you have some pull requests that are likely bad, maximizing the number of pull requests merged means you should prefer extending `master` over confirming that these pull requests are bad, and they spend a long time waiting.
Regimes
We have pull requests coming in at a certain rate, say r_p, and we can test pull requests at a certain rate, say r_t. We have r_t = j / t_b, where j is the number of parallel builds we can do, and t_b is the average time per build. Define r_p = r_t as the critical point. We can distinguish two regimes:
Subcritical, r_p < r_t. On average, we have more capacity to build than pull requests coming in, so no long-term backlog should build up.
Supercritical, r_p > r_t. On average, we have more pull requests coming in than we can build, so unless we build multiple pull requests per build, we don’t even have any hope of clearing the backlog.
In the plots below, I indicate the criticality, defined as r_p / r_t. As we get closer to a criticality of 1.0, backlogs start to grow, and eventually grow without bound. This happens even before the critical point, because if we were briefly unlucky over some time window and more pull requests arrived than we could build, then after that, the backlog does not shrink on average. With parallel builds this happens sooner, because parallel builds are speculative: a failed build can invalidate other in-progress builds, which reduces the effective r_t.
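As a toy example of these definitions (numbers made up):

```python
j = 1          # number of parallel builds
t_b = 20.0     # average build time in minutes
r_t = j / t_b  # builds per minute

r_p = 2.0 / 60.0  # two pull requests approved per hour, in PRs per minute

criticality = r_p / r_t
print(f"r_t = {r_t:.3f}/min, r_p = {r_p:.3f}/min, criticality = {criticality:.2f}")
# criticality = 0.67: subcritical, so on average no long-term backlog should
# build up, at least as long as failed builds don't eat into the effective r_t.
```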
In the supercritical regime, it is a bit tricky to talk about the wait time distribution, because most pull requests do not get merged at all, so the longer you run the simulation, the wider the tail of the distribution grows. In some cases the wait time grows without bound over time, but for some strategies we can still conclude some useful things.
Strategies
A strategy determines what Hoff will build next, and in the case of parallelism, on top of what.
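Purely to fix ideas, a strategy’s “type” could be written like this (illustrative only, these are not the simulator’s actual types):

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

PrId = int
Build = FrozenSet[PrId]  # the set of pull requests included in one build

# Given the approved backlog and the builds currently in progress, a strategy
# picks the next build to start, and the in-progress build it speculates on
# top of (None means it builds directly on top of master).
Strategy = Callable[[List[PrId], List[Build]], Tuple[Build, Optional[Build]]]
```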
Sequential
Classic — This is the strategy that Hoff originally used: build the pull request with the lowest id first.
Fifo — This is the strategy that Hoff uses since #25: build the least-recently approved pull request first.
Lifo — Build the most-recently approved pull request first.
Bayesian — What I proposed in https://github.com/channable/hoff/issues/77#issuecomment-1179430191: minimize the expected wait time, based on an estimated is-good probability per pull request.
Of these, classic, fifo, and lifo all build a single pull request per build, while Bayesian can build “rollups” of multiple pull requests. Typically humans consider classic and fifo more or less “fair”, and lifo very unfair. Lifo does have desirable properties in the supercritical regime, but it wouldn’t be practical, because it can easily be gamed by closing and re-approving a pull request. Bayesian is a little under-specified: when multiple pull requests have the same is-good probability, it needs to break ties in some order. I opted for classic (by lowest pull request id) in this case.
Parallel
Classic, fifo, and lifo can all be extended to parallel builds using optimistic speculation. They assume that the build which is in progress will succeed, and then apply the original strategy to determine which pull request to build next. This creates a “train” of builds. Every build only includes a single pull request more than its parent. I think the parallel extension of fifo is what @rudymatela proposes in https://github.com/channable/hoff/issues/77#issuecomment-1184294121. (It’s also what I originally thought would be the best extension to explore first, but I have since changed my mind about this, and I now think rollups are inevitable, because optimistic speculation breaks down near the critical point, and adding more parallel builds does not help, because as you speculate further ahead, the probability that the scenario will succeed goes to zero.)
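Roughly how I think of the optimistic extension, as a sketch (illustrative names and shapes, not the simulator’s code):

```python
def next_speculative_build(backlog, builds_in_progress, order_key):
    """Pick the next build by assuming the newest in-progress build succeeds.

    backlog: approved pull requests not yet merged or failed.
    builds_in_progress: the current "train", each build a list of PRs,
    ordered from closest to master to most speculative.
    order_key: the underlying sequential strategy (e.g. approval time for fifo).
    """
    parent = builds_in_progress[-1] if builds_in_progress else []
    remaining = [pr for pr in backlog if pr not in parent]
    if not remaining:
        return None
    # Add exactly one more pull request on top of the optimistic parent state.
    chosen = min(remaining, key=order_key)
    return parent + [chosen]

# Fifo example: pr2 is already being built, so the next build speculates on
# top of it and adds pr3, the least recently approved of the rest.
backlog = [("pr2", 10.0), ("pr3", 12.0), ("pr5", 15.0)]
running = [[("pr2", 10.0)]]
print(next_speculative_build(backlog, running, order_key=lambda pr: pr[1]))
```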
My implementation of Bayesian as described above never builds things in parallel. I added Bayesian parallel, which is a refinement of https://github.com/channable/hoff/issues/77#issuecomment-1179515757: generate candidates from every build in progress, by considering the state where it succeeds and the state where it fails, and applying the original Bayesian strategy in that state. Then pick the candidate that has the greatest expected reduction in backlog size. I don’t know whether Bayesian parallel actually minimizes the expected total wait time; I suspect it does not. In particular, if Bayesian decides to build all pending pull requests in a single build, then Bayesian parallel will not use the additional build slots, even though it would be better to use those build slots and break the original build into smaller steps. There is room for research here!
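As pseudocode, the candidate generation looks something like this (the sequential strategy `bayesian_choice`, the `assume_outcome` state update, and the backlog-reduction scoring are stand-ins here, not worked out):

```python
def bayesian_parallel(state, builds_in_progress, bayesian_choice,
                      expected_backlog_reduction):
    """Pick the next build when some builds are already running.

    For every build in progress, hypothesize both the state where it succeeds
    and the state where it fails, and ask the sequential Bayesian strategy what
    it would build in that state. Then pick the candidate with the greatest
    expected reduction in backlog size.
    """
    candidates = []
    for build in builds_in_progress:
        for succeeded in (True, False):
            hypothetical = state.assume_outcome(build, succeeded)
            candidates.append(bayesian_choice(hypothetical))
    if not candidates:
        # Nothing in progress: fall back to the plain sequential strategy.
        return bayesian_choice(state)
    return max(candidates, key=lambda c: expected_backlog_reduction(state, c))
```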
Results
Sequential
Observations:
Parallel
Observations:
Conclusions
Future work
Some things to try: