Hmm, I just discovered a critical bug in this that could invalidate all outcomes ... I never started more than one build per event :see_no_evil:
I’m using this pull request to dump some thoughts:
Hah, so just as I thought I had a pretty neat solution, the nerd snipe gets way worse. I was confused about how to update the probabilities because the order in which I applied the updates mattered, so I asked about it on Stack Exchange, and the key insight (thanks to user mef) is: we shouldn’t track probabilities per pull request; instead we should have a probability distribution over all 2^n possible states of the n pull requests.
This makes a lot of things more elegant. Previously, I was struggling a bit with the fact that when a build fails, if it contained more than one pull request, we don’t know which one(s) are bad, so it seems we made zero progress. But that doesn’t feel quite right: the failure did make that set of pull requests more suspect, so we learned something. But how do we quantify that? With a probability distribution over all possible states, this is obvious: we reduced the entropy of the distribution! So beautiful!
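To make the entropy idea concrete, here is a small illustrative sketch (Python, not the actual simulator code; the 0.85 is-good prior matches the simulation setup below, everything else is made up): a distribution over all 2^n good/bad states, and how conditioning on one failed build lowers the entropy even though no individual PR is confirmed good or bad.

```python
from itertools import combinations
from math import log2

def initial_distribution(pr_ids, p_good=0.85):
    """P(state) for every possible set of bad PRs, assuming independent PRs."""
    dist = {}
    for k in range(len(pr_ids) + 1):
        for bad in combinations(pr_ids, k):
            bad = frozenset(bad)
            p = 1.0
            for pr in pr_ids:
                p *= (1.0 - p_good) if pr in bad else p_good
            dist[bad] = p
    return dist

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0.0)

def observe(dist, built, success):
    """Condition on a build outcome: a build succeeds iff it contains no bad PR."""
    consistent = {
        bad: p for bad, p in dist.items()
        if (len(bad & built) == 0) == success
    }
    total = sum(consistent.values())
    return {bad: p / total for bad, p in consistent.items()}

prs = [1, 2, 3, 4]
dist = initial_distribution(prs)
print(f"entropy before: {entropy(dist):.3f} bits")
# A build of PRs {1, 2} fails: we still don't know which of the two is bad,
# but the distribution did become more peaked, so we learned something.
dist = observe(dist, frozenset({1, 2}), success=False)
print(f"entropy after failed build of {{1, 2}}: {entropy(dist):.3f} bits")
```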
Also, previously I had this feeling that the Bayesian approach was a different approach from something based more on set operations. E.g. an alternative strategy would be to bisect failed builds, splitting the set to test in half every time. That seems like a worthwhile approach, but not something that naturally falls out of my previous approach. But if you think of the distribution over all possible PR states, then these two approaches are no longer incompatible. Learning the outcome of a build means learning that all counterfactual states have probability 0, and from that you can conclude that when a PR is good/bad in every remaining state, it must be good/bad. But we learn more than just this binary information. This also guides us on e.g. which subset to pick when we want to bisect a build.
Now the optimization goal is almost: find the subset of pull requests to build that, when we learn its outcome, maximizes the expected reduction in entropy. I suspect just that strategy would do well on its own. But there is also the wait time. Building a particular subset might give us a lot of information, yet not reduce the backlog at all, so we incur more wait time than with an alternative that reduces the entropy less, but also confirms some PRs. I suspect this different goal can be optimized for with only a slight modification of the strategy.
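A sketch of that selection rule, again illustrative rather than the real implementation (it redefines the helpers from the sketch above so it runs on its own, and it ignores the wait-time aspect entirely): score every candidate subset by its expected entropy reduction and pick the best one.

```python
from itertools import combinations
from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0.0)

def condition(dist, built, success):
    consistent = {bad: p for bad, p in dist.items()
                  if (len(bad & built) == 0) == success}
    total = sum(consistent.values())
    return {bad: p / total for bad, p in consistent.items()} if total > 0 else {}

def expected_information_gain(dist, built):
    """Expected entropy reduction over the two possible outcomes of building `built`."""
    p_success = sum(p for bad, p in dist.items() if len(bad & built) == 0)
    h_before = entropy(dist)
    gain = 0.0
    for success, p_outcome in ((True, p_success), (False, 1.0 - p_success)):
        if p_outcome > 0.0:
            gain += p_outcome * (h_before - entropy(condition(dist, built, success)))
    return gain

def most_informative_build(dist, pr_ids):
    candidates = [frozenset(c) for k in range(1, len(pr_ids) + 1)
                  for c in combinations(pr_ids, k)]
    return max(candidates, key=lambda c: expected_information_gain(dist, c))

# Tiny demo with three open PRs and an independent 0.85 is-good prior.
prs, p_good = [1, 2, 3], 0.85
dist = {}
for k in range(len(prs) + 1):
    for bad in combinations(prs, k):
        bad = frozenset(bad)
        dist[bad] = 1.0
        for pr in prs:
            dist[bad] *= (1.0 - p_good) if pr in bad else p_good

# With this prior it picks the full set {1, 2, 3}: pure information gain
# happily builds everything at once, which is exactly where the wait-time
# consideration above has to come in.
print(most_informative_build(dist, prs))
```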
So plenty of food for thought again:
Context: After https://github.com/channable/hoff/issues/77#issuecomment-1179430191 I was a bit nerd-sniped, so I wrote a simulator to better understand how different strategies affect the backlog. I don’t expect anybody to review this, I just want to put this out there to share the results.
Background
Hoff has a backlog of approved pull requests that we want to merge. Pull requests can leave the backlog either by getting merged, or by being confirmed as failing. Pull requests in the backlog are waiting. I believe Hoff’s №1 priority should be to minimize the total time spent waiting.¹ That doesn’t uniquely define the strategy, because making an old or a new pull request wait one more step has the same effect on total wait time, but very different effects on the wait time distribution.
I set up a simulator that simulates 250 pull requests coming in following a Poisson process, with an additional gamma-distributed delay to simulate review. This means pull requests arrive roughly in order of ascending id, but not exactly. There is no seasonality (e.g. only generating pull requests during office hours), but still, this should give us a rough idea of how strategies behave under various loads. I assume that 85% of the pull requests will build successfully, and that there are no flaky builds.
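For reference, the arrival model looks roughly like this (illustrative Python, not the simulator’s actual code; all parameter values except the 85% success probability are made up):

```python
import random

def generate_pull_requests(n=250, arrival_rate=0.1, review_shape=2.0,
                           review_scale=30.0, p_good=0.85, seed=42):
    """Returns (pr_id, approval_time, is_good) tuples, sorted by approval time."""
    rng = random.Random(seed)
    prs = []
    t_open = 0.0
    for pr_id in range(n):
        # Poisson process: exponential inter-arrival times between PRs being opened.
        t_open += rng.expovariate(arrival_rate)
        # An extra gamma-distributed review delay, so approvals come in roughly,
        # but not exactly, in order of ascending id.
        t_approved = t_open + rng.gammavariate(review_shape, review_scale)
        is_good = rng.random() < p_good
        prs.append((pr_id, t_approved, is_good))
    prs.sort(key=lambda pr: pr[1])
    return prs
```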
¹ This was not obvious to me at first, I thought its goal should be to maximize the number of pull requests merged per build. But if you have some pull requests that are likely bad, maximizing the number of pull requests merged means you should prefer extending `master` over confirming that these pull requests are bad, and they spend a long time waiting.
Regimes
We have pull requests coming in at a certain rate, say r_p, and we can test pull requests at a certain rate, say r_t. We have r_t = j / t_b, where j is the number of parallel builds we can do, and t_b is the average time per build. Define r_p = r_t as the critical point. We can distinguish two regimes:
Subcritical, r_p < r_t. On average, we have more capacity to build than pull requests coming in, so no long-term backlog should build up.
Supercritical, r_p > r_t. On average, we have more pull requests coming in than we can build, so unless we build multiple pull requests per build, we don’t even have any hope of clearing the backlog.
In the plots below, I indicate the criticality, defined as r_p / r_t. As we get closer to a criticality of 1.0, backlogs start to grow, and eventually grow without bound. This happens even before the critical point, because if we were briefly unlucky over some time window and more pull requests arrived than we could build, then after that, the backlog does not shrink on average. With parallel builds this happens sooner, because parallel builds are speculative: a failed build can invalidate other in-progress builds, which reduces the effective r_t.
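As a toy example of these definitions (numbers made up):

```python
j = 1          # number of parallel builds
t_b = 20.0     # average build time in minutes
r_t = j / t_b  # builds per minute

r_p = 2.0 / 60.0  # two pull requests approved per hour, in PRs per minute

criticality = r_p / r_t
print(f"r_t = {r_t:.3f}/min, r_p = {r_p:.3f}/min, criticality = {criticality:.2f}")
# criticality = 0.67: subcritical, so on average no long-term backlog should
# build up, at least as long as failed builds don't eat into the effective r_t.
```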
In the supercritical regime, it is a bit tricky to talk about the wait time distribution, because most pull requests do not get merged at all, so the longer you run the simulation, the wider the tail of the distribution grows. In some cases the wait time grows without bound over time, but for some strategies we can still conclude some useful things.
Strategies
A strategy determines what Hoff will build next, and in the case of parallelism, on top of what.
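Purely to fix ideas, a strategy’s “type” could be written like this (illustrative only, these are not the simulator’s actual types):

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

PrId = int
Build = FrozenSet[PrId]  # the set of pull requests included in one build

# Given the approved backlog and the builds currently in progress, a strategy
# picks the next build to start, and the in-progress build it speculates on
# top of (None means it builds directly on top of master).
Strategy = Callable[[List[PrId], List[Build]], Tuple[Build, Optional[Build]]]
```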
Sequential
Classic — This is the strategy that Hoff originally used: build the pull request with the lowest id first.
Fifo — This is the strategy that Hoff uses since #25: build the least-recently approved pull request first.
Lifo — Build the most-recently approved pull request first.
Bayesian — What I proposed in https://github.com/channable/hoff/issues/77#issuecomment-1179430191: minimize the expected wait time, based on an estimated is-good probability per pull request.
Of these, classic, fifo, and lifo all build a single pull request per build, while Bayesian can build “rollups” of multiple pull requests. Typically humans consider classic and fifo more or less “fair”, and lifo very unfair. Lifo does have desirable properties in the supercritical regime, but it wouldn’t be practical, because it can easily be gamed by closing and re-approving a pull request. Bayesian is a little under-specified: when multiple pull requests have the same is-good probability, it needs to break ties in some order. I opted for classic (by lowest pull request id) in this case.
Parallel
Classic, fifo, and lifo can all be extended to parallel builds using optimistic speculation. They assume that the build which is in progress will succeed, and then apply the original strategy to determine which pull request to build next. This creates a “train” of builds. Every build only includes a single pull request more than its parent. I think the parallel extension of fifo is what @rudymatela proposes in https://github.com/channable/hoff/issues/77#issuecomment-1184294121. (It’s also what I originally thought would be the best extension to explore first, but I have since changed my mind about this, and I now think rollups are inevitable, because optimistic speculation breaks down near the critical point, and adding more parallel builds does not help, because as you speculate further ahead, the probability that the scenario will succeed goes to zero.)
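Roughly how I think of the optimistic extension, as a sketch (illustrative names and shapes, not the simulator’s code):

```python
def next_speculative_build(backlog, builds_in_progress, order_key):
    """Pick the next build by assuming the newest in-progress build succeeds.

    backlog: approved pull requests not yet merged or failed.
    builds_in_progress: the current "train", each build a list of PRs,
    ordered from closest to master to most speculative.
    order_key: the underlying sequential strategy (e.g. approval time for fifo).
    """
    parent = builds_in_progress[-1] if builds_in_progress else []
    remaining = [pr for pr in backlog if pr not in parent]
    if not remaining:
        return None
    # Add exactly one more pull request on top of the optimistic parent state.
    chosen = min(remaining, key=order_key)
    return parent + [chosen]

# Fifo example: pr2 is already being built, so the next build speculates on
# top of it and adds pr3, the least recently approved of the rest.
backlog = [("pr2", 10.0), ("pr3", 12.0), ("pr5", 15.0)]
running = [[("pr2", 10.0)]]
print(next_speculative_build(backlog, running, order_key=lambda pr: pr[1]))
```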
My implementation of Bayesian as described above never builds things in parallel. I added Bayesian parallel, which is a refinement of https://github.com/channable/hoff/issues/77#issuecomment-1179515757: generate candidates from every build in progress, by considering the state where it succeeds and the state where it fails, and applying the original Bayesian strategy in that state. Then pick the candidate that has the greatest expected reduction in backlog size. I don’t know whether Bayesian parallel actually minimizes the expected total wait time; I suspect it does not. In particular, if Bayesian decides to build all pending pull requests in a single build, then Bayesian parallel will not use the additional build slots, even though it would be better to use those build slots and break the original build into smaller steps. There is room for research here!
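As pseudocode, the candidate generation looks something like this (the sequential strategy `bayesian_choice`, the `assume_outcome` state update, and the backlog-reduction scoring are stand-ins here, not worked out):

```python
def bayesian_parallel(state, builds_in_progress, bayesian_choice,
                      expected_backlog_reduction):
    """Pick the next build when some builds are already running.

    For every build in progress, hypothesize both the state where it succeeds
    and the state where it fails, and ask the sequential Bayesian strategy what
    it would build in that state. Then pick the candidate with the greatest
    expected reduction in backlog size.
    """
    candidates = []
    for build in builds_in_progress:
        for succeeded in (True, False):
            hypothetical = state.assume_outcome(build, succeeded)
            candidates.append(bayesian_choice(hypothetical))
    if not candidates:
        # Nothing in progress: fall back to the plain sequential strategy.
        return bayesian_choice(state)
    return max(candidates, key=lambda c: expected_backlog_reduction(state, c))
```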
Results
Sequential
Observations:
Parallel
Observations:
Conclusions
Future work
Some things to try: