MPP Should Have Single Global Budget

Thinking about #3863, and experimenting with a badly-connected node, suggests to me that we badly mismanage the given pay budget.

I made some large payments of about a millibitcoin or so (large relative to the badly-connected node, it has barely 10 millibitcoins locked up to nodes that turn out to themselves be fairly badly connected when I do some manual 2-hop sendpay probes from the badly-connected node) and when giving maxfeepercent of 1% or 2% it fails, but afterwards with doing 5% it succeeds.

Yet when I look at the listpays report it claims to have sent out only around ~0.6% more than the requested amount, i.e. the feepercent is only 0.6%. In principle the payment solution it found at maxfeepercent=5 should have been discoverable at maxfeepercent=1 or 2. Either listpays is wrong, or we are badly mismanging our fee budget.

I think we should instead make a single budget for a tree of payments, rather than the "budget-stealing" idea suggested in #3863.

Basically:

Add a new struct payment_budget which is initialized to the amount_msat of the fee budget.
- The root payment allocates and constructs this object.
- Child payments copy the pointer to this structure from the parent.
- Could just be a struct amount_msat directly, why do we need a new struct type for it anyway?
Remove the fee-checking in the post-getroute processing.
- Remove the constraints thing, it should not be necessary, but cutting that out may be the hardest part of this effort, as we have some assumptions about it.
Prior to createonion+sendonion+waitsendpay, check the fee of the route (value at last hop minus value at first hop).
- If this fee is larger than the payment_budget, pretend this is a waitsendpay returning with a PAY_ROUTE_TOO_EXPENSIVE (which waitsendpay will not actually return, but is a useful fiction to have in the paymod system) --- invent a payment_result with code=PAY_ROUTE_TOO_EXPENSIVE and pointless values in other fields --- and then call payment_failed, and let paymods figure out what to do about it (we could make a new paymod that does the exempting of channels with high feerates in case of that error).
- If the fee is less than or equal to the payment_budget we deduct it from the payment budget and continue with the createonion/sendonion/waitsendpay sequence.
If any of the createonion sendonion or waitsendpay fail, return the allocated fee back to the payment_budget.
- Create new payment_createonion_fail etc. functions which do this returning of the fee budget, and then forward the call to whatever was the previous fail-handler for that RPC (payment_rpc_failure for createonion and sendonion, payment_waitsendpay_finished for waitsendpay).

The above should basically allocate budgets to each sub-payment as "tightly" as possible, and means that even on my badly-connected node, it should be able to acquire payment solutions of 0.6% with just maxfeepercent=2 or even maxfeepercent=1.

To be clear, I am not working on this for now, so anyone who is willing to poke into plugins/libplugin-pay.c can go try to implement the above. Not sure if this is a good first issue, but hey, just try, mkay?

While I see the rationale behind the proposal I don't think it actually improves our success rate, and may in fact be detrimental, leading to payments not completing due to a one-sided assignment of fees to a partial payment, leaving other parts with insufficient budget.

I'm convinced that we need to make sure that each attempt/part is assigned the minimal fee budget possible. If we do not do this, we may end up with an early part using the majority of the fee budget, leaving little to no fees for other parts. The payment as a whole would be doomed, since the fee-using part will only get revoked by the MPP timeout, and outstanding parts don't stand a chance, because of insufficient budget left over.

It is for this reason that I proposed the fee-stealing solution, in which we have a heuristic that tells us if an attempt reached the destination, and therefore its remaining fee budget can be redistributed to the other attempts to boost their chances of success. Think of it as the counterproposal to this, in that we start from the most constrained version, loosening the constraints as we learn about the network. This ensures minimal interference between parts of a payment.

Your proposal on the other hand starts with the loosest constraints, tightening them as we go along, and potentially leading us into a dead-end that we can't escape from. This in turn is a direct result of parts sharing a fee budget.

My mental model is that earlier sub-parts of the payment are more likely to try more direct routes due to having less countermanding information that the direct routes are not passable. Direct routes are those that have lower fees relative to the size of the subpayment. Thus, it should also have the behavior that they start out with lower fees in general, then move on to higher and higher fee-routes, because that is how getroute works, it always gives the shortest route available for the given exclusions, and there are fewer exclusions earlier than later.

In particular, from my observation from my badly-connected node: most of the time the budget allocation to each sub-payment you fire out is almost entirely not used. Many of the sub-payments that reach the destination ultimately take up a tiny amount of their allocated budget used in fees, while later, smaller sub-payments are strapped for fees and are failed due to lack of fees anyway.

With a single global budget, the earlier sub-payments that do reach the destination deduct only a small fee, because they tend to be on the shorter and cheaper routes, but later payments have to contend with a network that already has funds locked up on those routes. So this just dynamically gives the earlier payments with more direct routes less fee budget and the later routes (that now have to live in a network with some payments locked up by your earlier sub-payments) get more fee budget.

Thus, while your theory on budget constraints may seem compelling, I am unconvinced that it actually holds up to how the network works. Indeed, the necessity of stealing budget suggests that the initial fee allocation is just wrong --- if your initial allocation of the fee budget had been correct in the first place, stealing budget would not even be necessary.

Your approach is like allocating memory directly to each application in proportion to the size of its executable file, instead of just allocating memory dynamically when the application actually runs. Then you have to fix it up later when one of the applications turns out to take up a lot of memory relative to its code, and steal some unused memory allocated for another application that ended up not using it. I am unconvinced that is the better way to allocate resources.

So let us simulate.

Suppose we have a network where there are three viable routes. One costs 1 msat, the next costs 2 msat, the third costs 3 msat. However, each of the routes has limited capacity, and let us suppose that it can take only exactly one sub-payment.

Now, let us suppose that our presplitter splits the payment into three equal payments, and we have a fee budget of exactly 6 msat.

Under your strategy of "split the budget and steal unused budget later". the three sub-payments will each get 2 msat of budget. The first one tried goes through the cheapest 1msat route fine, with 1 msat of leftover fee budget. The second one tries the cheapest route again, but fails because its capacity was already consumed by the first sub-payment. It tries the second-cheapest route of 2msat, and gets it with exactly the fee budget. The third sub-payment tries the earlier routes as well, but again fails due to capacity limits. Its remaining viable path takes 3 msat, which is 1 msat higher than its preallocated 2msat budget. So it has to go steal budget. It steals the 1 msat budget from the first sub-payment, and pushes through.

Under my strategy of "global budget", the three sub-payments initially get sent over the chepaest 1msat route, and get allocated 1 msat each, leaving 3 msat in the global budget. However, only one of the sub-payments will succeed due to capacity constraints on the network, so the other two fail and return their 1msat budgets back to the global budget, bringing it up to 5 msat. The sub-payments then get tried on the longer 2-msat route, getting 2 msat budget each, bringing the global budget to a measly 1 msat. They are fired on the 2-msat route, but one of them returns due to insufficient capacity. That one returns the 2msat budget back to the global budget which goes back up to 3 msat, then it tries 3-msat route, getting the 3 msat left in the global budget. Then the entire payment pushes through.

Now, the key insight you have to have is this:

In budget stealing, the sub-payment you are stealing from has to accurately record how much of the budget allocated to it, it is not actually using.

Then we can make this optimization:

Instead of keeping the "amount of budget I am not using" with each sub-payment, just make a single variable being "the amount of budgets all previous fired sub-payments are not using".

This is because budget-stealing may need to steal from more than one previous sub-payment as well.

However, having a single budget that represents "the amount of budget all previous fired sub-payments are not using" is simpler.

With budget-stealing, you have to somehow get references to all the other ongoing payments.
- With our current tree structure, that means traversing the tree up to the root, then retraversing downward to all children. That... is not a very efficient operation.
In principle, if you are a sub-payment that is strapped for fee budget, you do not care from whom you steal budget from, you just need to ensure you are getting from the unused part of the budget from all the sub-payments you are stealing from.
- So why should you even be tracking all the other sub-payments and exactly how much budget they are not using individually? Just track the part of the budget that none of the existing sub-payments are currently using. That means single global budget.

Basically --- with budget-stealing, every sub-payment has to maintain a separate field for "this is the pat of the allocated budget I am not using". Instead of having lots of little pools that we have to handle, just make one big pool meaning "this is the part of the total allocated budget that none of us are currently using". The sum of the individual budget-stealable amounts, and the single global budget I am proposing, will be exactly the same anyway, so why complicate the solution unnecessarily? Just do not split up the unused part of the budget into tiny pools that you have to virtually re-merge anyway when you steal, then split back again, like, why would you make more work for yourself?

So yes, I think this is just substantially the same as your budget-stealing idea, except more efficient due to being simpler.

Even with path randomization, where you might speculatively avoid the shortest path in order to mislead evil surveillors, does not require per-part budgets. For every part under consideration, you could select any path at random whose fee is below global_budget * sub_payment_amount / total_amount, then before firing the payment along that path, deduct its fee from the global budget. It will still behave approximately the same (esepcially if you bias towards selecting shorter routes, which you want to do as a general policy to improve your chances of keeping within the total fee budget), but with a much simpler single budget than managing lots of tiny budgets that you have to periodically re-merge when you inevitably have to steal anyway, and then re-split back after stealing.

ElementsProject / lightning

MPP Should Have Single Global Budget #4104