Closed JssDWt closed 6 months ago
Updated the test to remove some unnecessary noise.
This test appears to fail when the payment size is above half the available channel capacity. So the issue here is probably that the htlc from the first attempt has not yet been removed from the channel. When the second attempt is tried, we won't find a route because we don't have the available balance for the payment ourselves. The payment is then split, but the split combinations lead to the same balance issues (and maybe amplify them?).
I haven't yet found the cause, but this stood out to me.
Some logs say Not using a routehint
Some logs say Using routehint 0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199 (111x1x0) cltv_delta=6
This indicates that some parts are not using the same flow as other parts.
I think the trickiness here is somewhere in the payment modifiers. Trying to get my head around that.
Would it make a difference if dest opens a channel with randomnode instead of the other way around and then depleting it?
That makes sense. Made that channel the other way around, and also got rid of the LSP. So it's a 4 nodes setup now. I made the sender wait for the proper gossip.
So it doesn't make a difference, the test still fails.
On the first payment attempt, the estimated capacity for the sender's local channel is 175320000msat. On the second payment attempt, the estimated capacity is 75319800msat (less than the amount we're trying to send).
This makes sense, because the local_channel_hints payment modifier will call listpeerchannels and update the local channel hints here: https://github.com/ElementsProject/lightning/blob/4b8d2617bbbc678a67376b32332804b66a691034/plugins/libplugin-pay.c#L2468-L2470
The htlc from the first payment attempt is not yet removed from the channel, so that balance is unusable. This also explains why this is a race, sometimes the test succeeds and sometimes it doesn't. If the htlc is removed from the channel, the payment will succeed. If not, the payment will fail due to insufficient balance.
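The capacities from the logs fit this explanation. Here is a rough sketch of the arithmetic, assuming the whole drop in estimated capacity is locked up in the htlc from the first attempt. The `payment_msat` value and the `can_send` helper are hypothetical, chosen for illustration; they are not taken from the logs or from the plugin.

```python
# Estimated capacities reported for the sender's local channel.
first_attempt_capacity_msat = 175_320_000
second_attempt_capacity_msat = 75_319_800

# Assumption: the entire difference is held by the htlc of the first attempt.
in_flight_msat = first_attempt_capacity_msat - second_attempt_capacity_msat

# Hypothetical payment amount, for illustration only.
payment_msat = 100_000_000


def can_send(capacity_msat: int, amount_msat: int) -> bool:
    """Return True if the estimated capacity covers the amount."""
    return amount_msat <= capacity_msat


print(in_flight_msat)                                        # 100000200
print(can_send(first_attempt_capacity_msat, payment_msat))   # True
print(can_send(second_attempt_capacity_msat, payment_msat))  # False
```

With these numbers, any amount above 75319800msat cannot be sent in one part over the local channel until the htlc is removed.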
My suggestion would be to wait a little bit after receiving a payment failure, to make sure the htlc is completely removed from the channel. And only then initiate new payment attempts. Note that if the local balance is an issue, it would definitely make sense to either fail fast and not try any split payments afterwards, or wait for the htlc to be removed from the channel.
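A minimal sketch of the "wait for the htlc to be removed" idea, written as a generic polling helper. The `list_htlcs` callable is hypothetical: in a real test it might be built on the htlcs reported by listpeerchannels for the sender's local channel, but that wiring is an assumption and is not shown here.

```python
import time
from typing import Callable, Sequence


def wait_for_htlcs_cleared(
    list_htlcs: Callable[[], Sequence[dict]],
    timeout: float = 10.0,
    interval: float = 0.1,
) -> bool:
    """Poll until no htlcs are pending, or until the timeout expires.

    `list_htlcs` is a hypothetical callable that returns the htlcs still
    attached to the sender's local channel.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not list_htlcs():
            return True   # Channel is clean, safe to retry.
        if time.monotonic() >= deadline:
            return False  # Gave up waiting.
        time.sleep(interval)


# Stand-in for a real node: the htlc disappears on the third poll.
_polls = iter([[{"id": 0}], [{"id": 0}], []])
cleared = wait_for_htlcs_cleared(lambda: next(_polls), timeout=1.0, interval=0.01)
print(cleared)
```

A retry would only be started once the helper returns True. If it returns False, failing fast may be preferable to splitting the payment.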
Note that this problem may be amplified on nodes that have slow signers, like on greenlight.
Insight:
The estimated capacity is actually added back to the channel hint in time here https://github.com/ElementsProject/lightning/blob/4b8d2617bbbc678a67376b32332804b66a691034/plugins/libplugin-pay.c#L586-L589
When the estimated capacity is checked here later however, it sees the old value https://github.com/ElementsProject/lightning/blob/4b8d2617bbbc678a67376b32332804b66a691034/plugins/libplugin-pay.c#L697
I checked the pointers to the channel_hints array for the root payment in both places.
When the amount is subtracted from the channel hint, I'm seeing pointer A.
When the amount is added back to the channel hint, I'm seeing pointer B.
When we're doing the pathfinding, I'm seeing pointer A again.
So the amount was added back to a copy of the channel_hints array?
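To illustrate why that would produce the stale value, here is a small Python analogy of the suspected behaviour. It does not show where (or whether) libplugin-pay.c makes such a copy; the `ChannelHint` class, the scid, and the amount are hypothetical stand-ins.

```python
import copy
from dataclasses import dataclass


@dataclass
class ChannelHint:
    scid: str
    estimated_capacity_msat: int


# Hints held by the root payment ("pointer A"). The scid is hypothetical.
root_hints = [ChannelHint("hypothetical-scid", 175_320_000)]
amount_msat = 100_000_200

# First attempt: the amount is subtracted through the root array (A).
root_hints[0].estimated_capacity_msat -= amount_msat

# Assumption: at some point a copy of the array exists ("pointer B").
copied_hints = copy.deepcopy(root_hints)

# Failure handling: the amount is added back, but to the copy (B).
copied_hints[0].estimated_capacity_msat += amount_msat

# Pathfinding reads the root array (A) again and sees the old value.
print(root_hints[0].estimated_capacity_msat)    # 75319800
print(copied_hints[0].estimated_capacity_msat)  # 175320000
```

If the retry reads A while the restore wrote to B, the second attempt would see 75319800msat even though the capacity was restored in time.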
Issue and Steps to Reproduce
I suspect the issue is a race somewhere in the pay plugin, which causes retry attempts to not find a route. Here's a test of a specific scenario we encountered recently. This test sometimes fails and sometimes succeeds. The setup is similar to the architecture at Breez.
The sender is connected to a router, and the router is connected to the destination. This is the path we'd like to find. The router, however, is also connected to a random other node, which is connected to the destination. This path has low fees, so it will initially be preferred. The payment will fail to find the second route.
log output
See here that after the first failed attempt, the second part immediately fails to find a route. The payment is then split again and again, but the second try really should have been able to find a route over one of the routing nodes.