fiatjaf opened this issue 3 years ago
Another channel closed because of that.
{
"timestamp": "2021-05-06T05:38:13.939Z",
"old_state": "CHANNELD_NORMAL",
"new_state": "AWAITING_UNILATERAL",
"cause": "protocol",
"message": "Fulfilled HTLC 83 SENT_REMOVE_COMMIT cltv 682227 hit deadline"
}
This is in GMT+2 time, which means this happened 10 hours and 13 minutes before the actual block 682227: https://blockstream.info/block/0000000000000000000babe2f64bc91a482e0b3f17326c68a6759a2904852b3e
In order for the HTLC to be correctly settled on-chain following a closure we have to have the HTLC output confirm on-chain before the timeout actually expires (notice the different wording between deadline and timeout). This is why the closure is initiated before the timeout expires.
The problem here seems to be that you are using an invoice going to a virtual node behind your own node and terminating it via the htlc_accepted hook, which is OK; however, the way you indicate that the sender should route through you is via the route hints, which only ask for 9 blocks of CLTV delta. This, afaik, is less than the timeout-deadline difference we require to safely keep the HTLC pending, forcing us to close the channel. If this were a payment destined for us, then lightningd would know that we'll settle it right away and doesn't have to go on-chain right this moment, but the external handling through the plugin prevents that, and normal forwarding rules are applied.
TL;DR: increase the CLTV delta in the route hints to something that is larger than the timeout-deadline difference.
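For illustration, this is roughly what the last-hop route hint in the invoice could look like with a larger delta. A minimal sketch only: the field names mirror the BOLT 11 routing-hint fields, and the values (pubkey, short channel id, 144 blocks) are placeholders, not the actual lntxbot setup:

```python
# Hypothetical sketch of a BOLT 11 route hint for the virtual last hop.
# Field names mirror the BOLT 11 `r` field; all values are placeholders.
route_hint = [{
    "pubkey": "02aa...",                   # the virtual node "behind" the real node
    "short_channel_id": "0x0x0",           # fake channel for the phantom hop
    "fee_base_msat": 0,
    "fee_proportional_millionths": 0,
    "cltv_expiry_delta": 144,              # was 9; keep it above the timeout-deadline gap
}]
```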
How much is the timeout-deadline difference? I thought this could be related after the first time I spotted this issue, so I then increased it to 40, but the issue keeps happening. Gonna set it to 144 now.
Still, I don't get the full rationale for this: shouldn't those 9 (or 40) be added to the time required by lightningd itself? Since that extra hop doesn't exist I thought it could even have a CLTV delta of zero.
https://github.com/ElementsProject/lightning/blob/master/lightningd/peer_htlcs.c#L2306-L2319
From this shouldn't it follow that for a CLTV delta of 9 it should wait until there are 5 blocks left to close the channel? And with 40 that should be 20?
So if I increase the CLTV delta it will close channels sooner?
In the examples above isn't it closing the channels like 600 blocks before the timeout? I'm confused.
I'll need to reproduce this in a test, but it seems a bit strange indeed. It's 60 blocks in 10 hours by the way, not 600 🙂
Oops.
It seems Rusty has found a corner case in our state machine, in which we don't cancel a timeout timer despite it not being required anymore. Will keep you posted :+1:
No, that's different: we kill channeld for that, not the connection.
Here's the code you're hitting:
/* BOLT #2:
 *
 * 3. the deadline for received HTLCs this node has fulfilled: the deadline
 * after which the channel has to be failed and the HTLC fulfilled on-chain
 * before its `cltv_expiry`. See steps 4-7 above, which imply a deadline of
 * `2R+G+S` blocks before `cltv_expiry`: 18 blocks is reasonable.
 */
/* We approximate this, by using half the cltv_expiry_delta (3R+2G+2S),
 * rounded up. */
static u32 htlc_in_deadline(const struct lightningd *ld,
                            const struct htlc_in *hin)
{
        return hin->cltv_expiry - (ld->config.cltv_expiry_delta + 1)/2;
}
By default, this config is 34, meaning we get upset 18 blocks before the HTLC expires. But note that this is also our default final_htlc expiry, which I think leaves us with a race if a block comes in.
@cdecker offered to test, so I'll leave that to him.
I need to go back to the BOLT 2 math; I think that we should allow at least 3 blocks in this case (implying that we use min((ld->config.cltv_expiry_delta + 1)/2, ld->config.final_htlc_expiry - 3)), and insist that config.final_htlc_expiry be > 3?
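To make the arithmetic concrete, here is a rough sketch (not lightningd code) of the current deadline calculation from htlc_in_deadline() above, plus the clamp proposed here; the delta of 40 and final expiry of 18 are example values only:

```python
# Illustrative sketch of the deadline arithmetic, using example values.
def htlc_in_deadline(cltv_expiry, cltv_expiry_delta):
    # current behaviour: back off by roughly half the delta, rounded up
    return cltv_expiry - (cltv_expiry_delta + 1) // 2

def htlc_in_deadline_proposed(cltv_expiry, cltv_expiry_delta, final_htlc_expiry):
    # proposed: never back off by more than final_htlc_expiry - 3, so an HTLC
    # accepted at the last allowed height still has ~3 blocks before the deadline
    margin = min((cltv_expiry_delta + 1) // 2, final_htlc_expiry - 3)
    return cltv_expiry - margin

print(htlc_in_deadline(682227, 40))                # 682207, i.e. 20 blocks before expiry
print(htlc_in_deadline_proposed(682227, 40, 18))   # 682212, i.e. 15 blocks before expiry
```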
cltv-delta=144
cltv-final=144
I have these in my settings. I now realize I set those without knowing what they actually did.
Maybe the problem was that I was generating my own invoices with a CLTV smaller than 144 manually? But then the HTLC should have been rejected anyway if the CLTV delta was smaller than 144, right?
Huh, yeah. We should reject such payments.
And those settings remove my theory though. That gives you 71 blocks before we will consider the deadline. Hmm...
I have spent a couple of hours to try and track this one down, but was unable to reproduce it. Here's my test:
@unittest.skipIf(not DEVELOPER, "Needs quick bitcoind sync")
def test_htlc_deadline_close(node_factory, executor, bitcoind):
    """Reproduces #4506: `htlc_accepted` returns but channel gets killed.
    """
    # Use the hold_htlcs.py plugin to hold while we generate blocks,
    # closing in on the deadline.
    diff1 = 9
    diff2 = 1
    plugin = os.path.join(os.path.dirname(__file__), 'plugins/shortcircuit.py')
    l1, l2, l3, l4 = node_factory.line_graph(4, opts=[
        {},
        {},
        {'plugin': plugin},  # l3 accepts on behalf of l4
        {},  # l4 is just the invoice creator, will be shut off afterwards
    ], wait_for_announce=True)

    blockheight = l1.rpc.getinfo()['blockheight']
    ph = l4.rpc.invoice('1000sat', 'lbl', 'desc', preimage="00" * 32)['payment_hash']
    l4.stop()  # We don't need the real recipient anymore.

    def no_new_block(*args, **kwargs):
        return {
            "error": {
                "code": -8,
                "message": "Block height out of range"
            }
        }

    # Pin l1 at height 108
    l1.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Increase the blockheight diff between l1 and l2, then pin l2
    bitcoind.generate_block(diff1)
    l2.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff1))
    l2.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Finally increase the blockheight diff between l2 and l3
    bitcoind.generate_block(diff2)
    l3.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff1 + diff2))

    # Now we have the full setup, start the payment that should be
    # passed to the plugin, and hopefully kills the channel even
    # though we resolve immediately.
    route = l1.rpc.getroute(l4.info['id'], '1000sat', 100)['route']
    from pprint import pprint
    pprint(route)
    l1.rpc.sendpay(route, ph)
    l1.rpc.waitsendpay(ph)

    # l2 and l3 should be happy enough
    assert not l2.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')
    assert not l3.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')

    # All channels should still be active
    assert l3.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'
    assert [p['channels'][0]['state'] for p in l2.rpc.listpeers()['peers']] == ['CHANNELD_NORMAL'] * 2
    assert l1.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'
The intuition is that some blockheight difference is to blame for the immediate kill of the channel. While there are some subtleties in how the test behaves with vs. without the plugin in the line_graph call, they are only due to the fact that we don't really forward with the plugin; otherwise they behave identically.
You can check out the behavior by modifying diff1 and diff2, which are the differences in blockheight between (l1, l2) and (l2, l3). As long as diff1 < 16 we're still good with the plugin; without the plugin it's diff1 + diff2 < 9, since that's the last-hop CLTV delta.
The only thing that this doesn't check is when a block is processed by l3, crossing the threshold, while the HTLC is being handled by l3. Presumably the block being processed would trigger the deadline. So I'll try to see if we can make that happen as well.
Wasn't able to reproduce this even when pulling the blockheights up to the maximum after which each hop would return expiry too soon, and then generating blocks while we are holding onto the HTLC. What I did manage is to get l2 (the hop forwarding the HTLC to the node that is terminating the payment with a plugin) to kill the channel if I allow it to process the new blocks:
Offered HTLC 0 SENT_ADD_ACK_REVOCATION cltv 124 hit deadline
Notice that this is different from the OP issue:
Fulfilled HTLC 1233 SENT_REMOVE_COMMIT cltv 681772 hit deadline
The only thing I can imagine is that we were even more unlucky and the block arrived at the node between it marking the HTLC as fulfilled internally and committing the change.
This is the test code I used to replicate the above:
@unittest.skipIf(not DEVELOPER, "Needs quick bitcoind sync")
def test_htlc_deadline_close(node_factory, executor, bitcoind):
    """Reproduces #4506: `htlc_accepted` returns but channel gets killed.
    """
    # Use the hold_htlcs.py plugin to hold while we generate blocks,
    # closing in on the deadline.
    diff1 = 6
    diff2 = 0
    diff3 = 9
    terminate = os.path.join(os.path.dirname(__file__), 'plugins/shortcircuit.py')
    hold = os.path.join(os.path.dirname(__file__), 'plugins/hold_htlcs.py')
    l1, l2, l3, l4 = node_factory.line_graph(4, wait_for_announce=True)
    l3.rpc.plugin('start', hold)
    l3.rpc.plugin('start', terminate)

    blockheight = l1.rpc.getinfo()['blockheight']
    ph = l4.rpc.invoice('1000sat', 'lbl', 'desc', preimage="00" * 32)['payment_hash']

    def no_new_block(*args, **kwargs):
        return {
            "error": {
                "code": -8,
                "message": "Block height out of range"
            }
        }

    # Pin l1 at height 108
    l1.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Last hop is most restrictive so set that first
    bitcoind.generate_block(diff3)
    l4.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff3))
    l4.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    bitcoind.generate_block(diff2)
    l3.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff3 + diff2))
    l3.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Increase the blockheight diff between l1 and l2, then pin l2
    bitcoind.generate_block(diff1)
    l2.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff3 + diff2 + diff1))
    l2.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Now we have the full setup, start the payment that should be
    # passed to the plugin, and hopefully kills the channel even
    # though we resolve immediately.
    route = l1.rpc.getroute(l4.info['id'], '1000sat', 100)['route']
    from pprint import pprint
    pprint(route)
    l1.rpc.sendpay(route, ph)

    l3.daemon.wait_for_log(r'Calling htlc_accepted hook of plugin hold_htlcs.py')
    bitcoind.generate_block(10)
    l2.daemon.rpcproxy.mock_rpc('getblockhash', None)  # Unpin l2 so it catches up with the chain again
    l3.daemon.rpcproxy.mock_rpc('getblockhash', None)  # Unpin l3 so it catches up with the chain again
    l2.daemon.wait_for_log(r'Adding block')
    l3.daemon.wait_for_log(r'Adding block')
    l3.daemon.wait_for_log(r'Plugin shortcircuit.py returned on htlc_accepted hook call')
    l1.rpc.waitsendpay(ph)

    # l2 and l3 should be happy enough
    assert not l2.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')
    assert not l3.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')

    # All channels should still be active
    assert l3.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'
    assert [p['channels'][0]['state'] for p in l2.rpc.listpeers()['peers']] == ['CHANNELD_NORMAL'] * 2
    assert l1.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'
@rustyrussell & @fiatjaf: any idea what we can try next to replicate? It feels like this is extremely unlikely to happen no matter what, and I don't know what safeguards we could put in place to prevent this that would not end up hurting us more than helping (auto-rejecting HTLCs that are one block off the deadline seems heavy-handed, and likely even more confusing to plugin devs).
I don't know, my only guess is that it was related to something I'm doing with the fake route hint like you said first, although I couldn't find an explanation and I think it shouldn't happen.
There's something weird about the fact that when I increased the fake route hint CLTV delay to 144 (the same value I have in my config file for cltv-delta) these things stopped happening.
At some point I intend to try to come up with a nice reproducible setup like you did, but I can't do it now. Can we just leave things as they are for now until further notice?
I don't know, my only guess is that it was related to something I'm doing with the fake route hint like you said first, although I couldn't find an explanation and I think it shouldn't happen.
Yeah, I was wondering about that too, so I decided to replicate your setup as faithfully as I could, with l4 acting as the stand-in for the virtual node you are terminating for (so I wouldn't have to implement fake invoice signing :wink:). I tried both with and without that node, and quite a few parameters, but either the previous node would reject the HTLC or the node would happily accept it, even in really skewed scenarios, i.e., dozens of blocks of difference.
There's something weird about the fact that when I increased the fake route hint CLTV delay to 144 (the same value I have in my config file for cltv-delta) these things stopped happening.
Just guessing here, but it might have decreased the probability of the issue happening by giving nodes some more slack and moving the deadline height further out.
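A back-of-the-envelope sketch of that slack argument, assuming the sender simply adds the hint's delta on top of the invoice's final CLTV; the numbers mirror the config above (cltv-delta=144, cltv-final=144) and the OP's blockheight, and are illustrative only:

```python
# Rough sketch of the "more slack" intuition; values are illustrative.
def incoming_deadline(height, final_cltv, hint_delta, our_cltv_delta):
    cltv_expiry = height + final_cltv + hint_delta    # expiry the sender computes
    return cltv_expiry - (our_cltv_delta + 1) // 2    # same shape as htlc_in_deadline()

h = 681751
print(incoming_deadline(h, 144, 9, 144) - h)     # ~81 blocks of headroom with a 9-block hint
print(incoming_deadline(h, 144, 144, 144) - h)   # ~216 blocks with the 144-block hint
```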
Going to mark this as needinfo for now, so we don't lose track of this.
Issue and Steps to Reproduce
The logs below mix lines from lightningd and lines from lntxbot to show that the invoice was created (I've removed personal information from the first line, but it's the same invoice), then its payment was received, and then in the same minute the invoice was paid and then immediately marked as "hit deadline" and the channel was closed.
The weird part is that the logs say there that the CLTV is for block 681772 but at the time that happened we were still in block 681751.
Full invoice:
lnbc5u1psfqj4mpp50fkw8sz0mlmwlh4z66gvlalfwk09rd5qxwjsl7547aesc8w0z30sdpvf389qcte9e3k7gzxv96kxet58gsyysjzyprxzatrv46qxq9p5hsqrzjqtqkejjy2c44jrwj08y5ygqtmn8af7vscwnflttzpsgw7tuz9r4077w0nxgh2vk4uyqqqqqqqqqqqqqqpysp5qypqxpq9qcrsszg2pvxq6rs0zqg3yyc5z5tpwxqergd3c8g7rusq9qypqsqug728gtca43f8cxmausee6p55ta6hs72sy8t75tasylcj686hwxp5fuqa2asrjuqdg2d24tt2dr5nhkds6aqlhe6m36lmn03medrmhcpwxgwa5
Logs from the channel peer (he says these happened immediately after UpdateFulfillHTLC for that invoice):
getinfo output