ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Payment received and immediately hit deadline, channel closed #4506

Open fiatjaf opened 3 years ago

fiatjaf commented 3 years ago

Issue and Steps to Reproduce

The logs below mix lines from lightningd and lines from lntxbot to show that the invoice was created (I've removed personal information from the first line, but it's the same invoice), then its payment was received, and then, in the same minute, the invoice was paid, immediately marked as "hit deadline", and the channel was closed.

lntxbot 9:10PM DBG generating invoice msats=500000
plugin-lntxbot got HTLC. amount=500000msat short_channel_id=7983001x9532722x54753 hash=7a6ce3c04fdff6efdea2d690cff7e9759e51b68033a50ffa95f7730c1dcf145f
plugin-lntxbot invoice received. we have a preimage: abc659a628de9f5e76e2122e3edf50898f4dde2d4a8b8cf37ae0bc824c5f83a3 - resolve
2021-05-03T19:10:53.454Z UNUSUAL 02f40890af885da4673f0ee9725ee74bb2c66d6491cc4334056a2701057993e61d-chan#5333: Peer permanent failure in CHANNELD_NORMAL: Fulfilled HTLC 1233 SENT_REMOVE_COMMIT cltv 681772 hit deadline
2021-05-03T19:10:53.516Z INFO    02f40890af885da4673f0ee9725ee74bb2c66d6491cc4334056a2701057993e61d-chan#5333: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL

The weird part is that the logs say the CLTV is for block 681772, but at the time this happened we were still at block 681751.

Full invoice: lnbc5u1psfqj4mpp50fkw8sz0mlmwlh4z66gvlalfwk09rd5qxwjsl7547aesc8w0z30sdpvf389qcte9e3k7gzxv96kxet58gsyysjzyprxzatrv46qxq9p5hsqrzjqtqkejjy2c44jrwj08y5ygqtmn8af7vscwnflttzpsgw7tuz9r4077w0nxgh2vk4uyqqqqqqqqqqqqqqpysp5qypqxpq9qcrsszg2pvxq6rs0zqg3yyc5z5tpwxqergd3c8g7rusq9qypqsqug728gtca43f8cxmausee6p55ta6hs72sy8t75tasylcj686hwxp5fuqa2asrjuqdg2d24tt2dr5nhkds6aqlhe6m36lmn03medrmhcpwxgwa5

Logs from the channel peer (he says these happened immediately after UpdateFulfillHTLC for that invoice):

2021-05-03 20:10:53.468 [INF] PEER: disconnecting 02c16cca44562b590dd279c942200bdccfd4f990c3a69fad620c10ef2f8228eaff@5.2.67.89:9735, reason: read handler closed
2021-05-03 20:10:53.669 [DBG] SRVR: Notifying that peer 02c16cca44562b590dd279c942200bdccfd4f990c3a69fad620c10ef2f8228eaff@5.2.67.89:9735 is offline
2021-05-03 20:10:53.425 [DBG] PEER: Received CommitSig(chan_id=6a3dfb1e4f73e0bf48b92daf1d2258417d46ea60e7c09124d6671ff4a8673cbd, num_htlcs=0) from 02c16cca44562b590dd279c942200bdccfd4f990c3a69fad620c10ef2f8228eaff@5.2.67.89:9735
2021-05-03 20:10:54.732 [DBG] BTCN: Connected to 02c16cca44562b590dd279c942200bdccfd4f990c3a69fad620c10ef2f8228eaff@5.2.67.89:9735 (reqid 2329)

getinfo output

{
   "id": "02c16cca44562b590dd279c942200bdccfd4f990c3a69fad620c10ef2f8228eaff",
   "alias": "@lntxbot",
   "color": "296683",
   "num_peers": 31,
   "num_pending_channels": 1,
   "num_active_channels": 27,
   "num_inactive_channels": 1,
   "address": [
      {
         "type": "ipv4",
         "address": "5.2.67.89",
         "port": 9735
      }
   ],
   "binding": [
      {
         "type": "ipv4",
         "address": "0.0.0.0",
         "port": 9735
      }
   ],
   "version": "v0.10.0-38-gb366453",
   "blockheight": 681771,
   "network": "bitcoin",
   "msatoshi_fees_collected": 33130087,
   "fees_collected_msat": "33130087msat",
   "lightning-dir": "/home/fiatjaf/.lightning1/bitcoin"
}
fiatjaf commented 3 years ago

Another channel closed because of that.

{
  "timestamp": "2021-05-06T05:38:13.939Z",
  "old_state": "CHANNELD_NORMAL",
  "new_state": "AWAITING_UNILATERAL",
  "cause": "protocol",
  "message": "Fulfilled HTLC 83 SENT_REMOVE_COMMIT cltv 682227 hit deadline"
}

This is in GMT+2 time, which means this happened 10 hours and 13 minutes before the actual block 682227: https://blockstream.info/block/0000000000000000000babe2f64bc91a482e0b3f17326c68a6759a2904852b3e

cdecker commented 3 years ago

In order for the HTLC to be correctly settled on-chain following a closure, we must have the HTLC output confirm on-chain before the timeout actually expires (notice the different wording: deadline vs. timeout). This is why the closure is initiated before the timeout expires.

The problem here seems to be that you are using an invoice pointing to a virtual node behind your own node and terminating it via the htlc_accepted hook, which is okay. However, the way you indicate that the sender should route through you is via the route hints, which only ask for a CLTV delta of 9 blocks. This, afaik, is less than the timeout - deadline difference we require to safely keep the HTLC pending, forcing us to close the channel. If this were a payment destined for us, then lightningd would know that we'll settle it right away and wouldn't have to go on-chain right this moment, but the external handling through the plugin prevents that, and the normal forwarding rules are applied.

TL;DR: increase the CLTV delta in the route hints to something larger than the timeout - deadline difference.
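
For illustration only (this is not something lightningd does automatically): a plugin that terminates payments via htlc_accepted could defensively fail HTLCs that arrive with too little CLTV headroom instead of fulfilling them. A minimal sketch, assuming pyln-client; MIN_HEADROOM is a made-up margin, not a value lightningd exposes:

#!/usr/bin/env python3
"""Sketch only: fail incoming HTLCs that leave too little CLTV headroom,
instead of fulfilling them and risking a unilateral close.
MIN_HEADROOM is a hypothetical margin, not a value taken from lightningd."""
from pyln.client import Plugin

plugin = Plugin()
MIN_HEADROOM = 18  # blocks; should comfortably exceed the timeout - deadline gap


@plugin.hook("htlc_accepted")
def on_htlc_accepted(htlc, onion, plugin, **kwargs):
    # cltv_expiry_relative = blocks left until this HTLC's absolute expiry.
    if htlc["cltv_expiry_relative"] < MIN_HEADROOM:
        # 0x2002 = temporary_node_failure: ask the sender to retry later.
        return {"result": "fail", "failure_message": "2002"}
    # Otherwise hand it back to lightningd (lntxbot would resolve with the
    # preimage here instead).
    return {"result": "continue"}


plugin.run()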

fiatjaf commented 3 years ago

How large is the timeout - deadline difference? I thought this could be related after the first time I spotted this issue, so I increased it to 40, but the issue keeps happening. I'm going to set it to 144 now.

Still, I don't get the full rationale for this: shouldn't those 9 (or 40) blocks be added to the time required by lightningd itself? Since that extra hop doesn't exist, I thought it could even have a CLTV delta of zero.

fiatjaf commented 3 years ago

https://github.com/ElementsProject/lightning/blob/master/lightningd/peer_htlcs.c#L2306-L2319

From this, shouldn't it follow that for a CLTV delta of 9 it should wait until there are 5 blocks left before closing the channel? And with 40, that should be 20?

So if I increase the CLTV delta it will close channels sooner?

In the examples above isn't it closing the channels like 600 blocks before the timeout? I'm confused.

cdecker commented 3 years ago

I'll need to reproduce this in a test, but it seems a bit strange indeed. It's 60 blocks in 10 hours by the way, not 600 🙂

fiatjaf commented 3 years ago

Oops.

cdecker commented 3 years ago

It seems Rusty has found a corner case in our state machine, in which we don't cancel a timeout timer despite it not being required anymore. Will keep you posted :+1:

rustyrussell commented 3 years ago

No, that's different: we kill channeld for that, not the connection.

Here's the code you're hitting:

/* BOLT #2:
 *
 * 3. the deadline for received HTLCs this node has fulfilled: the deadline
 * after which the channel has to be failed and the HTLC fulfilled on-chain
 * before its `cltv_expiry`. See steps 4-7 above, which imply a deadline of
 * `2R+G+S` blocks before `cltv_expiry`: 18 blocks is reasonable.
 */
/* We approximate this, by using half the cltv_expiry_delta (3R+2G+2S),
 * rounded up. */
static u32 htlc_in_deadline(const struct lightningd *ld,
                const struct htlc_in *hin)
{
    return hin->cltv_expiry - (ld->config.cltv_expiry_delta + 1)/2;
}

By default, this config is 34, meaning we get upset 18 blocks before the HTLC expires. But note that this is also our default final_htlc expiry, which I think leaves us with a race if a block comes in.
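
To make the arithmetic concrete, here's a quick Python mirror of the formula above (the heights are only illustrative):

# Python mirror of htlc_in_deadline() above; heights purely illustrative.
def htlc_in_deadline(cltv_expiry: int, cltv_expiry_delta: int) -> int:
    """Height at which lightningd considers a fulfilled incoming HTLC past
    its deadline: roughly half the configured delta before its expiry."""
    return cltv_expiry - (cltv_expiry_delta + 1) // 2

print(htlc_in_deadline(681772, 144))  # 681700: deadline sits 72 blocks before expiry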

@cdecker offered to test, so I'll leave that to him.

I need to go back to the BOLT 2 math. I think we should allow at least 3 blocks in this case (implying that we use min((ld->config.cltv_expiry_delta + 1)/2, ld->config.final_htlc_expiry - 3)), and insist that config.final_htlc_expiry be > 3?

fiatjaf commented 3 years ago
cltv-delta=144                                                                            
cltv-final=144

I have these in my settings. I now realize I set those without knowing what they actually did.

Maybe the problem was that I was manually generating my own invoices with a CLTV smaller than 144? But then the HTLC should have been rejected anyway if the CLTV delta was smaller than 144, right?
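
For what it's worth, this is roughly how I plan to sanity-check the invoices I hand out (just a sketch using pyln-client; the rpc socket path, the decodepay field name, and the 144 threshold are assumptions from my setup):

# Rough sketch: check that an invoice asks for at least my cltv-final.
from pyln.client import LightningRpc

rpc = LightningRpc("/home/fiatjaf/.lightning1/bitcoin/lightning-rpc")

def invoice_cltv_ok(bolt11: str, cltv_final: int = 144) -> bool:
    """True if the decoded invoice's final CLTV is >= our cltv-final."""
    decoded = rpc.decodepay(bolt11)
    return decoded["min_final_cltv_expiry"] >= cltv_final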

rustyrussell commented 3 years ago

Huh, yeah. We should reject such payments.

And those settings rule out my theory, though. That gives you 71 blocks before we consider the deadline. Hmm...

cdecker commented 3 years ago

I have spent a couple of hours trying to track this one down, but was unable to reproduce it. Here's my test:

@unittest.skipIf(not DEVELOPER, "Needs quick bitcoind sync")
def test_htlc_deadline_close(node_factory, executor, bitcoind):
    """Reproduces #4506: `htlc_accepted` returns but channel gets killed.
    """
    # Use the shortcircuit.py plugin so l3 terminates the payment on behalf
    # of l4, while we pin the nodes at different blockheights.
    diff1 = 9
    diff2 = 1

    plugin = os.path.join(os.path.dirname(__file__), 'plugins/shortcircuit.py')
    l1, l2, l3, l4 = node_factory.line_graph(4, opts=[
        {},
        {},
        {'plugin': plugin},  # l3 accepts on behalf of l4
        {},  # l4 is just the invoice creator, will be shut off afterwards
    ], wait_for_announce=True)
    blockheight = l1.rpc.getinfo()['blockheight']

    ph = l4.rpc.invoice('1000sat', 'lbl', 'desc', preimage="00" * 32)['payment_hash']
    l4.stop()  # We don't need the real recipient anymore.

    def no_new_block(*args, **kwargs):
        return {
            "error": {
                "code": -8,
                "message": "Block height out of range"
            }
        }

    # Pin l1 at height 108
    l1.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Increase the blockheight diff between l1 and l2, then pin l2
    bitcoind.generate_block(diff1)
    l2.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff1))
    l2.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Finally increase the blockheight diff between l2 and l3
    bitcoind.generate_block(diff2)
    l3.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff1 + diff2))

    # Now we have the full setup, start the payment that should be
    # passed to the plugin, and hopefully kills the channel even
    # though we resolve immediately.
    route = l1.rpc.getroute(l4.info['id'], '1000sat', 100)['route']
    from pprint import pprint
    pprint(route)

    l1.rpc.sendpay(route, ph)
    l1.rpc.waitsendpay(ph)

    # l2 and l3 should be happy enough
    assert not l2.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')
    assert not l3.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')

    # All channels should still be active
    assert l3.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'
    assert [p['channels'][0]['state'] for p in l2.rpc.listpeers()['peers']] == ['CHANNELD_NORMAL'] * 2
    assert l1.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'

The intuition is that some blockheight difference is to blame for the immediate kill of the channel. While there are some subtleties in how the test behaves with vs. without the plugin in the line_graph call, they are only due to the fact that we don't really forward with the plugin; otherwise they behave identically.

You can check out the behavior by modifying diff1 and diff2, which are the blockheight differences between (l1, l2) and (l2, l3). With the plugin we're still good as long as diff1 < 16; without the plugin we need diff1 + diff2 < 9, since that's the last-hop CLTV delta.

The only thing this doesn't check is a block being processed by l3, crossing the threshold, while the HTLC is being handled by l3. Presumably the block being processed would trigger the deadline, so I'll try to see if we can make that happen as well.

cdecker commented 3 years ago

Wasn't able to reproduce this even when pulling the blockheights up to the maximum after which each hop would return "expiry too soon", and then generating blocks while we are holding onto the HTLC. What I did manage is to get l2 (the hop forwarding the HTLC to the node that is terminating the payment with a plugin) to kill the channel if I allow it to process the new blocks:

Offered HTLC 0 SENT_ADD_ACK_REVOCATION cltv 124 hit deadline

Notice that this is different from the OP issue:

Fulfilled HTLC 1233 SENT_REMOVE_COMMIT cltv 681772 hit deadline

The only thing I can imagine is that we were even more unlucky and the block arrived at the node between it marking the HTLC as fulfilled internally and committing the change.

This is the test code I used to replicate the above:

@unittest.skipIf(not DEVELOPER, "Needs quick bitcoind sync")
def test_htlc_deadline_close(node_factory, executor, bitcoind):
    """Reproduces #4506: `htlc_accepted` returns but channel gets killed.
    """
    # Use the hold_htlcs.py plugin to hold while we generate blocks,
    # closing in on the deadline.
    diff1 = 6
    diff2 = 0
    diff3 = 9

    terminate = os.path.join(os.path.dirname(__file__), 'plugins/shortcircuit.py')
    hold = os.path.join(os.path.dirname(__file__), 'plugins/hold_htlcs.py')
    l1, l2, l3, l4 = node_factory.line_graph(4, wait_for_announce=True)
    l3.rpc.plugin('start', hold)
    l3.rpc.plugin('start', terminate)

    blockheight = l1.rpc.getinfo()['blockheight']

    ph = l4.rpc.invoice('1000sat', 'lbl', 'desc', preimage="00" * 32)['payment_hash']

    def no_new_block(*args, **kwargs):
        return {
            "error": {
                "code": -8,
                "message": "Block height out of range"
            }
        }

    # Pin l1 at height 108
    l1.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Last hop is most restrictive so set that first
    bitcoind.generate_block(diff3)
    l4.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff3))
    l4.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    bitcoind.generate_block(diff2)
    l3.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff3 + diff2))
    l3.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Increase the blockheight diff between l1 and l2, then pin l2
    bitcoind.generate_block(diff1)
    l2.daemon.wait_for_log(r'Adding block {}'.format(blockheight + diff3 + diff2 + diff1))
    l2.daemon.rpcproxy.mock_rpc('getblockhash', no_new_block)

    # Now we have the full setup, start the payment that should be
    # passed to the plugin, and hopefully kills the channel even
    # though we resolve immediately.
    route = l1.rpc.getroute(l4.info['id'], '1000sat', 100)['route']
    from pprint import pprint
    pprint(route)

    l1.rpc.sendpay(route, ph)

    l3.daemon.wait_for_log(r'Calling htlc_accepted hook of plugin hold_htlcs.py')
    bitcoind.generate_block(10)
    l2.daemon.rpcproxy.mock_rpc('getblockhash', None)  # Unpin l2 so it catches up with the chain again
    l3.daemon.rpcproxy.mock_rpc('getblockhash', None)  # Unpin l3 so it catches up with the chain again
    l2.daemon.wait_for_log(r'Adding block')
    l3.daemon.wait_for_log(r'Adding block')
    l3.daemon.wait_for_log(r'Plugin shortcircuit.py returned on htlc_accepted hook call')

    l1.rpc.waitsendpay(ph)

    # l2 and l3 should be happy enough
    assert not l2.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')
    assert not l3.daemon.is_in_log(r'Expiry cltv [0-9]+ too close to current [0-9]+')

    # All channels should still be active
    assert l3.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'
    assert [p['channels'][0]['state'] for p in l2.rpc.listpeers()['peers']] == ['CHANNELD_NORMAL'] * 2
    assert l1.rpc.listpeers()['peers'][0]['channels'][0]['state'] == 'CHANNELD_NORMAL'
cdecker commented 3 years ago

@rustyrussell & @fiatjaf: any idea what we can try next to replicate this? It feels like this is extremely unlikely to happen no matter what, and I don't know what safeguards we could put in place to prevent it that would not end up hurting us more than helping (auto-rejecting HTLCs that are one block off the deadline seems heavy-handed, and likely even more confusing to plugin devs).

fiatjaf commented 3 years ago

I don't know; my only guess is that it was related to something I'm doing with the fake route hint, like you said at first, although I couldn't find an explanation and I think it shouldn't happen.

There's something weird about the fact that when I increased the fake route hint CLTV delay to 144 (the same value I have in my config file for cltv-delta) these things stopped happening.

At some point I intend to try to come up with a nice reproducible setup like you did, but I can't do it now. Can we just leave things as they are for now until further notice?

cdecker commented 3 years ago

I don't know; my only guess is that it was related to something I'm doing with the fake route hint, like you said at first, although I couldn't find an explanation and I think it shouldn't happen.

Yeah, I was wondering about that too, so I decided to replicate your setup as faithfully as I could, with l4 acting as the stand-in for the virtual node you are terminating for (so I wouldn't have to implement fake invoice signing :wink:). I tried both with and without that node, and quite a few parameters, but either the previous node would reject the HTLC or the node would happily accept it, even in really skewed scenarios, i.e., dozens of blocks of difference.

There's something weird about the fact that when I increased the fake route hint CLTV delay to 144 (the same value I have in my config file for cltv-delta) these things stopped happening.

Just guessing here, but it might have decreased the probability of the issue happening by giving nodes some more slack and moving the deadline height further out.

Going to mark this as needinfo for now, so we don't lose track of this.