ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Document steps to take after emergency.recover #7023

Closed kingpalethe closed 9 months ago

kingpalethe commented 9 months ago

I am testing CLN's emergency.recover workflow.

To test it, I've made a channel from an LND node to a CLN node, and then abandoned the CLN node, keeping only the emergency.recover file....

So this is effectively a test of the emergency.recover workflow.

I've followed the directions here: https://docs.corelightning.org/docs/recovery#recovery

I've set up a new CLN node, with an empty database, and imported the hsm_secret and emergency.recover files.
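
Roughly what I did to put the two files in place (a sketch only, assuming the default ~/.lightning/bitcoin directory layout; the backup path is a placeholder):

# Sketch: copy the backed-up hsm_secret and emergency.recover into the new
# node's network directory before starting lightningd. Paths are assumptions
# based on the default ~/.lightning/bitcoin layout.
import os, shutil

backup_dir = "/path/to/backups"                      # wherever the two files were kept
lightning_dir = os.path.expanduser("~/.lightning/bitcoin")

os.makedirs(lightning_dir, exist_ok=True)
for name in ("hsm_secret", "emergency.recover"):
    shutil.copy(os.path.join(backup_dir, name), os.path.join(lightning_dir, name))
os.chmod(os.path.join(lightning_dir, "hsm_secret"), 0o400)   # keep the seed private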

After I run lightning-cli emergencyrecover, I saw the following logs from lightningd:

2024-01-30T02:39:16.705Z DEBUG   connectd: maybe_free_peer freeing peer!
2024-01-30T02:39:35.269Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-chan#1: Watching for funding txid: e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9
2024-01-30T02:39:35.269Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-chan#1: Watching for funding txid: e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9
2024-01-30T02:39:35.269Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-lightningd: Will try reconnect in 1 seconds
2024-01-30T02:39:35.286Z DEBUG   hsmd: Client: Received message 10 from client
2024-01-30T02:39:35.286Z DEBUG   hsmd: Client: Received message 28 from client
2024-01-30T02:39:35.287Z DEBUG   hsmd: Client: Received message 28 from client
2024-01-30T02:39:36.435Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds

And, on the command line, I got this response from lightning-cli emergencyrecover

{
   "stubs": [
      "a9c19d2a4f30bc021f9dd57e75db2c14cf4e0d455d81a7c4a7b746e9c7c9f9e2"
   ]
}

This looks promising: only one channel was tracked in the emergency.recover file, so seeing exactly one stub here makes sense.

I now move on to the last bullet in the directions:

"Wait until your peer force closes the channel and the node would automatically sweep the funds. This could take some time"

I'm stuck here, and wish there were a bit more documentation on what to do next.

I don't see any evidence (that I can find) on my LND node, to make me think a force-closure is in progress.
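
One way to check independently of either node is to ask bitcoind whether the channel's funding output has been spent; if it has, a close transaction is on chain. A rough sketch, assuming bitcoin-cli is reachable and using the funding txid from the lightningd logs above (output index 0):

import subprocess

# gettxout prints nothing once the output has been spent (or never existed),
# so an empty result here would mean a close transaction has hit the chain.
funding_txid = "e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9"
result = subprocess.run(
    ["bitcoin-cli", "gettxout", funding_txid, "0"],
    capture_output=True, text=True,
)
print("funding output still unspent" if result.stdout.strip() else "funding output spent")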

I do have this stub:

{
   "stubs": [
      "a9c19d2a4f30bc021f9dd57e75db2c14cf4e0d455d81a7c4a7b746e9c7c9f9e2"
   ]
}

But I'm not sure what I could do with that.
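
One thing that can be done with the stub is to map it back to the funding transaction: per BOLT #2, a channel_id is the funding txid in internal byte order (display order reversed) XORed with the funding output index in its last two bytes. With output index 0 that is a pure byte reversal, and the stub does reverse into the funding txid that lightningd says it is watching in the logs above. A quick sketch:

# Reversing the stub's bytes recovers the funding txid seen in the logs
# ("Watching for funding txid: e2f9..."); the output index here is 0, so no
# XOR step is needed.
stub = "a9c19d2a4f30bc021f9dd57e75db2c14cf4e0d455d81a7c4a7b746e9c7c9f9e2"
funding_txid = bytes.fromhex(stub)[::-1].hex()
print(funding_txid)  # e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9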

kingpalethe commented 9 months ago

Also -- I have some doubts. I imported the emergency.recover and hsm_secret onto a new node. I did not attempt to load the database from the "dead" node (as there are many warnings against this, and I want to test only the emergency.recover workflow). But -- I currently have 0 BTC on this node. Do I need BTC on this node in order to write the force-closure transaction to the blockchain? If so, this could perhaps be added as an additional bullet point here: https://docs.corelightning.org/docs/recovery#recovery

Does it matter WHO initiated the original channel? This channel was opened by the REMOTE side (my LND node), so in that case would the LND node pay the force-closure fees?

In the logs I saw 2024-01-30T02:39:35.269Z DEBUG 0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-chan#1: Watching for funding txid: e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9

This makes me think that CLN is waiting for someone to fund this transaction?

Also, is there a lightning-cli command that could show me forced-closes that are in progress? Would a forced closure initiated by emergencyrecover even be added to the (previously empty) database of my recover node?
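
For what it's worth, channel states can be listed with lightning-cli listpeers (as shown below) or, on recent CLN releases, lightning-cli listpeerchannels; a force close in progress would normally show a state such as AWAITING_UNILATERAL, FUNDING_SPEND_SEEN or ONCHAIN rather than CHANNELD_NORMAL. A small sketch via pyln-client (the RPC socket path is an assumption):

from pyln.client import LightningRpc

# Assumed default socket path; use listpeers instead on releases that lack
# the listpeerchannels command.
rpc = LightningRpc("/root/.lightning/bitcoin/lightning-rpc")
for ch in rpc.call("listpeerchannels")["channels"]:
    print(ch.get("peer_id"), ch.get("short_channel_id"), ch["state"])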

kingpalethe commented 9 months ago

I do see here: https://docs.corelightning.org/reference/lightning-emergencyrecover

There is some documentation, but it is extremely minimal.

In regard to the stubs:

RETURN VALUE
On success, an object is returned, containing:

stubs (array of hexs):
Each item is the channel ID of the channel successfully inserted

What does "inserted" mean here?

kingpalethe commented 9 months ago

If anyone else is trying to navigate this emergency.recover workflow, there is a (rather confusing) YouTube video made by the developer of this functionality: https://www.youtube.com/watch?v=T_IsZN2w7zY&t=2033s

Here you can see that he has used lightning-cli listpeers ... and in my case, that command produces this output:

root@PR2:~# lightning-cli listpeers
{
   "peers": [
      {
         "id": "0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e",
         "connected": false,
         "num_channels": 1,
         "features": "800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000888a52a1",
         "channels": [
            {
               "state": "CHANNELD_NORMAL",
               "feerate": {
                  "perkw": 253,
                  "perkb": 1012
               },
               "short_channel_id": "1x1x1",
               "direction": 0,
               "channel_id": "a9c19d2a4f30bc021f9dd57e75db2c14cf4e0d455d81a7c4a7b746e9c7c9f9e2",
               "funding_txid": "e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9",
               "funding_outnum": 0,
               "close_to_addr": "bc1qcqk9sk7f5yx7lymyddkp88xuax88rfgy0ys20m",
               "close_to": "0014c02c585bc9a10def93646b6c139cdce98e71a504",
               "private": true,
               "opener": "local",
               "alias": {
                  "local": "1x1x1",
                  "remote": "1x1x1"
               },
               "features": [
                  "option_static_remotekey"
               ],
               "funding": {
                  "local_funds_msat": 0,
                  "remote_funds_msat": 200000000,
                  "pushed_msat": 0
               },
               "to_us_msat": 0,
               "min_to_us_msat": 0,
               "max_to_us_msat": 0,
               "total_msat": 200000000,
               "fee_base_msat": 1000,
               "fee_proportional_millionths": 10,
               "dust_limit_msat": 0,
               "max_total_htlc_in_msat": 0,
               "their_reserve_msat": 0,
               "our_reserve_msat": 0,
               "spendable_msat": 0,
               "receivable_msat": 200000000,
               "minimum_htlc_in_msat": 0,
               "minimum_htlc_out_msat": 0,
               "maximum_htlc_out_msat": 0,
               "their_to_self_delay": 0,
               "our_to_self_delay": 0,
               "max_accepted_htlcs": 0,
               "state_changes": [],
               "status": [
                  "CHANNELD_NORMAL:Will attempt reconnect in 300 seconds"
               ],
               "in_payments_offered": 0,
               "in_offered_msat": 0,
               "in_payments_fulfilled": 0,
               "in_fulfilled_msat": 0,
               "out_payments_offered": 0,
               "out_offered_msat": 0,
               "out_payments_fulfilled": 0,
               "out_fulfilled_msat": 0,
               "htlcs": []
            }
         ]
      }
   ]
}

So -- it seems my node DOES have some kind of record of the channel that I am attempting to force-close from my emergency.recover....

But there are a lot of things here which create some doubt in my mind...

 "status": [
                  "CHANNELD_NORMAL:Will attempt reconnect in 300 seconds"
               ],
     "connected": false,

Does anyone know how I could get CLN to talk to the LND node, in order to force-close this?

kingpalethe commented 9 months ago

Still fighting with this. It's pretty sketchy that so much about emergency.recover seems to be undocumented. Seems like the kind of thing that n00bs would really need hand-holding with. I am tempted to give up now and just go back to LND-land.

I can tell (from the output of lightning-cli listpeers, shown above) that my node does know about the "stub" (what does that mean?) of the channel recovered from emergency.recover.

I can run this command, to manually connect to the node which opened the channel:

lightning-cli connect 0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e
{
   "id": "0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e",
   "features": "800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000888a52a1",
   "direction": "out",
   "address": {
      "type": "ipv4",
      "address": "164.92.106.32",
      "port": 9735
   }
}

I then see this in the CLN logs:

2024-01-30T13:28:02.108Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2024-01-30T13:28:02.132Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-connectd: Connected out, starting crypto
2024-01-30T13:28:02.165Z DEBUG   028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3-hsmd: Got WIRE_HSMD_ECDH_REQ
2024-01-30T13:28:02.166Z DEBUG   hsmd: Client: Received message 1 from client
2024-01-30T13:28:02.166Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-connectd: Connect OUT
2024-01-30T13:28:02.166Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-connectd: peer_out WIRE_INIT
2024-01-30T13:28:02.189Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-connectd: peer_in WIRE_INIT
2024-01-30T13:28:02.189Z DEBUG   lightningd: Calling peer_connected hook of plugin chanbackup
2024-01-30T13:28:02.189Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-gossipd: seeker: chosen as startup peer
2024-01-30T13:28:02.189Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-connectd: Handed peer, entering loop
2024-01-30T13:28:02.190Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-gossipd: seeker: starting gossip
2024-01-30T13:28:02.190Z DEBUG   lightningd: Plugin chanbackup returned from peer_connected hook call
2024-01-30T13:28:02.190Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-chan#1: Peer has reconnected, state CHANNELD_NORMAL: connecting subd
2024-01-30T13:28:02.190Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-chan#1: Telling connectd to send error 0011a9c19d2a4f30bc021f9dd57e75db2c14cf4e0d455d81a7c4a7b746e9c7c9f9e2001d57652063616e277420626520746f67657468657220616e796d6f72652e
2024-01-30T13:28:02.190Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
2024-01-30T13:28:02.190Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-connectd: peer_out WIRE_ERROR
2024-01-30T13:28:02.191Z DEBUG   plugin-clnrest.py: Notification: {'id': '0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e', 'direction': 'out', 'address': {'type': 'ipv4', 'address': '164.92.106.32', 'port': 9735}, 'connect': {'id': '0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e', 'direction': 'out', 'address': {'type': 'ipv4', 'address': '164.92.106.32', 'port': 9735}}}
2024-01-30T13:28:02.191Z DEBUG   connectd: drain_peer
2024-01-30T13:28:02.191Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-lightningd: peer_disconnect_done
2024-01-30T13:28:02.191Z DEBUG   0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e-lightningd: Will try reconnect in 300 seconds
2024-01-30T13:28:02.191Z DEBUG   plugin-funder: Cleaning up inflights for peer id 0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e
2024-01-30T13:28:02.192Z DEBUG   plugin-clnrest.py: Notification: {'id': '0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e', 'disconnect': {'id': '0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e'}}
2024-01-30T13:28:02.219Z DEBUG   connectd: maybe_free_peer freeing peer!
2024-01-30T13:28:06.477Z DEBUG   gossipd: seeker: no peers, waiting

The WIRE_ERROR gives me some doubt. My best guess here is that CLN is trying to tell the opening node (in this case running LND) that it should force-close the channel, but that the communication was not successful?
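
As an aside, the hex that CLN told connectd to send in the log above decodes as a standard BOLT #1 error message (2-byte type, 32-byte channel_id, 2-byte length, then the data), and the data is the string "We can't be together anymore." That suggests CLN is deliberately erroring on this channel to prompt the peer to force-close. A quick decode sketch:

# Decode the payload from "Telling connectd to send error 0011a9c1..." above,
# assuming the standard BOLT #1 error layout: type (u16), channel_id (32 bytes),
# len (u16), data.
payload = bytes.fromhex(
    "0011"
    "a9c19d2a4f30bc021f9dd57e75db2c14cf4e0d455d81a7c4a7b746e9c7c9f9e2"
    "001d"
    "57652063616e277420626520746f67657468657220616e796d6f72652e"
)
msg_type = int.from_bytes(payload[0:2], "big")          # 17 == error
channel_id = payload[2:34].hex()                        # matches the stub above
data_len = int.from_bytes(payload[34:36], "big")
print(payload[36:36 + data_len].decode("ascii"))        # We can't be together anymore.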

kingpalethe commented 9 months ago

OK, it does seem that CLN is attempting to communicate with the LND node (the opener of the channel), but there is an error which prevents successful communication. Here are the logs from the LND node; they appear a few milliseconds after I run lightning-cli connect 0322d0e43b3d92d30ed187f4e101a9a9605c3ee5fc9721e6dac3ce3d7732fbb13e on the CLN node:

lnd-1  | 2024-01-30 13:33:41.978 [INF] DISC: GossipSyncer(023361d515b2dcbdf121c59193956b3608175d3bdca60ebd6cbb968b16b5c6c435): applying new update horizon: start=2042-07-14 01:04:00 +0000 UTC, end=2042-07-14 01:04:00 +0000 UTC, backlog_size=0
lnd-1  | 2024-01-30 13:33:43.426 [INF] CRTR: Processed channels=0 updates=126 nodes=10 in last 59.99953644s
lnd-1  | 2024-01-30 13:33:43.557 [INF] SRVR: New inbound connection from 10.8.0.1:56252
lnd-1  | 2024-01-30 13:33:43.557 [INF] SRVR: Finalizing connection to 028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3@10.8.0.1:56252, inbound=true
lnd-1  | 2024-01-30 13:33:43.586 [INF] PEER: Peer(028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3): Loading ChannelPoint(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0), isPending=false
lnd-1  | 2024-01-30 13:33:43.586 [INF] HSWC: ChannelLink(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0): starting
lnd-1  | 2024-01-30 13:33:43.586 [INF] HSWC: Trimming open circuits for chan_id=827663:2474:0, start_htlc_id=0
lnd-1  | 2024-01-30 13:33:43.587 [INF] HSWC: Adding live link chan_id=a9c19d2a4f30bc021f9dd57e75db2c14cf4e0d455d81a7c4a7b746e9c7c9f9e2, short_chan_id=827663:2474:0
lnd-1  | 2024-01-30 13:33:43.587 [INF] PEER: Peer(028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3): Negotiated chan series queries
lnd-1  | 2024-01-30 13:33:43.587 [INF] NTFN: New block epoch subscription
lnd-1  | 2024-01-30 13:33:43.587 [INF] HSWC: ChannelLink(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0): HTLC manager started, bandwidth=189539000 mSAT
lnd-1  | 2024-01-30 13:33:43.587 [INF] HSWC: ChannelLink(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0): attempting to re-synchronize
lnd-1  | 2024-01-30 13:33:43.587 [INF] DISC: Creating new GossipSyncer for peer=028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3
lnd-1  | 2024-01-30 13:33:43.587 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0)
lnd-1  | 2024-01-30 13:33:43.587 [INF] PEER: Peer(028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3): unable to read message from peer: read next header: EOF
lnd-1  | 2024-01-30 13:33:43.588 [WRN] HSWC: ChannelLink(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0): error when syncing channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error
lnd-1  | 2024-01-30 13:33:43.588 [INF] PEER: Peer(028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3): disconnecting 028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3@10.8.0.1:56252, reason: read handler closed
lnd-1  | 2024-01-30 13:33:43.588 [INF] NTFN: Cancelling epoch notification, epoch_id=19031
lnd-1  | 2024-01-30 13:33:43.588 [ERR] HSWC: ChannelLink(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0): failing link: unable to synchronize channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error with error: unable to resume channel, recovery required
lnd-1  | 2024-01-30 13:33:43.588 [INF] HSWC: ChannelLink(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0): exited
lnd-1  | 2024-01-30 13:33:44.617 [INF] DISC: GossipSyncer(028056790768cdd3d415a228a1ee030896cb8858e59002bb1bd71807f768461ce3): applying new update horizon: start=1970-01-01 00:00:00 +0000 UTC, end=2106-02-07 06:28:15 +0000 UTC, backlog_size=180761

This is interesting: ChannelLink(e2f9c9c7e946b7a7c4a7815d450d4ecf142cdb757ed59d1f02bc304f2a9dc1a9:0): failing link: unable to synchronize channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error with error: unable to resume channel, recovery required

I'm not sure how to interpret that. Yes, I guess it is true that "recovery" is required, so that part seems right. But "first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error with error: ..." -- is this some kind of miscommunication between CLN and LND?

kingpalethe commented 9 months ago

I'm closing this. It seems that there is a failure of communication between CLN and LND with regard to the lightning-cli emergencyrecover command. I had previously tested the LND equivalent -- "SCB" (https://docs.lightning.engineering/lightning-network-tools/lnd/recovery-planning-for-failure#static-channel-backups-scb) -- and that worked well, but that was between two LND nodes, not between a CLN node and an LND node.

I'm not sure if the incompatibility is CLN's fault or LND's fault. But -- my test indicates that there may be some danger for CLN node-runners in the current environment, where the majority of nodes are LND. I think the lesson for less-technical users might be to stick with LND if you think there is a risk you might need to use this emergency.recover workflow.