fail to force close channel when received SCB from LND who keeps terminating connection because they don't understand our message

btweenthebars commented 1 year ago

Issue and Steps to Reproduce

I have described the incident here on LND repo

It looks to me that it's half our fault not force closing the channel when we know that the revocation numbers don't match. This is shown on the listpeers:

               "state_changes": [
                  {
                     "timestamp": "2022-11-06 08:47:17.578Z",
                     "old_state": "CHANNELD_AWAITING_LOCKIN",
                     "new_state": "CHANNELD_NORMAL",
                     "cause": "remote",
                     "message": "Lockin complete"
                  }
               ],
               "status": [
                  "CHANNELD_NORMAL:bad reestablish revocation_number: 477 vs 483",
                  "CHANNELD_NORMAL:Will attempt reconnect in 300 seconds"
               ],

I was able to force close it manually

`getinfo` output

"version": "v22.11.1"

cdecker commented 1 year ago

There was a change in the ln spec a while ago, in which Eclair asked us not to force close automatically if we detect they're lagging behind (allowing for a last ditch attempt at maybe restoring a more up to date backup).

So I'm tempted to say that we're correct and that we should close only when asked further.

cdecker commented 1 year ago

Found the LN spec issue here: https://github.com/lightning/bolts/issues/934 and the local tracking issue here https://github.com/ElementsProject/lightning/issues/4927

This suggests that lnd should be sending an error in reply to our warning so we know we should be force closing.

whitslack commented 1 year ago

Yep. I've been contacted several times now (thrice, I think) by node operators who have restored a static channel backup and don't understand why my node is not force-closing when they connect to it and present an outdated channel_reestablish. I've been pointing them to lightning/bolts#934 and telling them to configure their nodes to send an error message to tell my node to force-close — or to file a bug report if LND provides them no way to do that.

Kudos for implementing the spec! Maybe LND will too one of these days. 😉

Roasbeef commented 1 year ago

Hey @cdecker I checked out that link, but is isn't actually a concrete spec change, instead it's an issue outlining an alternative interpretation and amendment to the spec. When it was brought up in the spec meeting, I gave pause as I envisioned scenarios like the ones our users are running into today, wherein they attempt to recover their funds, but end up in a stuck situation (potentially losing to suspended/lost funds).

Rolling out the change proposed change also wouldn't be super easy as it would be new clients modifying their behavior from the PoV of old clients that rely on such behavior. If we want to consider amending the spec, then I think we should do so via a new feature bit, something like lax_channel_reestablish or w/e. That would allow new nodes to properly gate their behavior, instead of modifying behavior w/o any sort of externally visible signal.

Roasbeef commented 1 year ago

Ok, so I was wrong above: there was a spec change to stop closing as soon as you get a bad channel_reestablish message, but peers should still continue to close when an error is received. Today it appears that because CLN will tear down the connection, the sent error never gets to the processing logic, so the channel is never force closed.

We dug a bit into the changes in the past release and found this: https://github.com/ElementsProject/lightning/commit/b698a5a5ef4acb8d90aec0f6dccc52fb6c37afb2#diff-b1299b24697159a06e7b426c5086c36df8e48db15739a0e301d53a9a21149722R3005-R3006. From my reading, before CLN would send an error, then force close. With the above change, CLN will now send an a warning, then close the connection. lnd nodes have always sent an error during the recovery flow when we know we can't continue at all, which worked before since other peers would then force close either off that bad chan reest message, or the error itself.

IIUC, the impact of the above change is that:

CLN will now send a warning and close the connection
This means that CLN will never get the error message that prompts the peer to close
lnd will reconnect, send the chan reest message, then the error again, which begins the loop all over again

I think if behavior is changed to send the warning, but not disconnect (so the error can be sent), then things would work out again.

btweenthebars commented 1 year ago

Yep. I've been contacted several times now (thrice, I think) by node operators who have restored a static channel backup and don't understand why my node is not force-closing when they connect to it and present an outdated channel_reestablish. I've been pointing them to lightning/bolts#934 and telling them to configure their nodes to send an error message to tell my node to force-close — or to file a bug report if LND provides them no way to do that.

Kudos for implementing the spec! Maybe LND will too one of these days. 😉

Hi, if you get in touch with LND peers having the issue. Can you have them try this: https://github.com/lightningnetwork/lnd/issues/7301#issuecomment-1416029023 ? LND dev wants to know if it work.

whitslack commented 1 year ago

Hi, if you get in touch with LND peers having the issue. Can you have them try this: lightningnetwork/lnd#7301 (comment) ? LND dev wants to know if it work.

Uhh, okay, I will, but damn, that's way too complicated for 99% of users, who are not developers.

guggero commented 1 year ago

Uhh, okay, I will, but damn, that's way too complicated for 99% of users, who are not developers.

Given that the solution linked to above might be too complicated and a problem for users running Umbrel and other node packaging solutions where running a custom version of lnd is not easy, I created a specific new command in chantools to solve this. Tested locally against CLN v22.11.1.

Steps (alternative to running the custom version of lnd mentioned here ):

Download the latest version of chantools (v0.10.7) (can be any machine, use the darwin release for MacOS).

Run the following command (replacing the peer address with the full address of the remote peer and the channel point with the channel in question):

chantools triggerforceclose --peer 039d9c531877266e190af2d51795b6104a1e91a188dad4b9ccf3ae4154075dd8e3@localhost:9747 --channel_point 8e3efb6f5537d964c1c02c0587f75fc6995888675083400d4b10c02f47ad3592:0

The tool will ask you for your seed, which is required to initiate the encrypted p2p communication with the peer. If this doesn't work for your risk profile, you might be better suited with running the custom version of lnd linked above on a separate machine just to trigger the force close.
If the channel is observed by lnd (e.g. showing up in the list of lncli pendingchannels), then just wait until the force close transaction is confirmed and the channel should automatically update the status there.
If the channel isn't showing up in the list of pending channels, you can use the chantools sweepremoteclosed command to sweep the funds.

rustyrussell commented 1 year ago

Indeed, it looks like we issue warning then we take our toys and go home.

Let's see if I can reproduce, then fix. We need to keep listening, in case there's an error...

ElementsProject / lightning