ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.82k stars 895 forks source link

fail to force close channel when received SCB from LND who keeps terminating connection because they don't understand our message #5876

Open btweenthebars opened 1 year ago

btweenthebars commented 1 year ago

Issue and Steps to Reproduce

I have described the incident here on LND repo

It looks to me that it's half our fault not force closing the channel when we know that the revocation numbers don't match. This is shown on the listpeers:

               "state_changes": [
                  {
                     "timestamp": "2022-11-06 08:47:17.578Z",
                     "old_state": "CHANNELD_AWAITING_LOCKIN",
                     "new_state": "CHANNELD_NORMAL",
                     "cause": "remote",
                     "message": "Lockin complete"
                  }
               ],
               "status": [
                  "CHANNELD_NORMAL:bad reestablish revocation_number: 477 vs 483",
                  "CHANNELD_NORMAL:Will attempt reconnect in 300 seconds"
               ],

I was able to force close it manually

getinfo output

"version": "v22.11.1"

cdecker commented 1 year ago

There was a change in the ln spec a while ago, in which Eclair asked us not to force close automatically if we detect they're lagging behind (allowing for a last ditch attempt at maybe restoring a more up to date backup).

So I'm tempted to say that we're correct and that we should close only when asked further.

cdecker commented 1 year ago

Found the LN spec issue here: https://github.com/lightning/bolts/issues/934 and the local tracking issue here https://github.com/ElementsProject/lightning/issues/4927

This suggests that lnd should be sending an error in reply to our warning so we know we should be force closing.

whitslack commented 1 year ago

Yep. I've been contacted several times now (thrice, I think) by node operators who have restored a static channel backup and don't understand why my node is not force-closing when they connect to it and present an outdated channel_reestablish. I've been pointing them to lightning/bolts#934 and telling them to configure their nodes to send an error message to tell my node to force-close — or to file a bug report if LND provides them no way to do that.

Kudos for implementing the spec! Maybe LND will too one of these days. 😉

Roasbeef commented 1 year ago

Hey @cdecker I checked out that link, but is isn't actually a concrete spec change, instead it's an issue outlining an alternative interpretation and amendment to the spec. When it was brought up in the spec meeting, I gave pause as I envisioned scenarios like the ones our users are running into today, wherein they attempt to recover their funds, but end up in a stuck situation (potentially losing to suspended/lost funds).

Rolling out the change proposed change also wouldn't be super easy as it would be new clients modifying their behavior from the PoV of old clients that rely on such behavior. If we want to consider amending the spec, then I think we should do so via a new feature bit, something like lax_channel_reestablish or w/e. That would allow new nodes to properly gate their behavior, instead of modifying behavior w/o any sort of externally visible signal.

Roasbeef commented 1 year ago

Ok, so I was wrong above: there was a spec change to stop closing as soon as you get a bad channel_reestablish message, but peers should still continue to close when an error is received. Today it appears that because CLN will tear down the connection, the sent error never gets to the processing logic, so the channel is never force closed.

We dug a bit into the changes in the past release and found this: https://github.com/ElementsProject/lightning/commit/b698a5a5ef4acb8d90aec0f6dccc52fb6c37afb2#diff-b1299b24697159a06e7b426c5086c36df8e48db15739a0e301d53a9a21149722R3005-R3006. From my reading, before CLN would send an error, then force close. With the above change, CLN will now send an a warning, then close the connection. lnd nodes have always sent an error during the recovery flow when we know we can't continue at all, which worked before since other peers would then force close either off that bad chan reest message, or the error itself.

IIUC, the impact of the above change is that:

I think if behavior is changed to send the warning, but not disconnect (so the error can be sent), then things would work out again.

btweenthebars commented 1 year ago

Yep. I've been contacted several times now (thrice, I think) by node operators who have restored a static channel backup and don't understand why my node is not force-closing when they connect to it and present an outdated channel_reestablish. I've been pointing them to lightning/bolts#934 and telling them to configure their nodes to send an error message to tell my node to force-close — or to file a bug report if LND provides them no way to do that.

Kudos for implementing the spec! Maybe LND will too one of these days. 😉

Hi, if you get in touch with LND peers having the issue. Can you have them try this: https://github.com/lightningnetwork/lnd/issues/7301#issuecomment-1416029023 ? LND dev wants to know if it work.

whitslack commented 1 year ago

Hi, if you get in touch with LND peers having the issue. Can you have them try this: lightningnetwork/lnd#7301 (comment) ? LND dev wants to know if it work.

Uhh, okay, I will, but damn, that's way too complicated for 99% of users, who are not developers.

guggero commented 1 year ago

Uhh, okay, I will, but damn, that's way too complicated for 99% of users, who are not developers.

Given that the solution linked to above might be too complicated and a problem for users running Umbrel and other node packaging solutions where running a custom version of lnd is not easy, I created a specific new command in chantools to solve this. Tested locally against CLN v22.11.1.

Steps (alternative to running the custom version of lnd mentioned here ):

rustyrussell commented 1 year ago

Indeed, it looks like we issue warning then we take our toys and go home.

Let's see if I can reproduce, then fix. We need to keep listening, in case there's an error...