`fundchannel_cancel` hangs indefinitely on a peer's RBF'd second channel in `CHANNELD_AWAITING_LOCKIN` state with first channel in `CHANNELD_NORMAL` state

whitslack commented 3 weeks ago

Issue and Steps to Reproduce

Have one channel in CHANNELD_NORMAL state with a peer.
Propose to fund a second channel with that same peer using fundchannel_start.
Establish the second channel using fundchannel_complete.
Broadcast the signed funding transaction with a low fee rate.
Propose to fund a channel with some other peer using fundchannel_start.
Construct a transaction to fund the channel with the second peer, spending the same coin(s) as would be spent by the transaction of step 3 but using a higher fee rate.
Establish the channel with the second peer using fundchannel_complete.
Broadcast the second signed funding transaction, which replaces the first transaction.
Wait for confirmation of the second transaction.
Attempt to cancel the funding of the channel with the first peer using fundchannel_cancel.
fundchannel_cancel never returns.

`getinfo` output

This is with CLN 24.08.2.

whitslack commented 3 weeks ago

Might be a red herring. The first peer was connected inbound to me via a Tor exit node. (I have no Tor service.) I disconnected the peer, and it has yet to reconnect, so maybe it went offline but the Tor exit node didn't. Still, I suspect something might not be plumbed up quite right if fundchannel_cancel never notices that the peer isn't responding.

whitslack commented 3 weeks ago

Nope. There is definitely a problem here. The peer cannot maintain a stable connection with me because I keep sending WIRE_ERROR for the second channel and dropping the whole connection.

Here's what I see in the log repeatedly (every minute or so):

2024-11-02T15:54:31.710Z DEBUG   03…55-connectd: Connect IN
2024-11-02T15:54:31.711Z DEBUG   03…55-connectd: peer_out WIRE_INIT
2024-11-02T15:54:31.712Z DEBUG   03…55-connectd: peer_in WIRE_INIT
2024-11-02T15:54:31.712Z DEBUG   03…55-connectd: Handed peer, entering loop
2024-11-02T15:54:31.712Z DEBUG   03…55-chan#255536: Peer has reconnected, state CHANNELD_NORMAL: connecting subd
2024-11-02T15:54:31.734Z DEBUG   03…55-channeld-chan#255536: pid 14729, msgfd 188
2024-11-02T15:54:31.735Z DEBUG   03…55-chan#255536: Already have funding locked in
2024-11-02T15:54:31.735Z DEBUG   03…55-chan#255537: Peer has reconnected, state CHANNELD_AWAITING_LOCKIN: connecting subd
2024-11-02T15:54:31.735Z DEBUG   03…55-chan#255537: Telling connectd to send error 00…2e
2024-11-02T15:54:31.736Z DEBUG   03…55-connectd: peer_out WIRE_GOSSIP_TIMESTAMP_FILTER
2024-11-02T15:54:31.736Z DEBUG   03…55-gossipd: seeker: disabling gossip
2024-11-02T15:54:31.740Z DEBUG   03…55-connectd: peer_out WIRE_ERROR
2024-11-02T15:54:31.740Z DEBUG   03…55-connectd: discard_peer
2024-11-02T15:54:31.740Z DEBUG   03…55-lightningd: peer_disconnect_done
2024-11-02T15:54:31.740Z DEBUG   03…55-lightningd: Already reconnecting
2024-11-02T15:54:31.741Z DEBUG   plugin-funder: Cleaning up inflights for peer id 03…55
2024-11-02T15:54:31.741Z DEBUG   03…55-channeld-chan#255536: option_static_remotekey = 1, option_anchor_outputs = 0 option_anchors_zero_fee_htlc_tx = 1
2024-11-02T15:54:31.741Z DEBUG   03…55-channeld-chan#255536: init LOCAL: remote_per_commit = 03…7e, old_remote_per_commit = 03…73 next_idx_local = 687 next_idx_remote = 688 revocations_received = 687 feerates { SENT_ADD_ACK_REVOCATION:1255 } range 253-12700 blockheights { SENT_ADD_ACK_REVOCATION:0 }, our current 868565
2024-11-02T15:54:31.742Z DEBUG   03…55-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
2024-11-02T15:54:31.742Z DEBUG   03…55-hsmd: Got WIRE_HSMD_GET_PER_COMMITMENT_POINT
2024-11-02T15:54:31.742Z DEBUG   03…55-channeld-chan#255536: peer_out WIRE_CHANNEL_REESTABLISH
2024-11-02T15:54:31.743Z DEBUG   03…55-channeld-chan#255536: billboard: Sent reestablish, waiting for theirs
2024-11-02T15:54:36.737Z DEBUG   03…55-connectd: Subd did not close, forcing close
2024-11-02T15:54:36.737Z **BROKEN** 03…55-connectd: Peer did not close, forcing close
2024-11-02T15:54:36.737Z INFO    03…55-channeld-chan#255536: Peer connection lost
2024-11-02T15:54:36.737Z DEBUG   03…55-channeld-chan#255536: Status closed, but not exited. Killing
2024-11-02T15:54:36.737Z INFO    03…55-chan#255536: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)

Note that there are two different channel IDs here. 255536 is the first channel, which is in CHANNELD_NORMAL state. 255537 is the second channel, which is in CHANNELD_AWAITING_LOCKIN state and would be indefinitely since its funding transaction got replaced (intentionally).

Why does connectd disconnect the peer after sending WIRE_ERROR when there is still a perfectly good channel running on the connection?

Epilogue: Interestingly, my long-hung fundchannel_cancel finally returned when I gave up and called dev-forget-channel. Unfortunately, I suspect this means that nobody ended up telling the peer to abandon the channel too.

{
   "code": -1,
   "message": "Channel structure was freed!"
}

ElementsProject / lightning