Closed roeierez closed 1 year ago
At first I thought it might be because of the recent dual-funding spec change, which made the option_anchor_outputs
not be a requirement of option_dual_funding
. This caused some pre-22.11 nodes to drop a connection in case we advertise option_dual_funding
but not option_anchor_outputs
. However in that case we wouldn't get a re-establish, and the disconnect would be immediate, not delayed.
I found the following lines in the raw session log:
2023-02-22T11:51:50+01:00 {} stderr: [2023-02-22T10:51:50Z DEBUG gl_subdaemon::gossipd] Peer has disconnected: 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d
2023-02-22T11:51:50+01:00 {} stderr: [2023-02-22T10:51:50Z ERROR gl_signerproxy::hsmproxy] Connection lost
2023-02-22T11:51:50+01:00 {} stderr: [2023-02-22T10:51:50Z WARN gl_signerproxy::hsmproxy] hsmproxy stopped processing requests with error: processing requests
2023-02-22T11:51:50+01:00 {} stdout: DEBUG 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-connectd: Last ping unreturned: hanging up
2023-02-22T11:51:50+01:00 {} stdout: INFO 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-channeld-chan#3: Peer connection lost
2023-02-22T11:51:50+01:00 {} stdout: INFO 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#3: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2023-02-22T11:51:50+01:00 {} stdout: DEBUG 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-connectd: destroy_subd: 1 subds, to_peer conn (nil), read_to_die = 1
2023-02-22T11:51:50+01:00 {} stdout: DEBUG 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-connectd: peer_conn_closed
2023-02-22T11:51:50+01:00 {} stdout: DEBUG 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-lightningd: peer_disconnect_done
This sounds like the peer (LSP?) isn't replying to pings, which is unexpected. Any chance we can corroborate that with logs from the other side?
I'll also attach the raw logs, so we can better relate the two. breez-session.log
We are also hunting down an issue with the VLS signer that may be related: https://gitlab.com/lightning-signer/validating-lightning-signer/-/issues/248
It sounds like the LDK library is validating a field that we aren't considering (sighash flag of the signature) since we must enforce a specific value anyway, so we just overwrite it, rather than validating it. Symptoms are that the signer logs an error (RUST_LOG=glclient=debug
should show it), and there is a pending HSM request on the server that never gets resolved.
I'm mentioning this here because I have seen an HTLC being stuck because we couldn't sign off on the next state due to this missing sighash flag.
Looking into this further it looks like it is not the VLS signer checking the sighash flag, but rather the spec change to make option_anchor_outputs
not a dependency anymore. I think that's the case because we see the signer attach at 11:50:45, and we never get the detach log message, and we get requests being streamed throughout the life of the node. On the other hand we send an error: Awaiting unilateral close
because we have one channel closing. I wonder if a channel mixup is to blame here :thinking: I'll try to reproduce this in isolation on CLN, but chances are such a mixup would be fixed in CLN v22.11 already.
logs attached. I have one waiting close channel and one opened channel. Several seconds after the node starts it disconnects from the remote peer. I see this in the logs:
node-logs: UNUSUAL 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#2: Peer permanent failure in AWAITING_UNILATERAL: Fulfilled HTLC 4 SENT_REMOVE_HTLC cltv 776491 hit deadline
c_breez.1.log