Blockstream / greenlight

Build apps using self-custodial lightning nodes in the cloud
https://blockstream.github.io/greenlight/getting-started/
MIT License

Unable to send/receive #92

Closed: roeierez closed this issue 1 year ago

roeierez commented 1 year ago

Logs attached. I have one channel waiting to close and one open channel. Several seconds after the node starts it disconnects from the remote peer. I see this in the logs:

node-logs: UNUSUAL 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#2: Peer permanent failure in AWAITING_UNILATERAL: Fulfilled HTLC 4 SENT_REMOVE_HTLC cltv 776491 hit deadline

c_breez.1.log
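
For context on the UNUSUAL line above: CLN fails a channel permanently and goes on-chain when a fulfilled HTLC's CLTV expiry comes within a safety margin of the current block height before the removal has been committed. A minimal sketch of that kind of deadline check, with field names and the margin being illustrative rather than CLN's actual code:

```rust
// Field names and the `deadline_blocks` margin are illustrative, not CLN's code.
struct FulfilledHtlc {
    id: u64,
    cltv_expiry: u32, // absolute block height at which the HTLC times out
}

/// Returns true once waiting any longer risks losing the HTLC: the peer has
/// not committed to removing it and the expiry is within our safety margin,
/// so the channel is failed permanently and we go on-chain to claim it.
fn hit_deadline(htlc: &FulfilledHtlc, current_height: u32, deadline_blocks: u32) -> bool {
    current_height + deadline_blocks >= htlc.cltv_expiry
}

fn main() {
    // Values taken from the log line above: HTLC 4 with cltv 776491.
    let htlc = FulfilledHtlc { id: 4, cltv_expiry: 776_491 };
    println!("HTLC {} hit deadline: {}", htlc.id, hit_deadline(&htlc, 776_480, 14));
}
```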

cdecker commented 1 year ago

At first I thought it might be caused by the recent dual-funding spec change, which made option_anchor_outputs no longer a requirement for option_dual_funding. That change caused some pre-22.11 nodes to drop the connection if we advertise option_dual_funding but not option_anchor_outputs. However, in that case we wouldn't get a re-establish, and the disconnect would be immediate rather than delayed.
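
Roughly the dependency check a strict pre-22.11 peer would apply to our init features; this is a sketch only (bit numbers follow the BOLT 9 feature table, the function is not CLN's actual code):

```rust
use std::collections::HashSet;

// Bit numbers follow the BOLT 9 feature table; the check itself is a
// simplification, not CLN's or the peer's actual code.
const OPT_ANCHOR_OUTPUTS: u16 = 20; // option_anchor_outputs (even bit)
const OPT_DUAL_FUND: u16 = 28;      // option_dual_fund (even bit)

fn init_features_acceptable(features: &HashSet<u16>) -> bool {
    // A feature is set if either its even (compulsory) or odd (optional) bit is present.
    let has = |bit: u16| features.contains(&bit) || features.contains(&(bit + 1));
    // Old rule: dual funding required anchor outputs. The spec change removed
    // this dependency, so newer nodes no longer enforce it.
    if has(OPT_DUAL_FUND) && !has(OPT_ANCHOR_OUTPUTS) {
        return false; // strict peer hangs up right after `init`
    }
    true
}

fn main() {
    // Advertising option_dual_fund without option_anchor_outputs: a strict
    // pre-22.11 peer would disconnect immediately, with no re-establish.
    let feats: HashSet<u16> = [OPT_DUAL_FUND].into_iter().collect();
    assert!(!init_features_acceptable(&feats));
}
```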

I found the following lines in the raw session log:

2023-02-22T11:51:50+01:00 {} stderr: [2023-02-22T10:51:50Z DEBUG gl_subdaemon::gossipd] Peer has disconnected: 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d
2023-02-22T11:51:50+01:00 {} stderr: [2023-02-22T10:51:50Z ERROR gl_signerproxy::hsmproxy] Connection lost
2023-02-22T11:51:50+01:00 {} stderr: [2023-02-22T10:51:50Z WARN  gl_signerproxy::hsmproxy] hsmproxy stopped processing requests with error: processing requests
2023-02-22T11:51:50+01:00 {} stdout: DEBUG   02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-connectd: Last ping unreturned: hanging up
2023-02-22T11:51:50+01:00 {} stdout: INFO    02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-channeld-chan#3: Peer connection lost
2023-02-22T11:51:50+01:00 {} stdout: INFO    02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#3: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2023-02-22T11:51:50+01:00 {} stdout: DEBUG   02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-connectd: destroy_subd: 1 subds, to_peer conn (nil), read_to_die = 1
2023-02-22T11:51:50+01:00 {} stdout: DEBUG   02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-connectd: peer_conn_closed
2023-02-22T11:51:50+01:00 {} stdout: DEBUG   02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-lightningd: peer_disconnect_done

This sounds like the peer (LSP?) isn't replying to pings, which is unexpected. Any chance we can corroborate that with logs from the other side?
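
For reference, a rough sketch of the keepalive behaviour behind the "Last ping unreturned: hanging up" line; struct and method names and the 30s interval are illustrative, not connectd's actual code:

```rust
use std::time::{Duration, Instant};

// Names and the 30s interval are illustrative, not connectd's actual code.
struct PeerKeepalive {
    last_ping_sent: Option<Instant>,
    pong_received: bool,
}

impl PeerKeepalive {
    /// Called on each keepalive tick. Returns false if the peer should be
    /// hung up on because the previous ping was never answered.
    fn tick(&mut self, now: Instant, interval: Duration) -> bool {
        if let Some(sent) = self.last_ping_sent {
            if !self.pong_received && now.duration_since(sent) >= interval {
                return false; // last ping unreturned: hang up
            }
        }
        // Otherwise send a new ping (elided) and start waiting for its pong.
        self.last_ping_sent = Some(now);
        self.pong_received = false;
        true
    }

    fn on_pong(&mut self) {
        self.pong_received = true;
    }
}

fn main() {
    let interval = Duration::from_secs(30);
    let mut ka = PeerKeepalive { last_ping_sent: None, pong_received: false };
    let t0 = Instant::now();
    assert!(ka.tick(t0, interval));                 // first ping goes out
    ka.on_pong();                                   // peer answered in time
    assert!(ka.tick(t0 + interval, interval));      // second ping goes out
    assert!(!ka.tick(t0 + 2 * interval, interval)); // no pong this time: hang up
}
```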

cdecker commented 1 year ago

I'll also attach the raw logs, so we can better relate the two.

breez-session.log

cdecker commented 1 year ago

We are also hunting down an issue with the VLS signer that may be related: https://gitlab.com/lightning-signer/validating-lightning-signer/-/issues/248

It sounds like the LDK library is validating a field that we don't bother checking (the sighash flag of the signature): since we must enforce a specific value anyway, we simply overwrite it rather than validating it. The symptoms are that the signer logs an error (RUST_LOG=glclient=debug should show it) and that a pending HSM request on the server never gets resolved.
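
A minimal sketch of the two approaches to that trailing sighash byte; constants and function names are illustrative, not the actual VLS/LDK or CLN code:

```rust
// Constants and function names are illustrative, not the actual VLS/LDK or CLN code.
const SIGHASH_ALL: u8 = 0x01;

/// "Overwrite" approach: whatever the trailing byte is, force SIGHASH_ALL,
/// since that is the only value we would ever accept anyway.
fn normalize_sighash(mut sig_with_flag: Vec<u8>) -> Vec<u8> {
    if let Some(last) = sig_with_flag.last_mut() {
        *last = SIGHASH_ALL;
    }
    sig_with_flag
}

/// "Validate" approach: refuse to sign if the flag is not SIGHASH_ALL. If the
/// sender never set the byte, this fails and the signing request is never
/// answered, which matches the pending HSM request described above.
fn check_sighash(sig_with_flag: &[u8]) -> Result<(), &'static str> {
    match sig_with_flag.last() {
        Some(&SIGHASH_ALL) => Ok(()),
        _ => Err("unexpected or missing sighash flag"),
    }
}

fn main() {
    let sig = vec![0u8; 71]; // placeholder signature bytes with an unset flag byte
    assert!(check_sighash(&sig).is_err()); // a validating signer would reject this
    let fixed = normalize_sighash(sig);
    assert!(check_sighash(&fixed).is_ok());
}
```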

I'm mentioning it here because I have seen an HTLC get stuck when we couldn't sign off on the next state due to this missing sighash flag.

cdecker commented 1 year ago

Looking into this further, it looks like it is not the VLS signer checking the sighash flag, but rather the spec change that made option_anchor_outputs no longer a dependency. I think that's the case because we see the signer attach at 11:50:45, we never get the detach log message, and requests keep being streamed throughout the life of the node. On the other hand, we send an error ("Awaiting unilateral close") because we have one channel closing. I wonder if a channel mixup is to blame here :thinking: I'll try to reproduce this in isolation on CLN, but chances are such a mixup would already be fixed in CLN v22.11.