Blockstream / greenlight

Build apps using self-custodial lightning nodes in the cloud
https://blockstream.github.io/greenlight/getting-started/
MIT License

track: VLS <> CLN desync in v24.02 #476

Open cdecker opened 1 week ago

cdecker commented 1 week ago

Some users have been reporting an issue with the VLS signer refusing to sign a channel re-establish (likely the commitment secret exchange) due to the following error:

FAILED PRECONDITION: policy failure: get_per_commitment_secret: cannot revoke commitment_number 695 when next_holder_commit_num is 696

On the server-side we can see the same in the log lines:

UNUSUAL 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-channeld-chan#1: Adding HTLC 18446744073709551615 too slow: killing connection
[BreezSdk] {INFO} (2024-07-07T12:27:19.990279Z) : node-logs: INFO    02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#1: Peer transient failure in CHANNELD_NORMAL: Adding HTLC timed out: killed connection

The reconnection attempt then eventually times out. We try to pay anyway, but fail because no channels are available.

This does appear to be a new issue, but it is reminiscent of https://github.com/Blockstream/greenlight/issues/431, which was fixed by v24.02. There appears to be another non-atomic transition in VLS <> CLN that we are not reflecting.

devrandom commented 1 week ago

if next_holder_commit_num is 696, then the current is 695. and of course we can't revoke the current.
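
To make the arithmetic concrete, here is a minimal sketch of the kind of check that would produce this error. It is not copied from the VLS source: the function name `check_can_revoke`, the `PolicyError` type, and the exact `n + 2` threshold are assumptions inferred from the error message and the reasoning above (commitment `n` is only revocable once `n + 1` is the current one).

```python
class PolicyError(Exception):
    """Hypothetical error type standing in for a signer policy failure."""


def check_can_revoke(commitment_number: int, next_holder_commit_num: int) -> None:
    # Assumption: the current commitment is next_holder_commit_num - 1, and
    # only commitments strictly older than the current one may be revoked.
    # Equivalently, next_holder_commit_num must be at least commitment_number + 2.
    if commitment_number + 2 > next_holder_commit_num:
        raise PolicyError(
            f"cannot revoke commitment_number {commitment_number} "
            f"when next_holder_commit_num is {next_holder_commit_num}"
        )


# The state from the report: 695 is the current commitment, so this is refused.
try:
    check_can_revoke(695, 696)
except PolicyError as err:
    print(err)

# In the same state, revoking the previous commitment (694) is accepted.
check_can_revoke(694, 696)
```

Under this reading, the request for 695 only becomes valid once the signer has advanced `next_holder_commit_num` to 697.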

so the question is how we arrived at this state.

the other question is whether 696 was actually signed by VLS.

logs regarding the signing of holder commitment transactions would be useful here.

cdecker commented 1 week ago

I'm not quite sure how this could happen, as it appears that the CLN node made progress while the VLS state did not. As I read the error, CLN is trying to revoke our current commitment number, right? But we persist the VLS state first, and only then pass it to the node, so how could the latter make progress without the former?
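
The ordering described here can be sketched as follows. This is a toy model, not Greenlight or VLS code: `SignerStore`, `Node`, and `apply_signer_response` are hypothetical names, and the single counter stands in for the full signer state. It only illustrates why, with persist-first ordering, the node should be able to lag the persisted signer state but never lead it.

```python
from dataclasses import dataclass


@dataclass
class SignerStore:
    """Hypothetical durable store for the signer state."""
    next_holder_commit_num: int = 0
    fail_next_write: bool = False

    def persist(self, commit_num: int) -> None:
        if self.fail_next_write:
            raise IOError("simulated persistence failure")
        self.next_holder_commit_num = commit_num


@dataclass
class Node:
    """Hypothetical stand-in for the node's view of the channel."""
    next_holder_commit_num: int = 0


def apply_signer_response(store: SignerStore, node: Node, commit_num: int) -> None:
    # Persist first: if this raises, the node never sees the response.
    store.persist(commit_num)
    # Only after the write succeeded is the response handed to the node.
    node.next_holder_commit_num = commit_num


store, node = SignerStore(), Node()
apply_signer_response(store, node, 696)

# A failed write leaves both sides where they were.
store.fail_next_write = True
try:
    apply_signer_response(store, node, 697)
except IOError:
    pass

# With this ordering the node can lag the store but never lead it.
assert node.next_holder_commit_num <= store.next_holder_commit_num
```

If the real system follows this ordering, the reported error suggests a state update that takes a path around it, which is the non-atomic transition the issue is trying to locate.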