Blockstream / greenlight

Build apps using self-custodial lightning nodes in the cloud
https://blockstream.github.io/greenlight/getting-started/
MIT License
118 stars 27 forks source link

track: VLS <> CLN desync in v24.02 #476

Open cdecker opened 4 months ago

cdecker commented 4 months ago

Some users have been reporting an issue with the VLS signer refusing to sign a channel re-establish (likely the commitment secret exchagne) due to the following error:

FAILED PRECONDITION: policy failure: get_per_commitment_secret: cannot revoke commitment_number 695 when next_holder_commit_num is 696

On the server-side we can see the same in the log lines:

UNUSUAL 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-channeld-chan#1: Adding HTLC 18446744073709551615 too slow: killing connection
[BreezSdk] {INFO} (2024-07-07T12:27:19.990279Z) : node-logs: INFO    02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#1: Peer transient failure in CHANNELD_NORMAL: Adding HTLC timed out: killed connection

Which then eventually times out the reconnection timeout, we try to pay anyway, but fail because no channels are available.

This does appear to be a new issue, but reminiscent of https://github.com/Blockstream/greenlight/issues/431 which was fixed by v24.02. There appears to be another non-atomic transition in VLS <> CLN that we are not reflecting.

devrandom commented 4 months ago

if next_holder_commit_num is 696, then the current is 695. and of course we can't revoke the current.

so the question is how we arrived at this state.

the question is whether 696 was actually signed by VLS?

logs regarding the signing of holder commitment transactions would be useful here.

cdecker commented 4 months ago

I'm not quite sure how this could happen, as it appears that the CLN node made progress while the VLS state did not? That's what I'm reading out of the error reporting that CLN is trying to revoke our current commitment number, right? But we persist the VLS state first, and only then pass it to the node, so how could the latter make progress without the former?

cdecker commented 4 months ago

I have a node (03442386091363708e6e3bcda4dd978d7777743fde7a4990166cdec2e5065149f1/ prod/ttt) that is affected by this issue. Pulling logs, state and CLN DB to reproduce the case. Let's see if we can find the exact cause for the desync, or at least reproduce the state and then come up with a fix after the fact to recover.

cdecker commented 4 months ago

We have the following entry in the DB for the channel:

  3 |       1 |                  |                    5 |                     6 |     3 |      1 |             0 |             0 |            22092 |             22089 |        10842 | \x91fcb6f06b5b94586b13101b827bb204d78137d87031a346b27501280b239f0d |                 0 |          105000 |                     1 |             0 |        1330066 | \x03daa3e7827dcda27b240797ff43691f62ef5db21a3c78da57b33b9c99f280592b | \x02e338223fb0b618b46f221ed5fd2d4094da857ca63dc72ba6e9cba8bf477fb72f | \x03c9e0ad6900853d166362c7c6a15524c40046f8bd201ea140b247d514fd57d618 | \x02595a3116d6984f639f3d0510c303c9dc286b03a9fccefb009ca4088cd00653d3 | \x033e1e74247f8b4825a308560b4f40dc20c4837f3f7543a55ae81d8264ecf2f9c1 | \x03a947b234e56bc725692f99d40c2421146d58f4ac8eafe0b1fefc8eb71392e895 | \x03ecb61cb9a984e727f38f94f5f298e18ebbbb4296a0bd408bb5003be156acafd0 |                      |                       |                  3 |                                                                        |                     3 |                        |                     | \x70736274ff010204020000000103049a91f02001040101010501040106010301fb04020000000001012b289a01000000000022002087f8eb092754e219e7a66a6b900360a4cd31b927ed4855469bcbd1268e49b69a220203daa3e7827dcda27b240797ff43691f62ef5db21a3c78da57b33b9c99f280592b473044022044b87211cb947c15edb8331dc7600f731326e1315bc2bcbf40007d6032d74bef02201012db5ad200ee9af5d9c91a3291736ab86903fa0297f3d2664304c60bb1900d010103040100000001054752210293585196fac8ce82ba42cae3c51deda74667739edf287588cbbe69bf3a07a98d2103daa3e7827dcda27b240797ff43691f62ef5db21a3c78da57b33b9c99f280592b52ae22060293585196fac8ce82ba42cae3c51deda74667739edf287588cbbe69bf3a07a98d084330c3f000000000220603daa3e7827dcda27b240797ff43691f62ef5db21a3c78da57b33b9c99f280592b086c959ddf00000000010e2091fcb6f06b5b94586b13101b827bb204d78137d87031a346b27501280b239f0d010f0400000000011004539ca180000101282103daa3e7827dcda27b240797ff43691f62ef5db21a3c78da57b33b9c99f280592bac736460b2680103084a01000000000000010422002096516bcdab4c23f3fa72953bd7214d1fed2c5b537462593b5e5af212d9c96ea300010128210293585196fac8ce82ba42cae3c51deda74667739edf287588cbbe69bf3a07a98dac736460b2680103084a010000000000000104220020d5d031569d1d9b2697f4e6c58064cae906570a0ca246592abb644b1e070fb28e0001014d6321026fdba2f7c432947b88de2c15567679d08d343b1c25a2a5b5ef586577f058d91f67029000b2752102877e42c9c5704f1d9fdc3f2a7f933a3657cf80ddaf941ccbcbf2e011c808f98368ac010308320500000000000001042200207157815c402b50c0949747da7d649bcd7467531893e05599a9e5dd65065c12c9000101252103c9e0ad6900853d166362c7c6a15524c40046f8bd201ea140b247d514fd57d618ad51b2010308368c0100000000000104220020586919ba64a087365ece03fe2fd9648aac3d09e81e324b41593ddad4aada812100 | \x44b87211cb947c15edb8331dc7600f731326e1315bc2bcbf40007d6032d74bef1012db5ad200ee9af5d9c91a3291736ab86903fa0297f3d2664304c60bb1900d |                      |                      |         808181 |               0 |                  32 |                    32 |            41794214 |              41794214 |                10842 |                   9411 |             91166719 |               40464148 |                 1250 |                20638 |                  0 |           10000140 |                             | \x080000000000002a59 |         1000 |          10 |                                |                     |                        |                       0 | \x5120f380254c56eefffb235a5f30c2099e7aab4a308caa78b00f801f37e41370029b |                   0 |                     0 | \x91fcb6f06b5b94586b13101b827bb204d78137d87031a346b27501280b239f0d |              |      2 |                   3 |                               0 | \x036e4b8a8e12b5ad38f3230d23b5f003d9b15e9573f1d8476bfa8377d684b41b10 | \x03219ada05bf77e5a90bca7af3bbb70dc9467cd87c5a2cf7e38890fb5377079fb4 | \x0274bac2cc991ef2e158a7b2db36f06e07337d42a1fd7f20d85ac5030717baed97 | \x03edc7507bc2ad555b1a1ba1ebaf8931e99bbd6945c4c6f510f0b3ed0fcb05c089 | \x0293585196fac8ce82ba42cae3c51deda74667739edf287588cbbe69bf3a07a98d |                     |                       |                            0 |                             0 |                  |                     |                    |            0 |         105000000 |                 0 | -8834417532920193432 |  3982172984378634980 | 888605506458025984 |                             0 |                            0 | \x400000401000   |                 1 |                1000 |                 10 |                       34 |                       -1 |                        0 |             1720866916

Notice the nedxt_index_local=22092 which is also reflected in the below error message:

[2024-07-20T14:46:39Z DEBUG gl_client::signer] Handling message GetPerCommitmentPoint(GetPerCommitmentPoint { commitment_number: 22092 })
[2024-07-20T14:46:39Z DEBUG gl_client::signer] Current approvals: []
[2024-07-20T14:46:39Z INFO  vls_protocol_signer::handler] Restore node 03442386091363708e6e3bcda4dd978d7777743fde7a4990166cdec2e5065149f1
[2024-07-20T14:46:39Z INFO  lightning_signer::node]   Restore channel 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d0100000000000000 outpoint Some(OutPoint { txid: 2aadd40a3058ddcbf8791602478a72ab43ef582d03737f938751e15b7065444e, vout: 0 })
[2024-07-20T14:46:39Z INFO  lightning_signer::node]   Restore channel 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d0200000000000000 outpoint Some(OutPoint { txid: ff010cbb204960a71eb4acdead435cd1949cfbff6dee674156a6abb541ada7c3, vout: 0 })
[2024-07-20T14:46:39Z INFO  lightning_signer::node]   Restore channel 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d0300000000000000 outpoint Some(OutPoint { txid: 0d9f230b280175b246a33170d83781d704b27b821b10136b58945b6bf0b6fc91, vout: 0 })
[2024-07-20T14:46:39Z INFO  lightning_signer::node] Restore node 03442386091363708e6e3bcda4dd978d7777743fde7a4990166cdec2e5065149f1 on testnet
[2024-07-20T14:46:39Z INFO  vls_protocol_signer::handler] vls_protocol_signer::handler::HandlerBuilder::do_build:: NodeState::summary 0344: 1 invoices, 0 issued_invoices, 1 payments, excess_amount 0
[2024-07-20T14:46:39Z INFO  vls_protocol_signer::handler] signer protocol_version 5, node hsm_wire_max_version 5, setting protocol_version to 5
[2024-07-20T14:46:39Z DEBUG gl_client::signer] Updating state from 0 context request
[2024-07-20T14:46:39Z ERROR lightning_signer::util::status] FAILED PRECONDITION: policy failure: get_per_commitment_point: commitment_number 22092 invalid when next_holder_commit_num is 22092
[2024-07-20T14:46:39Z ERROR lightning_signer::util::status] BACKTRACE:
       0: backtrace::capture::Backtrace::create

[2024-07-20T14:46:39Z ERROR vls_protocol_signer::handler] Signing(Status { code: FailedPrecondition, message: "policy failure: get_per_commitment_point: commitment_number 22092 invalid when next_holder_commit_num is 22092" })

And the signer state for this channel is this (formatted as a JSON document):

{
    "blockheight": None,
    "channel_setup": {
        "channel_value_sat": 105000,
        "commitment_type": "AnchorsZeroFeeHtlc",
        "counterparty_points": {
            "delayed_payment_basepoint": "033e1e74247f8b4825a308560b4f40dc20c4837f3f7543a55ae81d8264ecf2f9c1",
            "funding_pubkey": "03daa3e7827dcda27b240797ff43691f62ef5db21a3c78da57b33b9c99f280592b",
            "htlc_basepoint": "02595a3116d6984f639f3d0510c303c9dc286b03a9fccefb009ca4088cd00653d3",
            "payment_point": "03c9e0ad6900853d166362c7c6a15524c40046f8bd201ea140b247d514fd57d618",
            "revocation_basepoint": "02e338223fb0b618b46f221ed5fd2d4094da857ca63dc72ba6e9cba8bf477fb72f"
        },
        "counterparty_selected_contest_delay": 144,
        "counterparty_shutdown_script": None,
        "funding_outpoint": {"txid": "91fcb6f06b5b94586b13101b827bb204d78137d87031a346b27501280b239f0d", "vout": 0},
        "holder_selected_contest_delay": 144,
        "holder_shutdown_script": None,
        "is_outbound": False,
        "push_value_msat": 0
    },
    "channel_value_satoshis": 105000,
    "enforcement_state": {
        "channel_closed": False,
        "counterparty_secrets": {
            "old_secrets": [
                [[216, 142, 164, 23, 77, 205, 108, 33, 114, 15, 186, 148, 200, 222, 118, 180, 109, 0, 222, 144, 68, 188, 165, 0, 151, 28, 154, 24, 161, 115, 81, 40], 281474976688569],
                [[129, 119, 227, 98, 245, 155, 79, 163, 184, 252, 82, 148, 98, 89, 99, 239, 181, 13, 206, 209, 156, 250, 79, 109, 156, 41, 4, 217, 34, 195, 79, 181], 281474976688570],
                [[189, 236, 242, 157, 175, 129, 246, 142, 121, 103, 118, 37, 162, 128, 245, 210, 253, 42, 238, 143, 28, 239, 113, 233, 214, 88, 248, 204, 171, 134, 236, 230], 281474976688572],
                [[148, 181, 99, 13, 185, 143, 42, 51, 211, 235, 107, 6, 97, 64, 181, 118, 204, 213, 21, 125, 155, 6, 96, 222, 64, 237, 46, 29, 178, 153, 26, 70], 281474976688568],
                [[59, 54, 146, 91, 49, 211, 208, 113, 170, 170, 215, 51, 98, 254, 155, 170, 55, 112, 116, 19, 110, 70, 248, 198, 162, 76, 248, 215, 119, 160, 246, 27], 281474976688592],
                [[87, 170, 72, 42, 98, 152, 174, 197, 80, 115, 100, 179, 187, 16, 198, 38, 168, 197, 30, 87, 205, 147, 111, 38, 56, 90, 156, 80, 10, 9, 53, 229], 281474976688608],
                [[101, 181, 121, 243, 54, 211, 24, 37, 130, 3, 9, 171, 217, 133, 152, 12, 52, 121, 106, 67, 64, 236, 174, 148, 181, 84, 86, 9, 223, 86, 55, 101], 281474976688576],
                [[145, 166, 87, 205, 168, 157, 2, 64, 22, 223, 188, 69, 180, 17, 228, 143, 165, 1, 106, 40, 49, 207, 224, 35, 33, 214, 216, 53, 223, 210, 138, 243], 281474976688768],
                [[173, 217, 166, 113, 98, 235, 5, 3, 138, 106, 109, 59, 75, 29, 192, 253, 175, 76, 22, 172, 209, 174, 87, 2, 225, 10, 219, 32, 215, 46, 14, 128], 281474976688896],
                [[42, 143, 149, 197, 27, 180, 199, 250, 249, 108, 15, 70, 187, 128, 76, 189, 39, 51, 180, 112, 39, 176, 207, 19, 246, 164, 235, 91, 113, 113, 222, 114], 281474976688640],
                [[241, 50, 195, 86, 217, 83, 174, 42, 250, 189, 30, 189, 84, 75, 181, 3, 206, 235, 166, 138, 19, 40, 13, 255, 191, 175, 37, 7, 248, 245, 166, 159], 281474976689152],
                [[77, 214, 9, 134, 195, 112, 174, 9, 107, 164, 61, 226, 217, 85, 27, 101, 152, 182, 169, 235, 96, 72, 44, 90, 148, 72, 170, 51, 26, 9, 227, 72], 281474976692224],
                [[222, 36, 179, 80, 238, 24, 83, 5, 61, 14, 223, 138, 19, 156, 165, 76, 225, 200, 68, 127, 3, 248, 203, 146, 217, 95, 130, 18, 217, 218, 34, 194], 281474976690176],
                [[218, 140, 179, 160, 87, 48, 211, 238, 168, 39, 239, 120, 155, 55, 14, 101, 180, 207, 38, 200, 101, 223, 42, 0, 32, 182, 45, 49, 102, 206, 129, 198], 281474976702464],
                [[9, 12, 81, 184, 24, 251, 48, 65, 240, 120, 65, 255, 28, 207, 23, 191, 169, 81, 95, 92, 242, 129, 116, 251, 216, 111, 249, 46, 37, 127, 33, 191], 281474976694272],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656],
                [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 281474976710656]
            ]
        },
        "current_counterparty_commit_info": {"feerate_per_kw": 1405, "is_counterparty_broadcaster": True, "offered_htlcs": [], "received_htlcs": [], "to_broadcaster_value_sat": 101430, "to_countersigner_value_sat": 1330},
        "current_counterparty_point": "03ecb61cb9a984e727f38f94f5f298e18ebbbb4296a0bd408bb5003be156acafd0",
        "current_counterparty_signatures": ["3044022044b87211cb947c15edb8331dc7600f731326e1315bc2bcbf40007d6032d74bef02201012db5ad200ee9af5d9c91a3291736ab86903fa0297f3d2664304c60bb1900d", []],
        "current_holder_commit_info": {"feerate_per_kw": 1405, "is_counterparty_broadcaster": False, "offered_htlcs": [], "received_htlcs": [], "to_broadcaster_value_sat": 1330, "to_countersigner_value_sat": 101430},
        "initial_holder_value": 0,
        "next_counterparty_commit_num": 22089,
        "next_counterparty_revoke_num": 22088,
        "next_holder_commit_info": None,
        "next_holder_commit_num": 22092,
        "previous_counterparty_commit_info": None,
        "previous_counterparty_point": "028dd8f530f122ebc4fb19fecd7c7a9312088db9dbb5d917516e2626c8d1d8e933"
    },
    "id": None
}

Let"s dig in :-)

cdecker commented 4 months ago

Notice that this is my personal node, which is why I'm happy to share this information publicly. This is not something we do for user nodes, but in this case not having to jump through hoops is useful ^^

devrandom commented 4 months ago

so, yes, it's looks desynced - CLN thinks that 22092 is the current holder commit num, and VLS thinks it's the next.

but I was more interested in logs, to understand how we got there.

devrandom commented 4 months ago

or to put it more precisely:

if 22092 is the current commit num, then 22091 must have been revoked. so the thing to check is that VLS actually got that revocation. if it didn't, then something is wrong on the CLN side. if it did, then something is wrong on VLS side.

just to make sure about correct 2-phase commit: you first persist the VLS state, and only when that is successful you return the result to CLN, right?

cdecker commented 4 months ago

Yeah, I'll look through the logs asap to see how we got into this state (I have them pulled, just didn't have time to give them a go).

The 2PC idea is essentially that we delay the ratcheting on the vls state until CLN has committed, and make the ratcheting idempotent (so that we can replay it against the signer to sync it up again). From what I can tell here CLN advanced for some reason but the VLS state lagged behind, so maybe this is a revocation that CLN processed but didn't quite forward to the signer yet?

devrandom commented 4 months ago

well, I don't see how that can logically happen, because VLS would not have released the revocation secret if it didn't actually revoke and persist. that's the mystery...

ksedgwic commented 4 months ago

Maybe VLS regressed the commitment number due to VLS #502

devrandom commented 4 months ago

ah, very good point. will follow up offline

devrandom commented 3 months ago

the relevant fix is in 0.12, currently in release-candidate, should be final mid next week.

devrandom commented 3 months ago

to clarify, our fix prevents a specific desync scenario. it does not correct a desynced state that has already been persisted