Blockstream / greenlight

Build apps using self-custodial lightning nodes in the cloud
https://blockstream.github.io/greenlight/getting-started/
MIT License

Ran out of routes after X attempts #421

Open darioAnongba opened 4 months ago

darioAnongba commented 4 months ago

Hi everyone,

I've been looking around and realised this issue is common; other people are experiencing it too. I'm opening an issue nonetheless because I think it's node-specific, so we can track it. For a long time I thought this was a Breez issue, but it seems to be a Greenlight one, as the Breez folks asked me to open an issue here.

The issue

Sending a payment to any node (tried: Alby, Phoenix, Blink and own personal node) always fails with

Payment timeout: status: Unknown, message: "Error calling method Pay: RpcError { code: Some(210), message: \"Ran out of routes to try after X attempts: see `paystatus`\", data: None }", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Wed, 01 May 2024 11:33:58 GMT", "content-length": "0"} }

Using the Breez SDK.

Here is the node info:

{
    "id": "022a91bfafa45364d4a28c67572867829d3244002fec8ae42c9a6e670c7ad27b1f",
    "block_height": 841641,
    "channels_balance_msat": 17951697,
    "onchain_balance_msat": 0,
    "pending_onchain_balance_msat": 0,
    "utxos": [],
    "max_payable_msat": 17951697,
    "max_receivable_msat": 3982048303,
    "max_single_payment_amount_msat": 4294967000,
    "max_chan_reserve_msats": 0,
    "connected_peers": [],
    "inbound_liquidity_msats": 327001150
}

and the LSP info:

{
    "id": "03cea51f-b654-4fb0-8e82-eca137f236a0",
    "name": "Cloud Breez LSP",
    "widget_url": "",
    "pubkey": "02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d",
    "host": "212.83.139.32:9835",
    "base_fee_msat": 1000,
    "fee_rate": 0.00001,
    "time_lock_delta": 34,
    "min_htlc_msat": 600,
    "lsp_pubkey":...,
    "opening_fee_params_list": {
        "values": [
            {
                "min_msat": 2500000,
                "proportional": 4000,
                "valid_until": "2024-05-01T13:52:51.895Z",
                "max_idle_time": 4320,
                "max_client_to_self_delay": 432,
                "promise": "20fb2e610d7ec7003ce3110ce4068e6819279ebbd94e7adfde0480687ba9c5110750dd04bd4f710d3be04c0c275d8fb5560ed47a0e741d054207c1581be87ec050"
            },
            {
                "min_msat": 2500001,
                "proportional": 4000,
                "valid_until": "2024-05-15T11:52:51.895Z",
                "max_idle_time": 4320,
                "max_client_to_self_delay": 432,
                "promise": "1fd73473781f25f81689574b0cc8eb3545f5cec9d39cc2727adbe0089a0335ad5b2edc8f10d412e6e3f50d78638e978b00d228ebc385d4aadc1dfa24a922fdf1dd"
            }
        ]
    }
}

I wouldn't mind simply withdrawing the sats from this node and creating a new one, as this one still uses the invite code from Breez and we'd like to switch to using Greenlight certificates.

Thanks for your time, Dario

cdecker commented 3 months ago

Sadly the error messages returned by the pay method are far from perfect: almost everything returns this obtuse catch-all error, since we try and retry until we ultimately run out of routes. Since I wrote the plugin, that's my fault, and we'll make them more informative ASAP. In the meantime we can help identify the problem by wading through the logs and seeing how individual attempts failed.

As a matter of fact, we're hoping to get some feedback on how to better report these in the future. Would adding an array of attempts in the `getroute` format, along with the error details, be better or more confusing? If it is confusing, how should we condense the results down to something simpler to understand? Mind you, the results are intended to be consumed by developers, not end users, so I expect the more info, the better.
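A hypothetical sketch of what such an attempts array could look like, reusing the route-hop fields that `getroute` already returns. Every value below is illustrative only; the failure code 4103 (WIRE_TEMPORARY_CHANNEL_FAILURE) is taken from the log excerpt later in this thread:

```json
{
  "attempts": [
    {
      "route": [
        {
          "id": "02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d",
          "channel": "841641x1x0",
          "direction": 1,
          "amount_msat": 17951697,
          "delay": 34,
          "style": "tlv"
        }
      ],
      "failure": {
        "code": 4103,
        "message": "WIRE_TEMPORARY_CHANNEL_FAILURE"
      }
    }
  ]
}
```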

darioAnongba commented 3 months ago

Sorry I didn't answer this earlier (as I was stuck with the other bug lol).

I retried again today and the error is still the same: it's impossible to send any payment from the node. To be honest, I don't need this node anymore and it's fine for us to ditch it, but we still have some sats on it (22951697 msats, to be precise). So it's up to you:

No sats shall be wasted.

cdecker commented 3 months ago

Interesting, this is the second node I see today that is running into the HTLC addition timeout:

2024-05-01T11:53:14+02:00 {} stdout: UNUSUAL 02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-channeld-chan#2: Adding HTLC 18446744073709551615 too slow: killing connection
2024-05-01T11:53:14+02:00 {} stdout: DEBUG   02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-channeld-chan#2: Status closed, but not exited. Killing
2024-05-01T11:53:14+02:00 {} stdout: INFO    02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#2: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2024-05-01T11:53:14+02:00 {} stdout: DEBUG   02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#2: Failing HTLC 18446744073709551615 due to peer death
2024-05-01T11:53:14+02:00 {} stdout: DEBUG   02c811e575be2df47d8b48dab3d3f1c9b0f6e16d0d40b5ed78253308fc2bd7170d-chan#2: local_routing_failure: 4103 (WIRE_TEMPORARY_CHANNEL_FAILURE)
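One detail worth noting in the log: the HTLC id 18446744073709551615 is exactly 2^64 - 1 (the maximum unsigned 64-bit value), which suggests a sentinel or uninitialized value rather than a real HTLC index. That reading is an assumption, not something verified against the channeld source:

```python
# The HTLC id from the log above is exactly UINT64_MAX, a value commonly
# used as a "no such id" sentinel in C and Rust code.
htlc_id = 18446744073709551615
print(htlc_id == 2**64 - 1)  # -> True
```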

We're still investigating this in another issue, but maybe you can add a bit of color to this. The following are potential sources that we'd like to exclude:

  1. High latency on the node <> signer link (a slow signer could spend the whole timeout just transferring the request)
  2. Low bandwidth on the node <> signer link (the state sync protocol adds around 10 KB of overhead, so a link under 0.5 KB/s may cause this issue)
  3. High latency on the node <> peer link
  4. Just a slow peer

You can likely exclude the first two by attempting a payment while connected to a low-latency, high-bandwidth WLAN.
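To put rough numbers on items 1 and 2, here is a back-of-the-envelope sketch. Only the ~10 KB state-sync figure comes from the list above; the 30 s timeout is an assumed value for illustration, not taken from Core Lightning:

```python
# Estimate how long one node<->signer state sync takes on a given link.
# STATE_SYNC_BYTES reflects the ~10 KB figure above; the timeout is an
# assumed illustrative value, not a constant from the CLN source.
STATE_SYNC_BYTES = 10 * 1024
ASSUMED_HTLC_ADD_TIMEOUT_S = 30.0

def sync_time_s(bandwidth_bytes_per_s: float, rtt_s: float = 0.0) -> float:
    """Transfer time for the state sync plus one round trip of latency."""
    return rtt_s + STATE_SYNC_BYTES / bandwidth_bytes_per_s

# A 0.5 KB/s link spends 20 s on transfer alone, eating most of the budget:
t = sync_time_s(512, rtt_s=0.2)
print(f"{t:.1f}s, risky: {t > ASSUMED_HTLC_ADD_TIMEOUT_S * 0.5}")  # -> 20.2s, risky: True
```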