ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

renepay crashed #7035

Closed Amperstrand closed 5 months ago

Amperstrand commented 9 months ago

Issue and Steps to Reproduce

Not sure if I can reproduce, but this is what I typed to pay a 1 sat bolt11 invoice to Alby with renepay:

lightning-cli renepay lnbc10n1pjmufe6pp5fyucwdq23n3ww37cyhsnu9qltx730zxsvd7p8jsn9uatz369qs7sdq5g9kxy7fqd9h8vmmfvdjscqzzsxqyz5vqsp5dd6m04dkgrmadszwja6xflrxz3ehkkh6j6xcevj7nj55wuwt8g3s9qyyssqk3uevjh47swmq09dy82cnp9th7jq90wnnqmu47kj9zdxmey0u9zyj3hqas4qmux6fxwvpfuu4rqu2vg9djh0v8283462awg5q5ktyfcqfz35vj

Then I checked the status with renepaystatus. The status showed up the first time I ran it, but I didn't save the output.

I tried to run renepay again and got an error saying the renepay plugin was not running.

In the logs, I find this:

2024-02-03T11:45:19.048Z INFO    lightningd: Sending 2000msat over 6 hops to deliver 1000msat
cln-renepay: FATAL SIGNAL 11 (version basedon-v23.08)
0x5596337c85a3 send_backtrace
    common/daemon.c:33
0x5596337c862b crashdump
    common/daemon.c:75
0x7f9f8852113f ???
    ???:0
0x5596337ab464 flow_sendpay_failed
    plugins/renepay/pay.c:302
0x5596337ba121 handle_rpc_reply
    plugins/libplugin.c:860
0x5596337ba342 rpc_read_response_one
    plugins/libplugin.c:1048
0x5596337ba3e0 rpc_conn_read_response
    plugins/libplugin.c:1072
0x5596337e8b5b next_plan
    ccan/ccan/io/io.c:59
0x5596337e8fe2 do_plan
    ccan/ccan/io/io.c:407
0x5596337e907b io_ready
    ccan/ccan/io/io.c:417
0x5596337ea96a io_loop
    ccan/ccan/io/poll.c:453
0x5596337ba61a plugin_main
    plugins/libplugin.c:1949
0x5596337add99 main
    plugins/renepay/pay.c:1334
0x7f9f88343d09 ???
    ???:0
0x5596337aaef9 ???
    ???:0
0xffffffffffffffff ???
    ???:0
2024-02-03T11:45:19.303Z INFO    plugin-cln-renepay: Killing plugin: exited during normal operation
2024-02-03T11:45:35.060Z UNUSUAL plugin-bcli: bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 000000000000000000... 0 exited with status 1
2024-02-03T11:45:36.074Z UNUSUAL plugin-bcli: bitcoin-cli -datadir=/etc/bitcoin -rpcconnect=bitcoind getblock 000000000000000000... 0 exited with status 1
2024-02-03T11:45:36.691Z INFO   -chan#5: htlc 53 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)

The getblock error keeps showing up in my logs because the node is pruned, and I think it is a somewhat expected error that might deserve its own issue. https://github.com/btcpayserver/btcpayserver-docker/issues/373#issuecomment-700155277 and https://github.com/btcpayserver/btcpayserver/issues/5506#event-11499786086 both discuss it. So the getblock error is probably not relevant to why renepay crashed.

Thoughts and comments:

I am running Core Lightning from Docker: btcpayserver/lightning:v23.08, deployed via https://github.com/btcpayserver/btcpayserver-docker

Crashing the renepay plugin like this is good: it forces you to look into the issue. At the same time, it would have been better if it had logged that an exception happened and then restarted the plugin.

As an end user, I find the error message hard to diagnose or understand.

It's not impossible that the htlc 53 error message is related to this crash.

I will update the issue if I am able to reproduce it or if it happens again.

Lagrang3 commented 9 months ago

This is an interesting situation. @Amperstrand, thanks for reporting.

Lagrang3 commented 9 months ago

So, it looks like the sendpay rpc fails and the failure is followed by a bad memory access. To my knowledge, sendpay should fail (provided the arguments are correct) only if the first hop is unreachable. We have logic to exclude from route construction those peers listed as peer_connected=false by the listpeerchannels rpc. This was already present in version 23.08.1 and has changed a bit since, but it is there, and it works in regtest in both 23.08.1 and the most recent version at the tip of the master branch.

Therefore there must be another reason for the failure of sendpay.
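
For illustration, a minimal sketch in Python with pyln-client (not the plugin's actual C code) of the exclusion described above: channels whose peer is reported disconnected by listpeerchannels are skipped as candidate first hops. The socket path is a placeholder.

from pyln.client import LightningRpc

rpc = LightningRpc("/path/to/lightning-rpc")  # placeholder socket path

# Skip channels whose peer is currently disconnected, mirroring the
# peer_connected=false exclusion described above; only the remaining
# channels are considered as candidate first hops.
channels = rpc.listpeerchannels()["channels"]
candidate_first_hops = [
    c for c in channels
    if c.get("peer_connected") and c.get("state") == "CHANNELD_NORMAL"
]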

Amperstrand commented 9 months ago

I'll add that it might (or might not) have been related to trying to pay the same invoice many times in quick succession. I will continue to test on this version and report back if it happens again.

It might also have been that I provided the wrong arguments, as I was testing the clnrest interface. But if I provide the wrong arguments, I would expect a soft failure mode where it rejects my request but continues to answer subsequent requests; in this case it crashed. I am exploring whether it can be safe to give semi-trusted third-party wallets and apps access to a Core Lightning node, restricting them through runes so that they cannot damage the node or drain funds from it.
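
For context, a hedged example of that kind of rune restriction using pyln-client and the commando-rune command available in v23.08 (newer releases expose the same idea as createrune); the socket path and the chosen restriction are illustrative only.

from pyln.client import LightningRpc

rpc = LightningRpc("/path/to/lightning-rpc")  # placeholder socket path

# Mint a rune that a semi-trusted app can use, restricted to the renepay
# method only; exactly which restrictions are appropriate is a policy
# choice, not something this issue prescribes.
rune = rpc.call("commando-rune", {"restrictions": [["method=renepay"]]})
print(rune["rune"])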

I think a lot of people are still on this version because it is what you end up with if you follow the steps to deploy btcpayserver via docker.

Lagrang3 commented 8 months ago

I was able to make it crash by faking a sendpay rpc call that fails with some error different from PAY_TRY_OTHER_ROUTE.

cln-renepay: FATAL SIGNAL 11 (version v23.08-modded)
# we have computed a set of 1 flows with probability 0.996, fees 42msat and delay 28
#   Flow 1: amount=2000000msat prob=0.996 fees=42msat delay=12 path=-597x1x1/0(min=max=987439978msat)->-604x1x0/0->-611x1x2/1->
0x55ff92590cf9 send_backtrace
    common/daemon.c:33
0x55ff92590d81 crashdump
    common/daemon.c:75
0x7f55e2f1ffcf ???
    ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x55ff92573f74 flow_sendpay_failed
    plugins/renepay/pay.c:303
0x55ff9258243b handle_rpc_reply
    plugins/libplugin.c:860
0x55ff92582661 rpc_read_response_one
    plugins/libplugin.c:1048

I couldn't trigger the same crash in the most recent version, but I am digging deeper to make sure.

Lagrang3 commented 8 months ago

I'll add that it might (or might not) have been related to trying to pay the same invoice many times in quick succession.

It could be that. But

  1. renepay shouldn't duplicate sendpays,
  2. it shouldn't crash with a bad memory access.

Amperstrand commented 8 months ago

2024-02-05T10:47:53.139Z INFO    XXnodeid-chan#5: htlc 119 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:47:55.504Z INFO    lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:00.013Z INFO    XXnodeid-chan#5: htlc 120 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:02.517Z INFO    lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:09.280Z INFO    XXnodeid-chan#5: htlc 121 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:11.658Z INFO    lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:16.982Z INFO    XXnodeid-chan#5: htlc 122 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:19.333Z INFO    lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:24.153Z INFO    XXnodeid-chan#5: htlc 123 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:26.418Z INFO    lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:30.671Z INFO    XXnodeid-chan#5: htlc 124 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:30.699Z **BROKEN** plugin-clnrest.py: Error: RPC call failed: method: renepay, payload: {'invstring': 'lnbc440n1pjup0r5pp5s6zxm85vkxuzxs4m2pfpn3cvuz2pk2gp835rntlft3863nnwx6yqdq5g9kxy7fqd9h8vmmfvdjscqzzsxqyz5vqsp5vpu64l25psd2270ljy2dj6gke
f6vuc7vpezfe0nzaesah83tydvq9qyyssq0a4yf6a9t7d0psvwvsu7u2qx7jep84ajcumelyg4cyd5ahggpc05x54c0lqutfx5lqs92x3xlnvepr588kfh4eaars739rupfvewtpspfcvhjw', 'maxfee': 2000, 'retry_for': 60}, error: {'code': 210, 'message': 'Timed out'}
2024-02-05T10:48:30.699Z **BROKEN** plugin-clnrest.py: Error: {error: {'code': 210, 'message': 'Timed out'}}
2024-02-05T10:51:44.908Z INFO    lightningd: Sending 45000msat over 10 hops to deliver 44000msat
cln-renepay: FATAL SIGNAL 11 (version basedon-v23.08)
0x559c14fcb5a3 send_backtrace
        common/daemon.c:33
0x559c14fcb62b crashdump
        common/daemon.c:75
0x7ff47844613f ???
        ???:0
0x559c14fae464 flow_sendpay_failed
        plugins/renepay/pay.c:302
0x559c14fbd121 handle_rpc_reply
        plugins/libplugin.c:860
0x559c14fbd342 rpc_read_response_one
        plugins/libplugin.c:1048
0x559c14fbd3e0 rpc_conn_read_response
        plugins/libplugin.c:1072
0x559c14febb5b next_plan
        ccan/ccan/io/io.c:59
0x559c14febfe2 do_plan
        ccan/ccan/io/io.c:407
0x559c14fec07b io_ready
        ccan/ccan/io/io.c:417
0x559c14fed96a io_loop
        ccan/ccan/io/poll.c:453
0x559c14fbd61a plugin_main
        plugins/libplugin.c:1949
0x559c14fb0d99 main
        plugins/renepay/pay.c:1334
0x7ff478268d09 ???
        ???:0
0x559c14fadef9 ???
        ???:0
0xffffffffffffffff ???
        ???:0
2024-02-05T10:51:45.178Z INFO    plugin-cln-renepay: Killing plugin: exited during normal operation
2024-02-05T10:51:45.185Z **BROKEN** plugin-clnrest.py: Error: RPC call failed: method: renepay, payload: {'invstring': 'lnbc440n1pjup0r5pp5s6zxm85vkxuzxs4m2pfpn3cvuz2pk2gp835rntlft3863nnwx6yqdq5g9kxy7fqd9h8vmmfvdjscqzzsxqyz5vqsp5vpu64l25psd2270ljy2dj6gke
f6vuc7vpezfe0nzaesah83tydvq9qyyssq0a4yf6a9t7d0psvwvsu7u2qx7jep84ajcumelyg4cyd5ahggpc05x54c0lqutfx5lqs92x3xlnvepr588kfh4eaars739rupfvewtpspfcvhjw', 'maxfee': 2000, 'retry_for': 60}, error: {'code': -4, 'message': 'Plugin terminated before replying to RPC call.'}
2024-02-05T10:51:45.185Z **BROKEN** plugin-clnrest.py: Error: {error: {'code': -4, 'message': 'Plugin terminated before replying to RPC call.'}}

I triggered it again, but I am still unable to reproduce it reliably. I was calling renepay over CLNrest from LNBits.

Both times I was calling renepaystatus while the payment was ongoing. From now on I will try to call this every second and log the output.

Lagrang3 commented 8 months ago

The bad memory access was already fixed in this commit 14e28a51ce579bf273f29b1c15fd6c86792586c5.

Lagrang3 commented 8 months ago

I am keeping this open because whatever happened in this case triggered a plugin_err, which tells us either that something is really wrong or that we are not handling some cases properly.

Amperstrand commented 8 months ago

I will try to find a way to reliably reproduce this using some loops and more verbose logging.

Things that may play into this:

https://github.com/ElementsProject/lightning/issues/6620 talks about WIRE_INCORRECT_CLTV_EXPIRY which is also in my logs and might be related.

https://github.com/ElementsProject/lightning/issues/6535 also talks about renepay crashing (but in a slightly different way)

I would be happy to upgrade to 23.11, but the bug not being visible there is not proof that it is fixed unless I find a reliable way to reproduce it.

So for now, I will just keep updating here if it happens again or if I figure out a way to reproduce it by making random payments.

If there is something I should be doing other than turning up the logging level, please leave a comment.

Lagrang3 commented 8 months ago

The problem is that the bad memory access in your version makes the plugin fail before it can produce a useful log message that would tell us what went wrong in your case.

Lagrang3 commented 8 months ago

After patching the bad memory access, I can make the plugin break by calling renepay with the same invoice in two different terminals at the same time:

2024-02-06T08:40:52.807Z DEBUG   plugin-cln-renepay: calling flow_sendpay_failed
2024-02-06T08:40:52.807Z **BROKEN** plugin-cln-renepay: Strange error from sendpay: {\"code\":200,\"message\":\"Already have 2000000msat of 2000000msat payments in progress\"}

The issue is that calling renepay is non-blocking: the plugin cannot stop anyone from issuing the same request twice, but it shouldn't make a sendpay call without first checking whether the same request has already been submitted. There is already a listsendpays rpc call that checks whether we have a payment in progress (also in version 23.08), but something is wrong with it. I'll dive deeper.
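
For illustration, a minimal sketch in Python (not renepay's actual C code) of the kind of guard described here: consult listsendpays for the invoice's payment_hash and skip issuing new parts while a previous attempt is still pending. The socket path and payment hash are placeholders.

from pyln.client import LightningRpc

rpc = LightningRpc("/path/to/lightning-rpc")  # placeholder socket path


def has_pending_parts(payment_hash):
    """True if a previous sendpay for this payment_hash is still in flight."""
    parts = rpc.listsendpays(payment_hash=payment_hash)["payments"]
    return any(p["status"] == "pending" for p in parts)


payment_hash = "00" * 32  # hypothetical hash parsed from the invoice
if has_pending_parts(payment_hash):
    print("payment already in progress; not submitting duplicate parts")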

Lagrang3 commented 8 months ago

I found a race condition. The chain of events after an rpc call to renepay is the following:

(rpc) renepay
     |
     |
json_pay: parses the invoice
     |
     |
(rpc) listsendpays
     |
     |
payment_listsendpays_previous: learns about the previous sendpays in order to know the amount that we need to send and to avoid sendpay collisions
     |
     |
(rpc) listpeerchannels
     |
     |
listpeerchannels_done: learns about the local channels and their balances
     |
     |
try_paying: runs the MCF (minimum cost flow) computation and constructs the payment routes
     |
     |
(rpc) sendpay

If you call renepay twice, for example with this test:

import subprocess

from pyln.testing.utils import TEST_NETWORK


def test_concurrency(node_factory):
    l1, l2, l3 = node_factory.line_graph(3,
                                         wait_for_announce=True,
                                         opts=[{}, {}, {}])
    inv = l3.rpc.invoice(1000, 'test_renepay', 'description')['bolt11']
    # Launch two renepay calls for the same invoice concurrently.
    p1 = subprocess.Popen(['cli/lightning-cli',
                           '--network={}'.format(TEST_NETWORK),
                           '--lightning-dir={}'
                           .format(l1.daemon.lightning_dir),
                           '-k',
                           'renepay', 'invstring={}'.format(inv)],
                          stdout=subprocess.PIPE)
    p2 = subprocess.Popen(['cli/lightning-cli',
                           '--network={}'.format(TEST_NETWORK),
                           '--lightning-dir={}'
                           .format(l1.daemon.lightning_dir),
                           '-k',
                           'renepay', 'invstring={}'.format(inv)],
                          stdout=subprocess.PIPE)
    p1.wait(timeout=60)
    p2.wait(timeout=60)

What happens is that p1 goes through (rpc) listsendpays and learns that there are no previous payment attempts, so it goes on and calls (rpc) listpeerchannels. Then p2 starts running, also learns from (rpc) listsendpays that there are no previous payment attempts, and calls (rpc) listpeerchannels. Then p1 wakes up, runs try_paying, produces some (rpc) sendpay requests, and goes to sleep waiting for success or failure notifications. Then p2 wakes up and also goes through try_paying, but it is unaware that while it was sleeping p1 had already submitted some (rpc) sendpay requests, so it tries the same sendpays that p1 tried a few milliseconds earlier.

To fix this, I could add an internal database of sendpays that can be checked before try_paying, or I could invert the order of the rpcs: first call (rpc) listpeerchannels and leave (rpc) listsendpays for last, so that when the plugin wakes up it discovers the latest sendpays directly from lightningd and goes immediately to try_paying.
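
For illustration, a minimal sketch of the first option, an in-plugin registry of already-submitted sendpays checked before issuing new ones; the names and the Python (rather than the plugin's C) are hypothetical.

# Hypothetical in-plugin registry keyed by (payment_hash, groupid, partid):
# record every sendpay before submitting it, so a second renepay call that
# raced past listsendpays still sees the local record and skips the duplicate.
submitted = set()


def try_sendpay(rpc, route, payment_hash, groupid, partid, amount_msat):
    key = (payment_hash, groupid, partid)
    if key in submitted:
        return None  # this part was already submitted by another call
    submitted.add(key)
    return rpc.call("sendpay", {
        "route": route,
        "payment_hash": payment_hash,
        "amount_msat": amount_msat,
        "partid": partid,
        "groupid": groupid,
    })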

My guess is that if the pipeline were the following, the aforementioned race condition would not be present:

(rpc) renepay
     |
     |
json_pay: parses the invoice
     |
     |
(rpc) listpeerchannels
     |
     |
listpeerchannels_done: learns about the local channels and their balances
     |
     |
(rpc) listsendpays
     |
     |
payment_listsendpays_previous: learns about the previous sendpays in order to know the amount that we need to send and to avoid sendpay collisions
     |
     |
try_paying: runs the MCF (minimum cost flow) computation and constructs the payment routes
     |
     |
(rpc) sendpay

because there is no pause between receiving the result from listsendpays and issuing the new sendpay requests.
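
For illustration, a minimal Python sketch (not the plugin's actual C callback chain) of that reordering: listpeerchannels is fetched first and listsendpays is consulted last, immediately before the sendpay requests, so there is no extra RPC round-trip in between where a concurrent renepay call could slip in. The routes_from helper stands in for the MCF route construction and is hypothetical.

def pay_once(rpc, invoice, routes_from):
    # Parse the invoice to obtain the payment hash.
    payment_hash = rpc.decode(invoice)["payment_hash"]
    # Learn about local channels and their balances first.
    channels = rpc.listpeerchannels()["channels"]
    # Only now, immediately before sending, look at previous sendpays; in
    # the reordered pipeline no further RPC happens between this check and
    # the sendpay requests, closing the race window described above.
    pending = [p for p in rpc.listsendpays(payment_hash=payment_hash)["payments"]
               if p["status"] == "pending"]
    if pending:
        return None
    # Build routes (placeholder for the MCF computation) and send each part.
    for route in routes_from(channels):
        rpc.call("sendpay", {"route": route, "payment_hash": payment_hash})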