This is an interesting situation. @Amperstrand, thanks for reporting.
So, it looks like the sendpay rpc fails and that failure is followed by a bad memory access. To my knowledge, sendpay should fail (provided the arguments are correct) only if the first hop is unreachable. We have logic to exclude from the route construction those peers listed as peer_connected=false by the listpeerchannels rpc. This was already present in version 23.08.1 and it has changed a bit since, but it is there, and it works on regtest both in 23.08.1 and at the tip of the master branch.
Therefore there must be another reason for the failure of sendpay.
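For illustration only, here is a minimal Python sketch (using pyln-client against a local node; the socket path is a placeholder, and this is not the plugin's actual C code) of the kind of filtering described above:
from pyln.client import LightningRpc

rpc = LightningRpc("/path/to/lightning-rpc")  # placeholder socket path

# listpeerchannels reports peer_connected for every local channel; keep only
# channels whose peer is currently connected, mirroring the exclusion logic
# described above.
channels = rpc.listpeerchannels()["channels"]
usable = [c for c in channels if c.get("peer_connected")]
print("local channels usable as a first hop:", len(usable))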
I'll add that it might (or might not) have been related to trying to pay the same invoice many times in quick succession. I will continue to test on this version and report back if it happens again.
It might also have been that I provided the wrong arguments while testing the clnrest interface. But if I provide the wrong arguments, I would expect a soft failure mode where it rejects my request and keeps answering subsequent requests; in this case it crashed. I am exploring whether it is safe to give semi-trusted third-party wallets and apps access to a Core Lightning node, restricting them through runes so that they cannot damage the node or drain funds from it.
I think a lot of people are still on this version because it is what you end up with if you follow the steps to deploy btcpayserver via docker.
I was able to make it crash by faking a sendpay rpc call that fails with some error different from PAY_TRY_OTHER_ROUTE.
cln-renepay: FATAL SIGNAL 11 (version v23.08-modded)
# we have computed a set of 1 flows with probability 0.996, fees 42msat and delay 28
# Flow 1: amount=2000000msat prob=0.996 fees=42msat delay=12 path=-597x1x1/0(min=max=987439978msat)->-604x1x0/0->-611x1x2/1->
0x55ff92590cf9 send_backtrace
common/daemon.c:33
0x55ff92590d81 crashdump
common/daemon.c:75
0x7f55e2f1ffcf ???
./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x55ff92573f74 flow_sendpay_failed
plugins/renepay/pay.c:303
0x55ff9258243b handle_rpc_reply
plugins/libplugin.c:860
0x55ff92582661 rpc_read_response_one
plugins/libplugin.c:1048
I couldn't trigger the same in the most recent version. But I am going deeper to make sure.
I'll add that it might (or might not) have been related to trying to pay the same invoice many times in quick succession.
It could be that. But renepay shouldn't duplicate sendpays.
2024-02-05T10:47:53.139Z INFO XXnodeid-chan#5: htlc 119 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:47:55.504Z INFO lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:00.013Z INFO XXnodeid-chan#5: htlc 120 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:02.517Z INFO lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:09.280Z INFO XXnodeid-chan#5: htlc 121 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:11.658Z INFO lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:16.982Z INFO XXnodeid-chan#5: htlc 122 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:19.333Z INFO lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:24.153Z INFO XXnodeid-chan#5: htlc 123 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:26.418Z INFO lightningd: Sending 45000msat over 10 hops to deliver 44000msat
2024-02-05T10:48:30.671Z INFO XXnodeid-chan#5: htlc 124 failed from 1th node with code 0x100d (WIRE_INCORRECT_CLTV_EXPIRY)
2024-02-05T10:48:30.699Z **BROKEN** plugin-clnrest.py: Error: RPC call failed: method: renepay, payload: {'invstring': 'lnbc440n1pjup0r5pp5s6zxm85vkxuzxs4m2pfpn3cvuz2pk2gp835rntlft3863nnwx6yqdq5g9kxy7fqd9h8vmmfvdjscqzzsxqyz5vqsp5vpu64l25psd2270ljy2dj6gke
f6vuc7vpezfe0nzaesah83tydvq9qyyssq0a4yf6a9t7d0psvwvsu7u2qx7jep84ajcumelyg4cyd5ahggpc05x54c0lqutfx5lqs92x3xlnvepr588kfh4eaars739rupfvewtpspfcvhjw', 'maxfee': 2000, 'retry_for': 60}, error: {'code': 210, 'message': 'Timed out'}
2024-02-05T10:48:30.699Z **BROKEN** plugin-clnrest.py: Error: {error: {'code': 210, 'message': 'Timed out'}}
2024-02-05T10:51:44.908Z INFO lightningd: Sending 45000msat over 10 hops to deliver 44000msat
cln-renepay: FATAL SIGNAL 11 (version basedon-v23.08)
0x559c14fcb5a3 send_backtrace
common/daemon.c:33
0x559c14fcb62b crashdump
common/daemon.c:75
0x7ff47844613f ???
???:0
0x559c14fae464 flow_sendpay_failed
plugins/renepay/pay.c:302
0x559c14fbd121 handle_rpc_reply
plugins/libplugin.c:860
0x559c14fbd342 rpc_read_response_one
plugins/libplugin.c:1048
0x559c14fbd3e0 rpc_conn_read_response
plugins/libplugin.c:1072
0x559c14febb5b next_plan
ccan/ccan/io/io.c:59
0x559c14febfe2 do_plan
ccan/ccan/io/io.c:407
0x559c14fec07b io_ready
ccan/ccan/io/io.c:417
0x559c14fed96a io_loop
ccan/ccan/io/poll.c:453
0x559c14fbd61a plugin_main
plugins/libplugin.c:1949
0x559c14fb0d99 main
plugins/renepay/pay.c:1334
0x7ff478268d09 ???
???:0
0x559c14fadef9 ???
???:0
0xffffffffffffffff ???
???:0
2024-02-05T10:51:45.178Z INFO plugin-cln-renepay: Killing plugin: exited during normal operation
2024-02-05T10:51:45.185Z **BROKEN** plugin-clnrest.py: Error: RPC call failed: method: renepay, payload: {'invstring': 'lnbc440n1pjup0r5pp5s6zxm85vkxuzxs4m2pfpn3cvuz2pk2gp835rntlft3863nnwx6yqdq5g9kxy7fqd9h8vmmfvdjscqzzsxqyz5vqsp5vpu64l25psd2270ljy2dj6gke
f6vuc7vpezfe0nzaesah83tydvq9qyyssq0a4yf6a9t7d0psvwvsu7u2qx7jep84ajcumelyg4cyd5ahggpc05x54c0lqutfx5lqs92x3xlnvepr588kfh4eaars739rupfvewtpspfcvhjw', 'maxfee': 2000, 'retry_for': 60}, error: {'code': -4, 'message': 'Plugin terminated before replying to RPC call.'}
2024-02-05T10:51:45.185Z **BROKEN** plugin-clnrest.py: Error: {error: {'code': -4, 'message': 'Plugin terminated before replying to RPC call.'}}
It happened again, but I am still unable to reproduce it reliably. I was calling renepay over clnrest from LNbits.
Both times I was calling renepaystatus while the payment was ongoing. From now on I will call renepaystatus every second and log the output.
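For reference, a minimal sketch of the polling I have in mind (the bolt11 string, the log file name and the once-per-second interval are placeholders; it assumes lightning-cli is on PATH and pointed at the right lightning-dir):
import subprocess
import time

BOLT11 = "lnbc...placeholder"  # the invoice being paid

while True:
    # renepaystatus accepts the invoice string to narrow the report
    out = subprocess.run(["lightning-cli", "renepaystatus", BOLT11],
                         capture_output=True, text=True)
    with open("renepaystatus.log", "a") as f:
        f.write(out.stdout)
        f.write("\n")
    time.sleep(1)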
The bad memory access was already fixed in this commit 14e28a51ce579bf273f29b1c15fd6c86792586c5.
I am keeping this open because whatever happened in this case triggered a plugin_err, which tells us that either something is really wrong or we are simply not handling some cases properly.
Will try to find a way to reliably reproduce this using some loops and enabling more logging.
Things that may play into this:
https://github.com/ElementsProject/lightning/issues/6620 talks about WIRE_INCORRECT_CLTV_EXPIRY, which also appears in my logs and might be related.
https://github.com/ElementsProject/lightning/issues/6535 also talks about renepay crashing (but in a slightly different way)
I would be happy to upgrade to 23.11, but the bug not showing up there is not proof that it is fixed unless I find a reliable way to reproduce it.
So for now, I will just keep updating here if it happens again or if I figure out a way to reproduce it by making random payments.
If there is something I should be doing other than turning up the logging level, please leave a comment.
The problem is that the bad memory access in your version makes the plugin fail before it can produce a useful log message that would help us understand what went wrong in your case.
After patching the bad memory access I can make the plugin break by calling renepay with the same invoice on two different terminals at the same time:
2024-02-06T08:40:52.807Z DEBUG plugin-cln-renepay: calling flow_sendpay_failed
2024-02-06T08:40:52.807Z **BROKEN** plugin-cln-renepay: Strange error from sendpay: {\"code\":200,\"message\":\"Already have 2000000msat of 2000000msat payments in progress\"}
The issue is that calling renepay is non-blocking; the plugin cannot stop anyone from issuing the same sendpay rpc, but it shouldn't make a sendpay call without first checking whether the same request has already been made. There is already a listsendpays rpc call that checks whether we have a payment in progress (also in version 23.08), but something is wrong with it. I'll dive deeper.
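A minimal sketch of that kind of guard, assuming pyln-client and a known payment hash (both placeholders below, not renepay's actual code), would look like this:
from pyln.client import LightningRpc

rpc = LightningRpc("/path/to/lightning-rpc")  # placeholder socket path
payment_hash = "00" * 32                      # placeholder payment hash

# Ask lightningd which parts for this payment already exist and whether any
# of them are still pending before issuing another sendpay.
payments = rpc.listsendpays(payment_hash=payment_hash)["payments"]
pending = [p for p in payments if p["status"] == "pending"]
if pending:
    print("payment already in progress; do not call sendpay again")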
I found a race condition.
The chain of events after an rpc call to renepay is the following:
(rpc) renepay
|
|
json_pay: parses the invoice
|
|
(rpc) listsendpays
|
|
payment_listsendpays_previous: learns about the previous sendpays in order to know the amount that we need to send and to avoid sendpay collisions
|
|
(rpc) listpeerchannels
|
|
listpeerchannels_done: learns about the local channels and their balances
|
|
try_paying: runs the MCF (min-cost flow) computation and constructs the payment routes
|
|
(rpc) sendpay
If you call renepay twice, for example with this test:
# Assumes CLN's pytest harness: node_factory comes from pyln-testing and
# TEST_NETWORK from the test suite's utils module.
import subprocess
from utils import TEST_NETWORK


def test_concurrency(node_factory):
    l1, l2, l3 = node_factory.line_graph(3,
                                         wait_for_announce=True,
                                         opts=[{}, {}, {}])
    inv = l3.rpc.invoice(1000, 'test_renepay', 'description')['bolt11']
    # Fire two renepay calls for the same invoice at (almost) the same time.
    p1 = subprocess.Popen(['cli/lightning-cli',
                           '--network={}'.format(TEST_NETWORK),
                           '--lightning-dir={}'
                           .format(l1.daemon.lightning_dir),
                           '-k',
                           'renepay', 'invstring={}'.format(inv)],
                          stdout=subprocess.PIPE)
    p2 = subprocess.Popen(['cli/lightning-cli',
                           '--network={}'.format(TEST_NETWORK),
                           '--lightning-dir={}'
                           .format(l1.daemon.lightning_dir),
                           '-k',
                           'renepay', 'invstring={}'.format(inv)],
                          stdout=subprocess.PIPE)
    p1.wait(timeout=60)
    p2.wait(timeout=60)
It happens that p1 goes through (rpc) listsendpays and learns that there are no previous payment attempts, so it goes on and calls (rpc) listpeerchannels; then p2 starts running, also learns from (rpc) listsendpays that there are no previous payment attempts, and calls (rpc) listpeerchannels. Then p1 wakes up, starts try_paying, produces some (rpc) sendpay requests, and goes back to sleep waiting for success or failure notifications. Then p2 wakes up and also goes through try_paying, but it is unaware that while it was sleeping p1 had already submitted some (rpc) sendpay requests, so it tries the same sendpays that p1 tried a few milliseconds earlier.
To fix this, I could add an internal database of sendpays that can be checked before try_paying, or I could invert the order of the rpc's: first call (rpc) listpeerchannels and leave (rpc) listsendpays for last, so that when the plugin wakes up it discovers the latest sendpays directly from lightningd and goes immediately to try_paying.
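As a rough illustration of the first option (the internal record of issued sendpays), here is a Python sketch; the plugin itself is C, so the helper name and the (payment_hash, groupid, partid) key below are purely hypothetical:
from pyln.client import LightningRpc

rpc = LightningRpc("/path/to/lightning-rpc")  # placeholder socket path
issued = set()  # parts this process has already handed to lightningd

def submit_part(route, payment_hash, groupid, partid, amount_msat):
    """Submit one payment part unless an identical part was already issued."""
    key = (payment_hash, groupid, partid)
    if key in issued:
        return None  # a concurrent attempt already submitted this part
    issued.add(key)
    return rpc.call("sendpay", {
        "route": route,
        "payment_hash": payment_hash,
        "groupid": groupid,
        "partid": partid,
        "amount_msat": amount_msat,
    })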
My guess is that if the pipeline were the following, the aforementioned race condition would not be present:
(rpc) renepay
|
|
json_pay: parses the invoice
|
|
(rpc) listpeerchannels
|
|
listpeerchannels_done: learns about the local channels and their balances
|
|
(rpc) listsendpays
|
|
payment_listsendpays_previous: learns about the previous sendpays in order to know the amount that we need to send and to avoid sendpay collisions
|
|
try_paying: runs the MCF (min-cost flow) computation and constructs the payment routes
|
|
(rpc) sendpay
because there is no pause between receiving the result from listsendpays and issuing the new sendpay requests.
Issue and Steps to Reproduce
Not sure if I can reproduce, but this is what I typed to pay a 1 sat bolt11 invoice to Alby with renepay:
Then I got the status with renepaystatus. The status showed up the first time I ran it, but I didn't save the data.
I tried to run renepay again and got an error saying the renepay plugin was not running.
In the logs, I find this:
The getblock error keeps showing up in my logs because it is running on a pruned node, and I think this is a somewhat expected error which might deserve its own issue. https://github.com/btcpayserver/btcpayserver-docker/issues/373#issuecomment-700155277 and https://github.com/btcpayserver/btcpayserver/issues/5506#event-11499786086 both talk about this. So the getblock error is probably not relevant for why renepay crashed.
Thoughts and comments:
I am running Core Lightning from Docker (btcpayserver/lightning:v23.08), deployed via https://github.com/btcpayserver/btcpayserver-docker
Crashing the renepay plugin like this is good: it forces you to look into the issue. At the same time, it would have been better if it had logged that an exception happened and then restarted the plugin.
The error message is hard to diagnose or understand as an end user.
It's not impossible that the htlc 53 error message is related to this error.
I will update the issue if I am able to reproduce it or if it happens again.