ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.85k stars 901 forks source link

lightningd stuck while funding a channel with a custom backend and eat memory around 20 GB of RAM #6183

Closed vincenzopalazzo closed 1 year ago

vincenzopalazzo commented 1 year ago

While testing a new backend plugin for core lightning, and testing it I get stuck with the fundchannel command.

Originally I thought that this was a problem from my plugin side https://github.com/coffee-tools/folgore/issues/21 but then I make a stack report from core lightning with kill -SEGV <PID> and then I get the following report

2023-04-16T10:58:23.543Z DEBUG   plugin-folgore_plugin: call estimate fee info
2023-04-16T10:58:23.544Z DEBUG   plugin-folgore_plugin: fee estimated from nakamoto {6: 1240000000, 12: 1350000000, 100: 3610000000, 2: 1270000000}
2023-04-16T10:58:23.544Z DEBUG   plugin-folgore_plugin: Ok(Object {\"delayed_to_us\": Number(1350000000), \"htlc_resolution\": Number(1240000000), \"max_acceptable\": Number(2540000000), \"min_acceptable\": Number(1805000000), \"mutual_close\": Number(3610000000), \"opening\": Number(1350000000), \"penalty\": Number(1350000000), \"unilateral_close\": Number(1240000000)})
2023-04-16T10:58:30.538Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2023-04-16T10:58:34.731Z DEBUG   027a1cd64065dc39275b011ae7d89e04ff523bb21b84a50a1b160e84e3a715a838-connectd: Failed connected out: All addresses failed: 18.217.156.121:9735: Cryptographic handshake: Connection reset by peer. 
2023-04-16T10:58:34.731Z DEBUG   027a1cd64065dc39275b011ae7d89e04ff523bb21b84a50a1b160e84e3a715a838-lightningd: Will try reconnect in 300 seconds
2023-04-16T10:58:35.891Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2023-04-16T10:58:35.892Z DEBUG   03241310d658378e07a8ba0b56818e63d9e296d5b8eff77f0f2fca7dc93836969e-connectd: Failed connected out: All addresses failed: 127.0.0.1:41308: Connection establishment: Connection refused. 
2023-04-16T10:58:35.892Z DEBUG   03241310d658378e07a8ba0b56818e63d9e296d5b8eff77f0f2fca7dc93836969e-lightningd: Will try reconnect in 300 seconds
2023-04-16T10:58:36.822Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2023-04-16T10:58:36.822Z DEBUG   023fddaa53e214aa3d6f2e205366baf28f3bbe5f8bf056336dd53c84319c536d3c-connectd: Failed connected out: All addresses failed: 127.0.0.1:53480: Connection establishment: Connection refused. 
2023-04-16T10:58:36.822Z DEBUG   023fddaa53e214aa3d6f2e205366baf28f3bbe5f8bf056336dd53c84319c536d3c-lightningd: Will try reconnect in 300 seconds
2023-04-16T10:58:38.380Z DEBUG   connectd: Error connecting to s4fozzy225kfh4quwf72j43ivo6oyhjejx2k7g4ywjp3nd4qc3yszxqd.onion: Tor server reply: host unreachable
2023-04-16T10:58:38.380Z DEBUG   0249768036bb283400993987970122eeb41cac7abda6fd0199f03ce1aac6a7e67f-connectd: Failed connected out: All addresses failed: Error connecting to s4fozzy225kfh4quwf72j43ivo6oyhjejx2k7g4ywjp3nd4qc3yszxqd.onion: Tor server reply: host unreachable. s4fozzy225kfh4quwf72j43ivo6oyhjejx2k7g4ywjp3nd4qc3yszxqd.onion:19735: Connection establishment: Connection refused. 
2023-04-16T10:58:38.380Z DEBUG   0249768036bb283400993987970122eeb41cac7abda6fd0199f03ce1aac6a7e67f-lightningd: Will try reconnect in 300 seconds
2023-04-16T10:58:41.310Z DEBUG   plugin-spenderp: mfc 34: multiconnect.
2023-04-16T10:58:41.310Z DEBUG   plugin-spenderp: mfc 34, dest 0: connect 02dc20c9568d70ed13e76a757e45d8aabef62875932e130554394cb8596704ec29.
2023-04-16T10:58:41.310Z DEBUG   gossipd: REPLY WIRE_GOSSIPD_GET_ADDRS_REPLY with 0 fds
2023-04-16T10:58:44.399Z DEBUG   connectd: Now try LN connect out for host mnwtl236pyzacsgw7rwzjfxp5wmuyty2h5bgoob7sie7bbcaqzvxcxid.onion
2023-04-16T10:58:44.399Z DEBUG   02dc20c9568d70ed13e76a757e45d8aabef62875932e130554394cb8596704ec29-connectd: Connected out, starting crypto
2023-04-16T10:58:45.204Z DEBUG   03b39d1ddf13ce486de74e9e44e0538f960401a9ec75534ba9cfe4100d65426880-hsmd: Got WIRE_HSMD_ECDH_REQ
2023-04-16T10:58:45.204Z DEBUG   hsmd: Client: Received message 1 from client
2023-04-16T10:58:45.204Z DEBUG   02dc20c9568d70ed13e76a757e45d8aabef62875932e130554394cb8596704ec29-connectd: Connect OUT
2023-04-16T10:58:45.204Z DEBUG   02dc20c9568d70ed13e76a757e45d8aabef62875932e130554394cb8596704ec29-connectd: peer_out WIRE_INIT
2023-04-16T10:58:45.838Z DEBUG   02dc20c9568d70ed13e76a757e45d8aabef62875932e130554394cb8596704ec29-connectd: peer_in WIRE_INIT
2023-04-16T10:58:45.838Z DEBUG   lightningd: Calling peer_connected hook of plugin chanbackup
2023-04-16T10:58:45.838Z DEBUG   02dc20c9568d70ed13e76a757e45d8aabef62875932e130554394cb8596704ec29-connectd: Handed peer, entering loop
2023-04-16T10:58:45.838Z DEBUG   lightningd: Plugin chanbackup returned from peer_connected hook call
2023-04-16T10:58:45.839Z DEBUG   plugin-spenderp: mfc 34, dest 0: connect done.
2023-04-16T10:58:45.839Z DEBUG   plugin-spenderp: mfc 34: multiconnect done.
2023-04-16T10:58:45.839Z DEBUG   plugin-spenderp: mfc 34: 'parsefeerate' done
2023-04-16T10:58:45.839Z DEBUG   plugin-spenderp: mfc 34: fundpsbt.
2023-04-16T11:21:13.597Z **BROKEN** lightningd: FATAL SIGNAL 11 (version v23.02.2-232-g8d4f345)
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x55a89a8f77b5
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: common/daemon.c:75 (crashdump) 0x55a89a8f793b
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2cd77f4f
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2cdd204c
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2cdd4e62
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2cfc64df
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2d05c72b
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2d044dcf
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2d0b9376
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2d063cc6
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2d032c0e
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2d033402
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2d033579
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: db/db_sqlite3.c:226 (db_sqlite3_query) 0x55a89a961eab
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: db/utils.c:148 (db_query_prepared_canfail) 0x55a89a95fb85
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: db/utils.c:156 (db_query_prepared) 0x55a89a95fbc9
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: wallet/wallet.c:590 (wallet_find_utxo) 0x55a89a8d56f0
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: wallet/reservation.c:487 (json_fundpsbt) 0x55a89a8e80fe
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:655 (command_exec) 0x55a89a886aa7
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:783 (rpc_command_hook_final) 0x55a89a8870b6
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:288 (plugin_hook_call_) 0x55a89a8c3e8f
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:871 (plugin_hook_call_rpc_command) 0x55a89a8874af
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:979 (parse_request) 0x55a89a887a18
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:1084 (read_json) 0x55a89a887ec4
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x55a89a9ee701
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x55a89a9ef2cc
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x55a89a9ef30a
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x55a89a9f15d4
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x55a89a884ce6
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1231 (main) 0x55a89a88b7b0
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2cd6278f
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7f1f2cd62849
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x55a89a859d24
2023-04-16T11:21:13.598Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff

and core lightning is stuck with the following command

clightning --testnet fundchannel 02dc20c9568d70ed13e76a757e45d8aabef62875932e130554394cb8596704ec29 40000

Putting this there because the stacktrace in wallet_find_utxo sounds worrying to me!

vincenzopalazzo commented 1 year ago

Ok looks like that with wrong (or too big estimation fee 2023-04-16T10:58:23.544Z DEBUG plugin-folgore_plugin: fee estimated from nakamoto {6: 1240000000, 12: 1350000000, 100: 3610000000, 2: 1270000000}) we some loop in cln

rustyrussell commented 1 year ago

I cannot reproduce this. See #6204

TonyGiorgio commented 1 year ago

We just hit this in production and our node is unusable. @vincenzopalazzo is the solution just to upgrade to the latest version?

vincenzopalazzo commented 1 year ago

Maybe I never trigger the bug again, but if this happens means that bcli plugin is going crazy.

I see also you PR, did you solve the problem of upgrading to the new bitcoin version?

TonyGiorgio commented 1 year ago

We're still getting infrastructure updated across the board and will find out soon. Though we aren't doing the same thing as you were by running a custom backend that was providing fee rates, so I highly doubt this will fix our issue and there's some other critical bug with the fundchannel call. I'll try to get a stack report when we get to that point.

vincenzopalazzo commented 1 year ago

I see keep me updated so we can debug it

TonyGiorgio commented 1 year ago

@vincenzopalazzo It's still happening with 23.08:

root@ip-10-20-28-53:/opt/pysetup# tail -f /mnt/shared/cln.log
2023-09-08T14:39:56.052Z DEBUG   0313d313d45ce0b10c406a2d249d7643865451aa8e783923909cb794dbc0b083f0-onchaind-chan#1601: billboard: 1 outputs unresolved: waiting confirmation that we spent DELAYED_OUTPUT_TO_US (ac9f8e209aed564dc4808a2395303a15932ef70cd93aa71443552e5602f30259:0) using OUR_DELAYED_RETURN_TO_WALLET
2023-09-08T14:39:56.526Z DEBUG   plugin-spenderp: mfc 34: multiconnect.
2023-09-08T14:39:56.526Z DEBUG   plugin-spenderp: mfc 34, dest 0: connect 03789e3087822b0e5d94c6e5f452131a174d4da580c6a355d9434be6e8f7469438.
2023-09-08T14:39:56.527Z DEBUG   lightningd: Already connected via 127.0.0.1:40056
2023-09-08T14:39:56.653Z DEBUG   0313d313d45ce0b10c406a2d249d7643865451aa8e783923909cb794dbc0b083f0-onchaind-chan#1601: Got new message WIRE_ONCHAIND_DEPTH
2023-09-08T14:39:56.771Z DEBUG   plugin-spenderp: mfc 34, dest 0: connect done.
2023-09-08T14:39:56.771Z DEBUG   plugin-spenderp: mfc 34: multiconnect done.
2023-09-08T14:39:56.922Z DEBUG   0313d313d45ce0b10c406a2d249d7643865451aa8e783923909cb794dbc0b083f0-onchaind-chan#1601: FUNDING_TRANSACTION/FUNDING_OUTPUT->OUR_UNILATERAL depth 2848
2023-09-08T14:39:56.922Z DEBUG   plugin-spenderp: mfc 34: 'parsefeerate' done
2023-09-08T14:39:56.922Z DEBUG   plugin-spenderp: mfc 34: fundpsbt.
2023-09-08T14:45:07.109Z **BROKEN** lightningd: FATAL SIGNAL 11 (version v23.08)
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x56446df10545
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: common/daemon.c:75 (crashdump) 0x56446df106cb
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7fae257e513f
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:320 (init_property) 0x56446e14abe7
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:339 (add_notifier_property) 0x56446e14ac5d
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:584 (tal_add_notifier_) 0x56446e14b51d
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: common/daemon.c:141 (add_steal_notifier) 0x56446df10926
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:248 (notify) 0x56446e14a9e8
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:477 (tal_alloc_) 0x56446e14b13a
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:506 (tal_alloc_arr_) 0x56446e14b237
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: db/bindings.c:529 (db_col_arr_) 0x56446df2e8e4
2023-09-08T14:45:07.109Z **BROKEN** lightningd: backtrace: wallet/wallet.c:253 (wallet_stmt2output) 0x56446deebc9a
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: wallet/wallet.c:596 (wallet_find_utxo) 0x56446deec669
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: wallet/reservation.c:547 (json_fundpsbt) 0x56446deff31f
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:658 (command_exec) 0x56446de97d6a
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:786 (rpc_command_hook_final) 0x56446de98359
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:285 (plugin_hook_call_) 0x56446ded6314
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:874 (plugin_hook_call_rpc_command) 0x56446de98752
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:984 (parse_request) 0x56446de98ced
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:1090 (read_json) 0x56446de991ae
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x56446e138efd
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x56446e139aa7
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x56446e139ae5
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x56446e13bcfb
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x56446de96060
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1332 (main) 0x56446de9cc3e
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x7fae255b4d09
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x56446de67419
2023-09-08T14:45:07.110Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff

Further, CLI commands for everything, including getinfo take longer than a minute with this new release.

We are going to do a chain resync now, but any guidance on how to proceed?

TonyGiorgio commented 1 year ago

For ease, here's some of the code for each of the parts in the stack trace:

https://github.com/ElementsProject/lightning/blob/98ad46795f2dfdfbef0d0df3b5d7a6ba7f142b1e/wallet/wallet.c#L553-L605

https://github.com/ElementsProject/lightning/blob/98ad46795f2dfdfbef0d0df3b5d7a6ba7f142b1e/wallet/wallet.c#L221-L274

https://github.com/ElementsProject/lightning/blob/98ad46795f2dfdfbef0d0df3b5d7a6ba7f142b1e/db/bindings.c#L513-L533

And then it goes into tal_alloc_arr_.

This has not been a problem all year and just surfaced yesterday. There's maybe 1k+ channels it has had. When the lock up happens, it behaves exactly like described in the OP. It eats up RAM slowly until it runs out.

vincenzopalazzo commented 1 year ago

I need to look more at the stacktrace, I will do in a couple of hours, but this is a crash caused by a OOM? or just a a crash?

I am thinking that this can be a bug introduced in this release?

gkrizek commented 1 year ago

@vincenzopalazzo There is no real 'crash'. CLN just completely freezes up and stops responding. Then its memory consumption grows until there is some OOM kill situation.

This was happening on v23.05.1 and v23.08

vincenzopalazzo commented 1 year ago

Thanks to confirm so this is the same bug, putting in the point release queue lets see if we can debug it thanks

TonyGiorgio commented 1 year ago

I've reproduced it and got a fix at https://github.com/ElementsProject/lightning/pull/6657

The temp solution I think is to fund the node with more higher value UTXOs, but we are still doing mantenance on our node so we have not verified it yet.