ZmnSCPxj / clboss

Automated Core Lightning Node Manager
MIT License

clboss makes CLN use a very large amount of RAM #116

Closed. svewa closed this issue 2 years ago.

svewa commented 2 years ago

clboss latest git, but I had the same issue before with 0.11 (I think).

I run CLN on a node with ~190 peers. I have 20 GB RAM (including 8 GB swap). In normal operation, without clboss, the system uses about 0.8 GB for CLN and a few plugins.

After clboss is finished with plugin-clboss: PeerCompetitorFeeMonitor: Surveying, I run out of memory.

2022-05-02T17:48:10.672Z **BROKEN** 030bec0fb4a5dd5c2dffa5f008210a9b7e0689ee4bf7ae15a7e52491c65fbf4ca5-chan#22195: Could not subdaemon channel: Cannot allocate memory

top shows C-Lightning using all the RAM and 100% CPU; sometimes the topology plugin also uses a lot.

ZmnSCPxj commented 2 years ago

Probably related to #100. Could you try the tarball in #105 and see if that helps?

svewa commented 2 years ago

It seems to be the process that kicks in once all peers' channels are evaluated. With this tarball the bad behavior starts seconds after I load clboss, not only after 25 minutes or so as with the old code. But it's basically the same. It also blocks CLN; at least I can't getinfo or plugin stop .../clboss while I watch my RAM being eaten in top.

svewa commented 2 years ago

With this test my CLN also died. It might be due to clboss, but it might also be for another reason, and both just happened when CLN started reacting again. I had no OOM kill and no memory allocation errors in the log. Commands were issued while tailing the log in the background, which gives some hints about the timing here:

cl@cl:/tmp/clboss-speed-up-surveyor$ cl plugin start /tmp/clboss-speed-up-surveyor/clboss

2022-05-02T23:30:03.126Z INFO    plugin-clboss: clboss 0.11E
2022-05-02T23:30:03.127Z INFO    plugin-clboss: ChannelFeeManager: zerobasefee: allow
2022-05-02T23:30:03.257Z INFO    plugin-clboss: New block at 734636
2022-05-02T23:30:03.258Z INFO    plugin-clboss: DnsSeed: Have 2 seeds.
2022-05-02T23:30:03.258Z INFO    plugin-clboss: Started.

[plugin start output with list of plugins etc, successful]


2022-05-02T23:30:03.309Z INFO    plugin-clboss: AutoDisconnector: 029542b908ed7e7dbd76f039a53f2614a2e7e9696ef83b5caf872d5aa9ab71e367
2022-05-02T23:30:03.324Z INFO    plugin-clboss: ChannelCreationDecider: Onchain amount too low, add 0.117927 or more before we create channels. Will not create channels yet.
2022-05-02T23:30:03.404Z INFO    plugin-clboss: InternetConnectionMonitor: online.
2022-05-02T23:30:03.414Z INFO    plugin-clboss: ActiveProber: Probe peer 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f.
2022-05-02T23:30:06.033Z INFO    plugin-prometheus.py: [{'txid': '11c3da8d95162e764347d58c62ffd9f6ec7184d0660154f47695e2fe60972968', 'output': 0, 'value': 1556512, 'amount_msat': 1556512000msat, 'scriptpubkey': '0014ea2a35b67805fc20d283c59ef6f04ea2fb2bd1cd', 'address': 'bc1qag4rtdncqh7zp55rck00duzw5tajh5wdqvlu9d', 'status': 'confirmed', 'blockheight': 733497, 'reserved': False}, {'txid': '92c3cbfc5fd039d0f360eae500b9f7b4798778e8d2fee35410df5bc07fd91bf5', 'output': 0, 'value': 3488702, 'amount_msat': 3488702000msat, 'scriptpubkey': '0014fdb570d80cb5106f1be7dfa595e26139e01d310d', 'address': 'bc1qlk6hpkqvk5gx7xl8m7jetcnp88sp6vgdnd3xu3', 'status': 'confirmed', 'blockheight': 734304, 'reserved': False}]
2022-05-02T23:30:07.806Z INFO    lightningd: Sending 160080841msat over 2 hops to deliver 160000000msat
2022-05-02T23:30:12.050Z INFO    03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-chan#22825: htlc 123 failed from 1th node with code 0x400f (WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS)
2022-05-02T23:30:25.940Z INFO    plugin-clboss: ActiveProber: Finished probing peer 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f.
2022-05-02T23:32:15.911Z INFO    plugin-clboss: Connector: failed to connect to 0352b2651964338267b8cb003a1ce135ffc28c2f814653836693cc828f1e5bede3@ln1qdftyegevsecyeacevqr588pxhlu9rp0s9r98qmxj0xg9rc7t0k7xc8vyw9.lseed.bitcoinstats.com.:9735
2022-05-02T23:32:15.911Z INFO    plugin-clboss: Connector: failed to connect to 02655822e2b03f889c83b5ce23b0951ee077c206a5e588efe331c8197c207dbd1c@ln1qfj4sghzkqlc38yrkh8z8vy4rms80ssx5hjc3mlrx8ypjlpq0k73c8ju7tu.lseed.bitcoinstats.com.:9735
2022-05-02T23:32:46.958Z INFO    0239e4dc85543cd84b712850072dc5e1bf0190c44b64dbe822581013f849a794a4-channeld-chan#23367: Peer connection lost
2022-05-02T23:32:46.958Z INFO    0239e4dc85543cd84b712850072dc5e1bf0190c44b64dbe822581013f849a794a4-chan#23367: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:33:17.409Z INFO    03fd21fdee8e9adc070b53a0bca8685cbc0b97f27b72f364723dcc54b4164ab21b-channeld-chan#22114: Peer connection lost
2022-05-02T23:33:17.409Z INFO    03fd21fdee8e9adc070b53a0bca8685cbc0b97f27b72f364723dcc54b4164ab21b-chan#22114: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:33:26.340Z INFO    02f51d157519ac6ec86595822ee529af53416fc4f8ba7fb8179407dcd91eaafa9a-channeld-chan#22583: Peer connection lost
2022-05-02T23:33:26.341Z INFO    02f51d157519ac6ec86595822ee529af53416fc4f8ba7fb8179407dcd91eaafa9a-chan#22583: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:33:26.522Z INFO    02f51d157519ac6ec86595822ee529af53416fc4f8ba7fb8179407dcd91eaafa9a-chan#22583: Peer transient failure in CHANNELD_NORMAL: Reconnected

cl plugin stop /tmp/clboss-speed-up-surveyor/clboss

2022-05-02T23:37:54.421Z INFO    0337694505123a12a8fadd95523dcc235898ad3b80a06e4a63ca26fed68dd0d17c-channeld-chan#22036: Peer connection lost
2022-05-02T23:37:54.423Z INFO    0337694505123a12a8fadd95523dcc235898ad3b80a06e4a63ca26fed68dd0d17c-chan#22036: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.420Z INFO    02058a29920854e5088ade34a0f89ca9c89464d6d6595c2942eb13047580a1192b-channeld-chan#21972: Peer connection lost
2022-05-02T23:38:53.421Z INFO    02058a29920854e5088ade34a0f89ca9c89464d6d6595c2942eb13047580a1192b-chan#21972: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.423Z INFO    03e8b9a977fa3ae7acce74c25986c7240a921222e349729737df832a1b5ceb49df-channeld-chan#21849: Peer connection lost
2022-05-02T23:38:53.423Z INFO    03e8b9a977fa3ae7acce74c25986c7240a921222e349729737df832a1b5ceb49df-chan#21849: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.423Z INFO    03355e6b094ea452ea33a453503d19581778f784af4558c1ba620a5aec9de0e40c-channeld-chan#22629: Peer connection lost
2022-05-02T23:38:53.424Z INFO    03355e6b094ea452ea33a453503d19581778f784af4558c1ba620a5aec9de0e40c-chan#22629: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.425Z INFO    0337579aadc81356ed7f32587565e4c2a7f8d1561be8cc3bd976cbc6a409a4a71b-channeld-chan#22273: Peer connection lost
2022-05-02T23:38:53.425Z INFO    0337579aadc81356ed7f32587565e4c2a7f8d1561be8cc3bd976cbc6a409a4a71b-chan#22273: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.425Z INFO    02c61e171140fc1aac5d3e0bcfefdc79ea954306dfa29265f02def849dc7e80419-channeld-chan#22411: Peer connection lost
2022-05-02T23:38:53.425Z INFO    02c61e171140fc1aac5d3e0bcfefdc79ea954306dfa29265f02def849dc7e80419-chan#22411: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.425Z INFO    03b323009613777edc4e8024c6726758396bc026da99545971164a6baad31dfc69-channeld-chan#22596: Peer connection lost
2022-05-02T23:38:53.425Z INFO    03b323009613777edc4e8024c6726758396bc026da99545971164a6baad31dfc69-chan#22596: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.904Z INFO    03e8b9a977fa3ae7acce74c25986c7240a921222e349729737df832a1b5ceb49df-chan#21849: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:54.128Z INFO    03355e6b094ea452ea33a453503d19581778f784af4558c1ba620a5aec9de0e40c-chan#22629: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:54.332Z INFO    02c61e171140fc1aac5d3e0bcfefdc79ea954306dfa29265f02def849dc7e80419-chan#22411: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:54.562Z INFO    02058a29920854e5088ade34a0f89ca9c89464d6d6595c2942eb13047580a1192b-chan#21972: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:56.576Z INFO    03b323009613777edc4e8024c6726758396bc026da99545971164a6baad31dfc69-chan#22596: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:39:20.997Z INFO    plugin-clboss: Killing plugin: stopped by lightningd via RPC
2022-05-02T23:39:20.998Z **BROKEN** lightningd: Already in transaction from "?]
2022-05-02T23:39:21.217Z **BROKEN** lightningd: FATAL SIGNAL 6 (version v0.11.0rc5)
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x563a8d308689
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x563a8d3086c9
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 ((null)) 0x7f10c8a82d5f
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ../sysdeps/unix/sysv/linux/raise.c:51 (__GI_raise) 0x7f10c8a82ce1
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ./stdlib/abort.c:79 (__GI_abort) 0x7f10c8a6c536
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/log.c:821 (fatal_vfmt) 0x563a8d2cb5b9
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: wallet/wallet.c:61 (db_fatal) 0x563a8d2f47ac
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: db/exec.c:122 (db_begin_transaction_) 0x563a8d31c830
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:208 (plugin_hook_callback) 0x563a8d2e7b6f
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:134 (plugin_hook_killed) 0x563a8d2e7be7
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x563a8d36607e
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x563a8d366150
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:412 (del_tree) 0x563a8d3661a2
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x563a8d36664f
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:373 (plugin_kill) 0x563a8d2e4cb8
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_control.c:117 (plugin_stop) 0x563a8d2e6f6f
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_control.c:172 (plugin_dynamic_stop) 0x563a8d2e6fbc
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_control.c:230 (json_plugin_control) 0x563a8d2e7420
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:629 (command_exec) 0x563a8d2c5b75
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:764 (rpc_command_hook_final) 0x563a8d2c6fd0
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:209 (plugin_hook_callback) 0x563a8d2e7b53
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:584 (plugin_response_handle) 0x563a8d2e32f9
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:690 (plugin_read_json_one) 0x563a8d2e6117
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:735 (plugin_read_json) 0x563a8d2e62b2
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x563a8d35b533
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x563a8d35b9ba
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x563a8d35ba53
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x563a8d35d342
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x563a8d2c5635
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1181 (main) 0x563a8d2c94c7
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:308 (__libc_start_main) 0x7f10c8a6dd09
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x563a8d2ad399
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
lightning-cli: reading response: socket closed
ZmnSCPxj commented 2 years ago

Might be related to this: https://github.com/ElementsProject/lightning/issues/5107

I usually just restart the entire C-Lightning node rather than just CLBOSS.

svewa commented 2 years ago

Okay, then let's ignore the BROKEN... I just switched to loglevel debug, and I guess the info gathered might be useful.

I started CLBOSS and tried to stop it after one minute. As CLN did not respond, it took ~13 minutes to actually stop it. Within these 13 minutes there are 48834 instances of DEBUG lightningd: Calling rpc_command hook of plugin clboss and just as many instances of DEBUG lightningd: Plugin clboss returned from rpc_command hook call.

Analyzing the calls and returns gives me a peak of 24147 concurrently open calls, which might explain the blocked/overloaded CLN and the memory usage.
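For reference, the peak can be computed from the debug log with a small helper like this (a rough sketch, not part of clboss; it assumes the log is piped on stdin and simply counts the two DEBUG lines quoted above):

// Sketch: count +1 for each rpc_command hook call and -1 for each return,
// tracking the highest number of calls open at the same time.
#include <algorithm>
#include <iostream>
#include <string>

int main() {
    long open = 0, peak = 0;
    std::string line;
    while (std::getline(std::cin, line)) {
        if (line.find("Calling rpc_command hook of plugin clboss") != std::string::npos)
            ++open;
        else if (line.find("Plugin clboss returned from rpc_command hook call") != std::string::npos)
            --open;
        peak = std::max(peak, open);
    }
    std::cout << "peak concurrently open calls: " << peak << "\n";
}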

Also within those 13 minutes there are 26525 listchannels calls by clboss. Here are three examples: the most frequently queried node (106 times) and two less frequently queried ones.

    106 DEBUG   plugin-clboss: Rpc out: listchannels {\"source\": \"03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f\"}
     51 DEBUG   plugin-clboss: Rpc out: listchannels {\"source\": \"027ce055380348d7812d2ae7745701c9f93e70c1adeb2657f053f91df4f2843c71\"}
     36 DEBUG   plugin-clboss: Rpc out: listchannels {\"source\": \"0242c1650f320c43f3600ebf3b5fcaf566760fcd421f35b68fd1e39a31b29ef540\"}

These 26525 queries went to 8203 different nodes, and the most frequently queried sources were seemingly peers of my node.

Here is a breakdown of the "Rpc out" calls:

      1 getinfo
      1 listconfigs
      1 listpays
      1 ping
      2 connect
      2 fundpsbt
      2 listfunds
      3 delpay
      3 getroute
      3 sendpay
      3 waitblockheight
      3 waitsendpay
     16 feerates
    241 listpeers
  24581 listchannels
ZmnSCPxj commented 2 years ago

Are you using the clboss-speed-up-surveyor tarball or not? (v2 or not makes no difference if you run C-Lightning >= 0.10.1.)

If it is, then the remaining culprit for listchannels by source is FeeModderBySize. At this point I am tempted to just disconnect that module (you can edit Boss/Mod/all.cpp and remove the all->install<FeeModderBySize>(bus); line, then recompile). I may need to rethink LN fee modification, and by-size is mostly relevant only for the initial setting, really -- by-price-theory will learn the best fees anyway.
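Roughly, the edit is just this (a sketch; the surrounding install lines here are only placeholders for the actual contents of all.cpp):

// Boss/Mod/all.cpp (sketch): comment out the FeeModderBySize installation
// to disable the module, then rebuild clboss.
all->install<FeeModderByPriceTheory>(bus);   // placeholder neighbouring module
// all->install<FeeModderBySize>(bus);       // disabled: floods lightningd with listchannels
all->install<ChannelFeeSetter>(bus);         // placeholder neighbouring module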

If not, then can you please redo with clboss-speed-up-surveyor?

Alternatively, I could try putting a cap on the number of simultaneous RPC commands in Boss::Mod::Rpc. That might be a better general solution, probably by designing a semaphore. Hmmm
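Something like this, very roughly (a minimal sketch, using a plain C++20 std::counting_semaphore in place of the actual Ev-based machinery in Boss::Mod::Rpc; the class and function names here are just illustrative):

// Sketch: allow at most max_in_flight RPC commands to be pending at once.
#include <functional>
#include <semaphore>
#include <string>

class RateLimitedRpc {
public:
    explicit RateLimitedRpc(std::ptrdiff_t max_in_flight) : slots(max_in_flight) {}

    // Waits for a free slot, forwards the request, then releases the slot.
    std::string command(std::function<std::string(std::string const&)> const& send,
                        std::string const& request) {
        slots.acquire();                     // blocks while too many commands are in flight
        try {
            auto response = send(request);   // hand off to the real RPC transport
            slots.release();
            return response;
        } catch (...) {
            slots.release();                 // never leak a slot on error
            throw;
        }
    }

private:
    std::counting_semaphore<> slots;         // remaining in-flight slots
};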

svewa commented 2 years ago

I am and was running clboss-speed-up-surveyor on CLN v0.11.0rc5, and I don't understand what you mean by v2.

I disabled FeeModderBySize, and after 15 minutes it still runs more smoothly than ever before. My main reason for using CLBOSS is that I failed at coming up with a proper fee strategy myself, so I was hoping to outsource this, and I believe CLBOSS does a good job there; I hope this does not change by excluding the module. But if I understand it right, it could also be easily migrated to the destination way of doing things? Anyway, thanks for pointing out all.cpp, as this allows me to disable modules if needed. I'm not so sure about the rebalancing stuff yet, but that is another topic.

I guess it would generally be a good idea to cap the simultaneous RPCs, but if the need for so many calls could be avoided altogether, the cap would have little real-world impact anyway.

ZmnSCPxj commented 2 years ago

There is a second tarball in #105, named clboss-speed-up-surveyor-v2.tar.gz. The difference from the previous one is that it autodetects the C-Lightning version and uses destination if it is >= 0.10.1.

So it looks like even with the surveyor speedup, FeeModderBySize is still a big hammer, so #105 as-is is insufficient.

But if I understand it right, it could also be easily migrated to the destination way of doing things?

No, because the hammer in PeerCompetitorFeeMonitor::Surveyor was there because we needed a list of channels to the peer, but before 0.10.1 we only had source, which is the opposite direction.

By-Size just cares about a list of channels of peers-of-peers; it does not matter whether we use source or destination, it still ends up making a ton of requests.

Another idea is to leverage Ev::memoize. Presumably the peers-of-peers have some overlapping nodes, and By-Size is not smart enough to reuse the previous call to get the channels of the node. Ev::memoize would add that smartness. Probably around compute_capacity.
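Roughly the idea (a sketch of plain memoization only, not the actual Ev::memoize interface; the query callback and node-id type are placeholders):

// Sketch: cache per-node listchannels results so overlapping peers-of-peers
// are only queried once.
#include <functional>
#include <map>
#include <memory>
#include <string>

std::function<std::string(std::string const&)>
memoize_channels(std::function<std::string(std::string const&)> query) {
    auto cache = std::make_shared<std::map<std::string, std::string>>();
    return [cache, query](std::string const& node_id) {
        auto it = cache->find(node_id);
        if (it != cache->end())
            return it->second;            // reuse the earlier result
        auto result = query(node_id);     // first time: perform the real RPC
        cache->emplace(node_id, result);
        return result;
    };
}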

svewa commented 2 years ago

Ah okay... not that easy then. But what about just getting all the channels and parsing them at once for all peers-of-peers?

$ time cl listchannels > /dev/null 

real    0m2.067s
user    0m0.664s
sys 0m0.048s
cl@cl:~$ 

A one-time 2 s hit every once in a while is certainly easier on RAM, CPU, and whatnot...
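Roughly what I have in mind (a hypothetical sketch, not clboss code, assuming nlohmann/json and the source/destination fields of the listchannels output): fetch the whole dump once and index it by source node, so every later per-node lookup is just an in-memory map access.

// Sketch: build an index of the full listchannels result keyed by source
// node, replacing per-node listchannels RPC calls with map lookups.
#include <cstdint>
#include <map>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>   // assumes nlohmann/json is available

struct ChannelInfo {
    std::string destination;
    std::uint64_t fee_per_millionth = 0;
};

// channels_json is the raw JSON returned by a single listchannels call.
std::map<std::string, std::vector<ChannelInfo>>
index_by_source(std::string const& channels_json) {
    auto parsed = nlohmann::json::parse(channels_json);
    std::map<std::string, std::vector<ChannelInfo>> by_source;
    for (auto const& c : parsed["channels"]) {
        ChannelInfo info;
        info.destination = c.value("destination", std::string());
        info.fee_per_millionth = c.value("fee_per_millionth", std::uint64_t(0));
        by_source[c.value("source", std::string())].push_back(info);
    }
    return by_source;
}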

ZmnSCPxj commented 2 years ago

Hmm, assuming the C-Lightning slowdown is due to command processing overhead rather than pushing bytes around, yes, that seems like a better idea. I wonder how badly this will work if the network becomes much, much larger, though...

svewa commented 2 years ago

For now it would take the node's own size out of the equation: small nodes with 10 channels might pay a bit more, but bigger nodes would still be able to work, so it is already a trade-off. I'd just assume that by the time the network grows 10x we may have better RPC methods, perhaps ones that accept a list of sources/destinations. I hear there is a general filtering framework coming to CLN, but I have no clue whether this would already be covered by it; likely not.

vincenzopalazzo commented 2 years ago

Hmm, assuming the C-Lightning slowdown is due to command processing overhead rather than pushing bytes around, yes, that seems like a better idea. I wonder how badly this will work if the network becomes much much larger though....

The solution here is implementing batch search, and I'm working on it! We can tell c-lightning that we need the information for a list of nodes, so we make one call per search operation!

Keeping the channels in memory is not a good design, because how do you know when a channel updates some info like the fee? This would require making a call every X time, without being sure you are working with an up-to-date network. Of course we could apply probability here as well.

svewa commented 2 years ago

Your effort is part of the filtering and pagination feature planned for CLN, I assume? Good to hear. So we don't need to worry now about LN getting too big if it's already in the making...

About keeping it in memory... it could be read once, kept in memory, and updated whenever updates are received via gossip. But with efficient searching (which I wished for and you seem to be working on), that would just duplicate in CLBOSS data that is already efficiently available to CLBOSS and properly maintained by CLN.

vincenzopalazzo commented 2 years ago

I don't see the benefit in doing all these "updates"; however, it could work as a hack, yes!

But with efficient searching

There is no efficient searching in an unordered map! The best that you can do, as we discussed in the other thread, is O(N), and that cost is achievable with the batch searching being implemented in c-lightning.

svewa commented 2 years ago

I saw you implemented the rate limiting, so I'm now running the tarball with the rate-limiting patches and FeeModderBySize enabled again. It uses quite some CPU while FeeModderBySize is running, but CLN stays responsive and memory consumption looks good/stable, too.

I still believe that parsing all channels at once is much easier on resources at the current network size, at least for bigger nodes, but I'm quite happy already. And as discussed, iterating all channels might not be desirable long term. Thanks.

ZmnSCPxj commented 2 years ago

#105 has code that includes parsing all the channels, but I hit a performance bug in the JSON parser, which is fixed in #119.

svewa commented 2 years ago

Okay, I might include that one, too :) Thanks.

ZmnSCPxj commented 2 years ago

Yes, but note that v3 has the performance bug. I will prepare a v4 tomorrow (maybe 12-16 hours from now) that includes the JSON parsing fixes in #119. While #119 seems OK for merging, I want to let it stew in my mind a little more in case there is something I missed, so I will wait till tomorrow; then, if nothing comes to mind, I will merge it, rebase #105, and build a v4 tarball.