Closed: svewa closed this issue 2 years ago
Probably related to #100 . Could you try the tarball in #105 and see if that helps?
It seems to be the process that kicks in once all peer channels are evaluated. With this tarball the bad behavior starts seconds after I load clboss, not only after 25 minutes or so as with the old code, but it's basically the same. It also blocks CLN; at least I can't `getinfo` or `plugin stop .../clboss` while I watch my RAM being eaten in `top`.
In this test my CLN also died. It might be due to clboss, but it might also be for another reason; both just happened when CLN started reacting again. I had no OOM kill and no memory-allocation errors in the log. Commands were given while tailing the log in the background, which gives some hints about timing here:
cl@cl:/tmp/clboss-speed-up-surveyor$ cl plugin start /tmp/clboss-speed-up-surveyor/clboss
2022-05-02T23:30:03.126Z INFO plugin-clboss: clboss 0.11E
2022-05-02T23:30:03.127Z INFO plugin-clboss: ChannelFeeManager: zerobasefee: allow
2022-05-02T23:30:03.257Z INFO plugin-clboss: New block at 734636
2022-05-02T23:30:03.258Z INFO plugin-clboss: DnsSeed: Have 2 seeds.
2022-05-02T23:30:03.258Z INFO plugin-clboss: Started.
[plugin start output with list of plugins etc, successful]
2022-05-02T23:30:03.309Z INFO plugin-clboss: AutoDisconnector: 029542b908ed7e7dbd76f039a53f2614a2e7e9696ef83b5caf872d5aa9ab71e367
2022-05-02T23:30:03.324Z INFO plugin-clboss: ChannelCreationDecider: Onchain amount too low, add 0.117927 or more before we create channels. Will not create channels yet.
2022-05-02T23:30:03.404Z INFO plugin-clboss: InternetConnectionMonitor: online.
2022-05-02T23:30:03.414Z INFO plugin-clboss: ActiveProber: Probe peer 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f.
2022-05-02T23:30:06.033Z INFO plugin-prometheus.py: [{'txid': '11c3da8d95162e764347d58c62ffd9f6ec7184d0660154f47695e2fe60972968', 'output': 0, 'value': 1556512, 'amount_msat': 1556512000msat, 'scriptpubkey': '0014ea2a35b67805fc20d283c59ef6f04ea2fb2bd1cd', 'address': 'bc1qag4rtdncqh7zp55rck00duzw5tajh5wdqvlu9d', 'status': 'confirmed', 'blockheight': 733497, 'reserved': False}, {'txid': '92c3cbfc5fd039d0f360eae500b9f7b4798778e8d2fee35410df5bc07fd91bf5', 'output': 0, 'value': 3488702, 'amount_msat': 3488702000msat, 'scriptpubkey': '0014fdb570d80cb5106f1be7dfa595e26139e01d310d', 'address': 'bc1qlk6hpkqvk5gx7xl8m7jetcnp88sp6vgdnd3xu3', 'status': 'confirmed', 'blockheight': 734304, 'reserved': False}]
2022-05-02T23:30:07.806Z INFO lightningd: Sending 160080841msat over 2 hops to deliver 160000000msat
2022-05-02T23:30:12.050Z INFO 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f-chan#22825: htlc 123 failed from 1th node with code 0x400f (WIRE_INCORRECT_OR_UNKNOWN_PAYMENT_DETAILS)
2022-05-02T23:30:25.940Z INFO plugin-clboss: ActiveProber: Finished probing peer 03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f.
2022-05-02T23:32:15.911Z INFO plugin-clboss: Connector: failed to connect to 0352b2651964338267b8cb003a1ce135ffc28c2f814653836693cc828f1e5bede3@ln1qdftyegevsecyeacevqr588pxhlu9rp0s9r98qmxj0xg9rc7t0k7xc8vyw9.lseed.bitcoinstats.com.:9735
2022-05-02T23:32:15.911Z INFO plugin-clboss: Connector: failed to connect to 02655822e2b03f889c83b5ce23b0951ee077c206a5e588efe331c8197c207dbd1c@ln1qfj4sghzkqlc38yrkh8z8vy4rms80ssx5hjc3mlrx8ypjlpq0k73c8ju7tu.lseed.bitcoinstats.com.:9735
2022-05-02T23:32:46.958Z INFO 0239e4dc85543cd84b712850072dc5e1bf0190c44b64dbe822581013f849a794a4-channeld-chan#23367: Peer connection lost
2022-05-02T23:32:46.958Z INFO 0239e4dc85543cd84b712850072dc5e1bf0190c44b64dbe822581013f849a794a4-chan#23367: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:33:17.409Z INFO 03fd21fdee8e9adc070b53a0bca8685cbc0b97f27b72f364723dcc54b4164ab21b-channeld-chan#22114: Peer connection lost
2022-05-02T23:33:17.409Z INFO 03fd21fdee8e9adc070b53a0bca8685cbc0b97f27b72f364723dcc54b4164ab21b-chan#22114: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:33:26.340Z INFO 02f51d157519ac6ec86595822ee529af53416fc4f8ba7fb8179407dcd91eaafa9a-channeld-chan#22583: Peer connection lost
2022-05-02T23:33:26.341Z INFO 02f51d157519ac6ec86595822ee529af53416fc4f8ba7fb8179407dcd91eaafa9a-chan#22583: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:33:26.522Z INFO 02f51d157519ac6ec86595822ee529af53416fc4f8ba7fb8179407dcd91eaafa9a-chan#22583: Peer transient failure in CHANNELD_NORMAL: Reconnected
cl plugin stop /tmp/clboss-speed-up-surveyor/clboss
2022-05-02T23:37:54.421Z INFO 0337694505123a12a8fadd95523dcc235898ad3b80a06e4a63ca26fed68dd0d17c-channeld-chan#22036: Peer connection lost
2022-05-02T23:37:54.423Z INFO 0337694505123a12a8fadd95523dcc235898ad3b80a06e4a63ca26fed68dd0d17c-chan#22036: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.420Z INFO 02058a29920854e5088ade34a0f89ca9c89464d6d6595c2942eb13047580a1192b-channeld-chan#21972: Peer connection lost
2022-05-02T23:38:53.421Z INFO 02058a29920854e5088ade34a0f89ca9c89464d6d6595c2942eb13047580a1192b-chan#21972: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.423Z INFO 03e8b9a977fa3ae7acce74c25986c7240a921222e349729737df832a1b5ceb49df-channeld-chan#21849: Peer connection lost
2022-05-02T23:38:53.423Z INFO 03e8b9a977fa3ae7acce74c25986c7240a921222e349729737df832a1b5ceb49df-chan#21849: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.423Z INFO 03355e6b094ea452ea33a453503d19581778f784af4558c1ba620a5aec9de0e40c-channeld-chan#22629: Peer connection lost
2022-05-02T23:38:53.424Z INFO 03355e6b094ea452ea33a453503d19581778f784af4558c1ba620a5aec9de0e40c-chan#22629: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.425Z INFO 0337579aadc81356ed7f32587565e4c2a7f8d1561be8cc3bd976cbc6a409a4a71b-channeld-chan#22273: Peer connection lost
2022-05-02T23:38:53.425Z INFO 0337579aadc81356ed7f32587565e4c2a7f8d1561be8cc3bd976cbc6a409a4a71b-chan#22273: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.425Z INFO 02c61e171140fc1aac5d3e0bcfefdc79ea954306dfa29265f02def849dc7e80419-channeld-chan#22411: Peer connection lost
2022-05-02T23:38:53.425Z INFO 02c61e171140fc1aac5d3e0bcfefdc79ea954306dfa29265f02def849dc7e80419-chan#22411: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.425Z INFO 03b323009613777edc4e8024c6726758396bc026da99545971164a6baad31dfc69-channeld-chan#22596: Peer connection lost
2022-05-02T23:38:53.425Z INFO 03b323009613777edc4e8024c6726758396bc026da99545971164a6baad31dfc69-chan#22596: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-02T23:38:53.904Z INFO 03e8b9a977fa3ae7acce74c25986c7240a921222e349729737df832a1b5ceb49df-chan#21849: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:54.128Z INFO 03355e6b094ea452ea33a453503d19581778f784af4558c1ba620a5aec9de0e40c-chan#22629: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:54.332Z INFO 02c61e171140fc1aac5d3e0bcfefdc79ea954306dfa29265f02def849dc7e80419-chan#22411: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:54.562Z INFO 02058a29920854e5088ade34a0f89ca9c89464d6d6595c2942eb13047580a1192b-chan#21972: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:38:56.576Z INFO 03b323009613777edc4e8024c6726758396bc026da99545971164a6baad31dfc69-chan#22596: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-02T23:39:20.997Z INFO plugin-clboss: Killing plugin: stopped by lightningd via RPC
2022-05-02T23:39:20.998Z **BROKEN** lightningd: Already in transaction from "?]
2022-05-02T23:39:21.217Z **BROKEN** lightningd: FATAL SIGNAL 6 (version v0.11.0rc5)
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: common/daemon.c:38 (send_backtrace) 0x563a8d308689
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: common/daemon.c:46 (crashdump) 0x563a8d3086c9
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0 ((null)) 0x7f10c8a82d5f
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ../sysdeps/unix/sysv/linux/raise.c:51 (__GI_raise) 0x7f10c8a82ce1
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ./stdlib/abort.c:79 (__GI_abort) 0x7f10c8a6c536
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/log.c:821 (fatal_vfmt) 0x563a8d2cb5b9
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: wallet/wallet.c:61 (db_fatal) 0x563a8d2f47ac
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: db/exec.c:122 (db_begin_transaction_) 0x563a8d31c830
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:208 (plugin_hook_callback) 0x563a8d2e7b6f
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:134 (plugin_hook_killed) 0x563a8d2e7be7
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x563a8d36607e
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x563a8d366150
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:412 (del_tree) 0x563a8d3661a2
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x563a8d36664f
2022-05-02T23:39:21.217Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:373 (plugin_kill) 0x563a8d2e4cb8
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_control.c:117 (plugin_stop) 0x563a8d2e6f6f
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_control.c:172 (plugin_dynamic_stop) 0x563a8d2e6fbc
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_control.c:230 (json_plugin_control) 0x563a8d2e7420
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:629 (command_exec) 0x563a8d2c5b75
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/jsonrpc.c:764 (rpc_command_hook_final) 0x563a8d2c6fd0
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin_hook.c:209 (plugin_hook_callback) 0x563a8d2e7b53
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:584 (plugin_response_handle) 0x563a8d2e32f9
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:690 (plugin_read_json_one) 0x563a8d2e6117
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/plugin.c:735 (plugin_read_json) 0x563a8d2e62b2
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x563a8d35b533
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x563a8d35b9ba
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x563a8d35ba53
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ccan/ccan/io/poll.c:453 (io_loop) 0x563a8d35d342
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/io_loop_with_timers.c:22 (io_loop_with_timers) 0x563a8d2c5635
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: lightningd/lightningd.c:1181 (main) 0x563a8d2c94c7
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: ../csu/libc-start.c:308 (__libc_start_main) 0x7f10c8a6dd09
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0x563a8d2ad399
2022-05-02T23:39:21.218Z **BROKEN** lightningd: backtrace: (null):0 ((null)) 0xffffffffffffffff
lightning-cli: reading response: socket closed
Might be related to this: https://github.com/ElementsProject/lightning/issues/5107
I usually just restart the entire C-Lightning node rather than just CLBOSS.
okay, then let's ignore the BROKEN... I just switched to loglevel debug and I guess the info gathered might be useful.
I started CLBOSS and tried to stop it after one minute. As CLN did not respond, it took ~13 minutes to actually stop it. Within these 13 minutes there are 48834 instances of
DEBUG lightningd: Calling rpc_command hook of plugin clboss
and as many instances of
DEBUG lightningd: Plugin clboss returned from rpc_command hook call
Analyzing calls and returns gives me a peak of 24147 concurrently open calls, which might explain the blocking/overloaded CLN and memory usage?
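The peak can be found with a single pass over the debug log, incrementing on each hook call and decrementing on each return. A minimal sketch (the matched substrings are from the log lines quoted above; the function name is hypothetical):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One pass over the debug log: +1 for each "Calling rpc_command hook" line,
// -1 for each "returned from rpc_command hook" line; track the maximum.
int peak_open_calls(const std::vector<std::string>& log_lines) {
    int open = 0, peak = 0;
    for (const auto& line : log_lines) {
        if (line.find("Calling rpc_command hook of plugin clboss") != std::string::npos)
            peak = std::max(peak, ++open);
        else if (line.find("Plugin clboss returned from rpc_command hook call") != std::string::npos)
            --open;
    }
    return peak;
}
```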
Also within those 13 minutes there are 26525 `listchannels` calls by clboss. Here are three examples: the most often queried node (106 times) and some less often queried ones.
106 DEBUG plugin-clboss: Rpc out: listchannels {\"source\": \"03864ef025fde8fb587d989186ce6a4a186895ee44a926bfc370e2c366597a3f8f\"}
51 DEBUG plugin-clboss: Rpc out: listchannels {\"source\": \"027ce055380348d7812d2ae7745701c9f93e70c1adeb2657f053f91df4f2843c71\"}
36 DEBUG plugin-clboss: Rpc out: listchannels {\"source\": \"0242c1650f320c43f3600ebf3b5fcaf566760fcd421f35b68fd1e39a31b29ef540\"}
These 26525 queries went to 8203 different nodes, and the most often queried sources seemingly were peers of my node.
Here is a breakdown of the "Rpc out" calls:
1 getinfo
1 listconfigs
1 listpays
1 ping
2 connect
2 fundpsbt
2 listfunds
3 delpay
3 getroute
3 sendpay
3 waitblockheight
3 waitsendpay
16 feerates
241 listpeers
24581 listchannels
Are you using the `clboss-speed-up-surveyor` tarball or not? (v2 or not will be the same if you run C-Lightning >= 0.10.1.)
If it is, then the remaining culprit for `listchannels source` is `FeeModderBySize`. At this point I am tempted to just disconnect that module (you can edit `Boss/Mod/all.cpp` and remove the `all->install<FeeModderBySize>(bus);` line, then recompile). I may need to rethink LN fee modification; by-size is mostly relevant only for the initial setting, really -- by-price-theory will learn the best fees anyway.
If not, then can you please redo with `clboss-speed-up-surveyor`?
Alternately, I could try putting a cap on the number of simultaneous RPC commands in `Boss::Mod::Rpc`. That might be a better general solution, probably by designing a semaphore. Hmmm
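The real `Boss::Mod::Rpc` is asynchronous (Ev-based continuation passing), so the actual fix would park a greenthread rather than block an OS thread, but the accounting of a counting semaphore can be sketched in plain C++ (class and method names hypothetical):

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch: a counting semaphore capping in-flight RPC commands.
class RpcSemaphore {
    std::mutex mtx;
    std::condition_variable cv;
    unsigned available;
public:
    explicit RpcSemaphore(unsigned max_inflight) : available(max_inflight) {}
    void acquire() {                       // call before sending an RPC
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [this] { return available > 0; });
        --available;
    }
    void release() {                       // call when the response arrives
        std::lock_guard<std::mutex> lk(mtx);
        ++available;
        cv.notify_one();
    }
    unsigned free_slots() {                // for inspection/testing only
        std::lock_guard<std::mutex> lk(mtx);
        return available;
    }
};
```

With a cap of, say, 100, the 24147 concurrently open calls observed above would instead queue inside clboss rather than pile up in lightningd.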
I am and was running `clboss-speed-up-surveyor` on CLN v0.11.0rc5, and I don't understand "v2".
I disabled FeeModderBySize, and after 15 minutes it still runs smooth like never before. My main reason for using CLBOSS is that I failed to come up with a proper fee strategy myself, so I was hoping to outsource this, and I believe CLBOSS does a good job there; I hope this does not change by excluding the module. But if I understand it right, it could also be easily migrated to the `destination` way of doing things?
Anyway, thanks for pointing out all.cpp, as this allows me to possibly disable modules. Not so sure about the rebalancing stuff yet, but that is another topic.
I guess it would generally be a good idea to cap the simultaneous RPCs, but if the whole situation of needing so many calls could be avoided, the cap would have little real-world impact anyway.
There is a second tarball in #105, named `clboss-speed-up-surveyor-v2.tar.gz`. The difference from the previous one is that it autodetects the C-Lightning version and uses `destination` if it is >= 0.10.1.
So it looks like even with the surveyor speedup, FeeModderBySize is still a big hammer, so #105 as-is is insufficient.
But if I understand it right, it could also be easily migrated to the `destination` way of doing things?
No, because the hammer in the `PeerCompetitorFeeMonitor::Surveyor` was there because we needed a list of channels to the peer, but before 0.10.1 we only had `source`, which was the opposite direction.
By-Size just cares about a list of channels of peers-of-peers; it does not matter whether `source` or `destination`, it is still processing a ton of requests.
Another idea is to leverage `Ev::memoize`. Presumably the peers-of-peers have some overlapping nodes, and By-Size is not smart enough to reuse the previous call to get the channels of a node. `Ev::memoize` would add that smartness, probably around `compute_capacity`.
ah okay... not that easy then, but what about just getting all the channels and parsing them at once for all peers-of-peers?
$ time cl listchannels > /dev/null
real 0m2.067s
user 0m0.664s
sys 0m0.048s
cl@cl:~$
A one-time 2s hit once in a while is certainly easier on RAM, CPU and whatnot...
Hmm, assuming the C-Lightning slowdown is due to command processing overhead rather than pushing bytes around, yes, that seems like a better idea. I wonder how badly this will work if the network becomes much much larger though....
For now it would remove the self size from the equation; small nodes with 10 channels might pay more while bigger nodes would still be able to work, but this is already a trade-off. I'd just assume that by the time the network grows 10x maybe we get better RPC methods, perhaps supplying a list of sources/destinations? I hear there is a general filtering framework coming to CLN, but no clue whether this would be included already; likely not.
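The proposed one-shot approach boils down to indexing a single full `listchannels` dump by source, instead of issuing one RPC per node. A sketch under assumptions (the `Channel` struct is a trimmed-down stand-in for the real JSON entries):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// One entry from a single full `listchannels` dump (fields trimmed down).
struct Channel {
    std::string source;
    std::string destination;
    long long amount_msat;
};

// Build a source -> channels index in one pass, so per-node lookups become
// in-memory map accesses instead of thousands of
// `listchannels source=...` RPC calls.
std::unordered_map<std::string, std::vector<Channel>>
index_by_source(const std::vector<Channel>& all_channels) {
    std::unordered_map<std::string, std::vector<Channel>> index;
    for (const auto& c : all_channels)
        index[c.source].push_back(c);
    return index;
}
```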
Hmm, assuming the C-Lightning slowdown is due to command processing overhead rather than pushing bytes around, yes, that seems like a better idea. I wonder how badly this will work if the network becomes much much larger though....
The solution here is implementing batch search; I'm working on it! We can tell c-lightning that we need the information for a list of nodes, so we make one call for each search operation!
Keeping the channels in memory is not a good design, because how do you know when a channel updates some info like the fee? That would require making a call every X time, without being sure you are working with an up-to-date network. Of course we can also apply probability here.
Your effort is part of the filtering and pagination feature planned for CLN, I assume? Good to hear. So we don't need to worry now about the LN getting too big if it's already in the making...
About keeping it in memory... it could be read once, kept in memory, and updated when updates are received via gossip. But with efficient searching (as I wished for and you seem to be working on), that would just duplicate data between CLN and CLBOSS while it is already efficiently available to CLBOSS and properly maintained by CLN.
I don't see the benefit in doing all these "updates"; however, it can work as a hack, yes!
But with efficient searching

There is no efficient searching in an unordered map! The best that you can do, as we discussed in the other part, is O(N), and this cost is achievable with the batch searching implementation.
I saw you implemented the rate limiting, so I'm running the tarball now with the rate-limiting patches and FeeModderBySize enabled again. It uses quite some CPU while FeeModderBySize is running, but CLN stays responsive and memory consumption is looking good/stable, too.
I still believe parsing all channels at once would be way easier on resources at the current network size, at least for bigger nodes, but I'm quite happy already. And as discussed, iterating all channels might not be desirable long term. Thanks.
okay, might include this one, too :) thanks.
Yes, but note that v3 has the performance bug. I will prepare a v4 tomorrow (maybe 12-16 hours from now) that includes the JSON parsing fixes in #119 . While #119 seems ok for merging, I want to let it stew in my mind a little more in case there is something I missed, so I will wait till tomorrow, then if nothing comes to mind, merge it, then rebase #105 and build a v4 tarball.
clboss latest git, but I had the same issue with 0.11? before.
I run CLN on a node with ~190 peers. I have 20 GB RAM (including 8 GB swap). In normal operation without clboss, the system uses about 0.8 GB for CLN and a few plugins.
After clboss is finished with
plugin-clboss: PeerCompetitorFeeMonitor: Surveying
I run out of memory:
2022-05-02T17:48:10.672Z **BROKEN** 030bec0fb4a5dd5c2dffa5f008210a9b7e0689ee4bf7ae15a7e52491c65fbf4ca5-chan#22195: Could not subdaemon channel: Cannot allocate memory
`top` shows clightning using all the RAM and 100% CPU, sometimes topology also a lot.