whitslack opened this issue 2 years ago
First step is not to do dumb things, second step is to optimize dumb things :)
We scan for two reasons: first, when they connect we send them any gossip we made ourselves. We do this in a naive way, by scanning the entire store. Fix: put our own gossip in a separate store file, which is what @cdecker wants anyway for sharing gossip files. This adds some gossmap complexity, however, since it now needs to handle two files.
Second, when they send a gossip_timestamp_filter message, we scan the entire store to see if any match the filter they've given. But it's not unusual to send a dummy filter to say "I don't want anything from you": LND and CLN both use 0xFFFFFFFF for this, so I've optimized that.
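For illustration, here is a minimal sketch of that kind of short-circuit; the struct and function names are invented for the example and are not CLN's actual internals:

/* Hypothetical sketch, not CLN's real code: skip the gossip_store scan
 * entirely when the peer's gossip_timestamp_filter cannot match anything. */
#include <stdbool.h>
#include <stdint.h>

struct gossip_filter {
    uint32_t first_timestamp;   /* start of the requested window */
    uint32_t timestamp_range;   /* length of the requested window */
};

/* LND and CLN send first_timestamp = 0xFFFFFFFF to mean "send me nothing";
 * a zero-length range is likewise trivially empty. */
static bool filter_matches_nothing(const struct gossip_filter *f)
{
    return f->first_timestamp == 0xFFFFFFFF || f->timestamp_range == 0;
}

static void handle_gossip_timestamp_filter(const struct gossip_filter *f)
{
    if (filter_matches_nothing(f))
        return;                 /* no gossip_store scan needed */
    /* ...otherwise scan the store for records inside the window... */
}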
v0.11.2 might be ever so slightly better on this issue, but lightning_connectd is still hovering around 90% CPU utilization.
Is it expected to have a constant parade of
INFO ...-channeld-chan#233342: Peer connection lost
INFO ...-chan#233342: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
and
INFO ...-chan#153417: Peer transient failure in CHANNELD_NORMAL: Reconnected
in the log? I know I have a lot of peers, but the rate of link flapping still seems excessive. And how is "Reconnected" ever the explanation of a "Peer transient failure"? That seems fishy to me.
Also, I have noticed that after some time my node stops accepting incoming connections entirely. I thought it was happening due to a file descriptor mixup while running under Heaptrack, but it happened again even with no Heaptrack in the mix.
These log messages are overzealous. I'm seeing the "Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died" mainly when peers don't respond to pings (which is a message you only get at DEBUG level, but should probably be INFO).
With 21 connected peers, I am seeing 36 of these in 24 hours (yeah, one peer responsible for over half).
The "Reconnected" message is when they reconnect to us and we have an already-live connection, so we abandon the old one in favor of the new; I've seen 3 of these in the same period.
I haven't seen the failure to accept incoming connections! That's weird...
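To make the "Reconnected" case above concrete, here is a simplified sketch of that behaviour; the types and logging are made up for the example, not CLN's actual structures:

/* Illustrative only: a fresh connection from a peer we already have a
 * live connection to replaces the old one and is logged as "Reconnected". */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct connection { int fd; };

struct peer {
    const char *log_prefix;     /* e.g. "...-chan#153417" */
    struct connection *conn;    /* live transport, or NULL */
};

static void connection_close(struct connection *c)
{
    close(c->fd);
    free(c);
}

static void peer_connected(struct peer *peer, struct connection *new_conn)
{
    if (peer->conn) {
        /* Abandon the old connection in favour of the new one. */
        printf("INFO %s: Peer transient failure in CHANNELD_NORMAL: Reconnected\n",
               peer->log_prefix);
        connection_close(peer->conn);
    }
    peer->conn = new_conn;
}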
Quoting @rustyrussell:
I haven't seen the failure to accept incoming connections!
I think it was due to connectd exceeding the file descriptor limit, even though my limit is set at 4096 and I have fewer than 1000 open channels. (I've lost so many channels since upgrading to 0.11.x.)
fd leak? That should show up in ls -l /proc/$(pidof lightning_connectd)/fd. Mine shows 59 after 5 days... I'll check with the debugger tomorrow to see if any are untracked.
@rustyrussell: My lightning_connectd currently has 2341 open FDs.
We should take the discussion of the potential FD leak over to #5353 and leave this issue for the CPU usage.
I'm going to leave this issue open. Though it's mitigated for 0.12 in #5342, I know that CPU usage can be further significantly reduced. @whitslack is most likely to see this (the CPU usage will be on first connect), so I'm leaving it open, earmarked for the next release.
With the pile of recent v23.02 optimizations, this should be better. Please reopen if I'm wrong!
Quoting @rustyrussell:
We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).
If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry. Maybe something similar is at play in lightning_connectd, which presently on my system has consumed 564 minutes of CPU time, far more than the 165 minutes for lightningd itself and 51 minutes for lightning_hsmd. My current daemon has only been running for about 23 hours of wall-clock time, so connectd is still averaging about 41% CPU utilization.
That's fascinating: can you give me a perf report of what's happening?
$ perf record --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report
(I usually use perf report in tui mode, but the man page suggests it will do something useful if redirected!)
Quoting @rustyrussell:
We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).
If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry.
Well, kind of? Premature optimization is the root of all evil, but complexity isn't free either.
$ perf record --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report
@rustyrussell: Assuming you meant pidof lightning_connectd.
I let perf collect data until its perf.data was 4 GB in size. I don't know what to make of the report, but here it is: perf.report
It looks like a lot of overhead in the poll syscall. Are you really not using epoll(7)?
# command ls -Fl "/proc/$(pidof lightning_connectd)/fd" | grep -Fc socket:
1515
Why does connectd own so many sockets? That's more than the number of open channels I have, let alone actually connected channels.
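On the poll overhead: with roughly 1500 sockets, every poll(2) call hands the kernel the full descriptor array to scan, so the per-wakeup cost grows with the number of open sockets, while epoll(7) registers each descriptor once and returns only the ready ones. A bare-bones sketch of the latter, as generic Linux code rather than connectd's actual event loop:

/* Generic epoll skeleton for contrast with poll(2); Linux-only. */
#include <stdio.h>
#include <sys/epoll.h>

int main(void)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return 1; }

    /* Each socket is registered once, e.g. for a descriptor `sock`:
     *   struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
     *   epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);
     * After that it never needs to be passed in again. */

    struct epoll_event ready[64];
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);    /* returns only ready fds */
        if (n < 0) { perror("epoll_wait"); return 1; }
        for (int i = 0; i < n; i++) {
            int fd = ready[i].data.fd;
            (void)fd;           /* ...service this fd here... */
        }
    }
}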
Issue and Steps to Reproduce
Running v0.11.1, lightning_connectd is utilizing 100% of a CPU core. Attaching strace to the process reveals that it is hammering pread64 calls on the gossip_store file. The reads are all very small, most exactly 12 bytes in size, and none in my sample even approached the page size. Why is connectd not memory-mapping this file for performance? Syscalls are to be minimized wherever possible.