ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

connectd pegging 100% CPU #5301

Open whitslack opened 2 years ago

whitslack commented 2 years ago

Issue and Steps to Reproduce

Running v0.11.1, lightning_connectd is utilizing 100% of a CPU core. Attaching strace to the process reveals that it is hammering the gossip_store file with pread64 calls. The reads are all very small, most exactly 12 bytes, and none in my sample even approached the page size. Why is connectd not memory-mapping this file for performance? Syscalls should be minimized wherever possible.
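
For illustration only (this is not CLN's actual gossip code; the file handling shown here is an assumption): memory-mapping the store once and walking records out of the mapping would replace the per-record pread64 calls with plain memory accesses served from the page cache. A minimal sketch:

/* Hypothetical sketch, not CLN's gossip_store reader: map the file
 * read-only once, then parse record headers from memory instead of
 * issuing a tiny pread64 per record. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("gossip_store", O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct stat st;
	if (fstat(fd, &st) != 0) {
		perror("fstat");
		return 1;
	}

	/* One syscall maps the whole file; subsequent reads of record
	 * headers and bodies are ordinary memory accesses. */
	const uint8_t *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... walk records in map[0 .. st.st_size) here ... */

	munmap((void *)map, st.st_size);
	close(fd);
	return 0;
}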

rustyrussell commented 2 years ago

First step is not to do dumb things, second step is to optimize dumb things :)

We scan for two reasons: first, when a peer connects we send them any gossip we made ourselves. We do this in a naive way, by scanning the entire store. Fix: put our own gossip in a separate store file, which is what @cdecker wants anyway for sharing gossip files. This adds some complexity to gossmap, however, which would then need to handle two files.

Second, when they send a gossip_timestamp_filter message, we scan the entire store to see if any match the filter they've given. But it's not unusual to send a dummy filter to say "I don't want anything from you": LND and CLN both use 0xFFFFFFFF for this, so I've optimized that.
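
A minimal sketch of that short-circuit, with illustrative names rather than CLN's actual internals (the fields assume the gossip_timestamp_filter message's first_timestamp/timestamp_range layout):

/* Illustrative only: skip the full gossip_store scan when the peer's
 * filter can never match anything, e.g. the "send me nothing"
 * convention of first_timestamp = 0xFFFFFFFF. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct timestamp_filter {
	uint32_t first_timestamp;
	uint32_t timestamp_range;
};

static bool filter_matches_nothing(const struct timestamp_filter *f)
{
	/* Dummy filter: starts beyond any plausible timestamp. */
	if (f->first_timestamp == 0xFFFFFFFF)
		return true;
	/* An empty range can't match anything either. */
	if (f->timestamp_range == 0)
		return true;
	return false;
}

static void handle_gossip_timestamp_filter(const struct timestamp_filter *f)
{
	if (filter_matches_nothing(f)) {
		printf("dummy filter: skipping gossip_store scan\n");
		return;
	}
	/* ... otherwise scan the store for matching records ... */
}

int main(void)
{
	struct timestamp_filter dummy = { .first_timestamp = 0xFFFFFFFF,
					  .timestamp_range = 0xFFFFFFFF };
	handle_gossip_timestamp_filter(&dummy);	/* returns without scanning */
	return 0;
}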

whitslack commented 2 years ago

v0.11.2 might be ever so slightly better on this issue, but lightning_connectd is still hovering around 90% CPU utilization.

Is it expected to have a constant parade of

INFO    ...-channeld-chan#233342: Peer connection lost
INFO    ...-chan#233342: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)

and

INFO    ...-chan#153417: Peer transient failure in CHANNELD_NORMAL: Reconnected

in the log? I know I have a lot of peers, but the rate of link flapping still seems excessive. And how is "Reconnected" ever the explanation of a "Peer transient failure"? That seems fishy to me.

Also, I have noticed that after some time my node stops accepting incoming connections entirely. I thought it was happening due to a file descriptor mixup while running under Heaptrack, but it happened again even with no Heaptrack in the mix.

rustyrussell commented 2 years ago

These log messages are overzealous. I'm seeing the "Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died" mainly when peers don't respond to pings (the ping timeout is a message you only get at DEBUG level, but it should probably be INFO).

With 21 connected peers, I am seeing 36 of these in 24 hours (yeah, one peer responsible for over half).

The "Reconnected" message is when they reconnect to us and we have an already-live connection, so we abandon the old one in favor of the new; I've seen 3 of these in the same period.

I haven't seen the failure to accept incoming connections! That's weird...

whitslack commented 2 years ago

I haven't seen the failure to accept incoming connections!

I think it was due to connectd exceeding the file descriptor limit, even though my limit is set at 4096 and I have fewer than 1000 open channels. (I've lost so many channels since upgrading to 0.11.x.)

rustyrussell commented 2 years ago

fd leak? That should show up in ls -l /proc/$(pidof lightning_connectd)/fd. Mine shows 59 after 5 days... I'll check with the debugger tomorrow to see if any are untracked.

whitslack commented 2 years ago

@rustyrussell: My lightning_connectd currently has 2341 open FDs.

whitslack commented 2 years ago

We should take the discussion of the potential FD leak over to #5353 and leave this issue for the CPU usage.

rustyrussell commented 2 years ago

I'm going to leave this issue open. Though it's mitigated for 0.12 in #5342, I know that CPU usage can be reduced significantly further. @whitslack is most likely to see this (the CPU usage will show up on first connect), so I'm leaving it open, earmarked for the next release.

rustyrussell commented 1 year ago

With the pile of recent v23.02 optimizations, this should be better. Please reopen if I'm wrong!

whitslack commented 1 year ago

Quoting @rustyrussell:

We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).

If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry. Maybe something similar is at play in lightning_connectd, which presently on my system has consumed 564 minutes of CPU time, far more than the 165 minutes for lightningd itself and 51 minutes for lightning_hsmd. My current daemon has only been running for about 23 hours of wall-clock time, so connectd is still averaging about 41% CPU utilization.

rustyrussell commented 1 year ago

That's fascinating: can you give me a perf report of what's happening?

$ perf record  --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report

(I usually use perf report in tui mode, but the man page suggests it will do something useful if redirected!)

rustyrussell commented 1 year ago

Quoting @rustyrussell:

We do the dumbest possible thing here, which is to retry in commit_time_msec (10 by default).

If this ethos is prevalent throughout the codebase, then it starts to make sense why CLN is so CPU hungry.

Well, kind of? Premature optimization is the root of all evil, but complexity isn't free either.

whitslack commented 1 year ago
$ perf record  --call-graph dwarf -p `pidof lightning_channeld`
<wait for a while then ^C>
$ perf report > perf.report

@rustyrussell: Assuming you meant pidof lightning_connectd.

I let perf collect data until its perf.data was 4 GB in size.

I don't know what to make of the report, but here it is: perf.report

It looks like a lot of overhead in the poll syscall. Are you really not using epoll(7)?
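
For comparison, a generic epoll(7) loop looks like the sketch below. This is not connectd's actual event loop, just an illustration of the pattern: interest is registered once per fd, and each wakeup returns only the descriptors that are ready, rather than rescanning the whole fd set as poll() does.

/* Generic epoll sketch; fd 0 (stdin) stands in for a peer socket. */
#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

int main(void)
{
	int epfd = epoll_create1(0);
	if (epfd < 0) {
		perror("epoll_create1");
		return 1;
	}

	/* Register interest in a fd once, up front. */
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = 0 };
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, 0, &ev) != 0) {
		perror("epoll_ctl");
		return 1;
	}

	struct epoll_event events[MAX_EVENTS];
	for (;;) {
		/* Blocks until at least one registered fd is ready;
		 * only ready fds are returned. */
		int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
		if (n < 0) {
			perror("epoll_wait");
			return 1;
		}
		for (int i = 0; i < n; i++) {
			/* ... read/write on events[i].data.fd ... */
		}
	}
}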


# command ls -Fl "/proc/$(pidof lightning_connectd)/fd" | grep -Fc socket:
1515

Why does connectd own so many sockets? That's more than the number of channels I have open, let alone the number of peers actually connected.