ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

connectd hanging while being unable to connect to peers #7462

Closed grubles closed 1 month ago

grubles commented 1 month ago

Running master at 029034a71bd6b7506b9e921ffa94d722bbe0424a. CLN can't connect to any peers and lightning_connectd seems to hang at 100% CPU.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                     
 567272 test      20   0   60672  49152  45056 R 100.0   0.1   2:17.99 lightning_connectd

When trying to shut down CLN, lightningd and lightning_connectd each hang at 100% CPU, and the stop command also hangs indefinitely. I have to use kill to end those processes in order to restart CLN.
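
For anyone who wants to check for this state without watching top by hand, here is a minimal sketch. It assumes a procps-style ps that accepts -eo pid=,pcpu=,args=; the helper name busy_processes and the 90% threshold are made up for illustration. Note that ps reports CPU averaged over the lifetime of the process, so a freshly spinning daemon may take a while to cross the threshold.

```python
#!/usr/bin/env python3
import subprocess


def busy_processes(ps_output, name="lightning_connectd", threshold=90.0):
    """Return (pid, cpu) pairs for processes whose command line contains
    `name` and whose CPU usage is at or above `threshold` percent.

    `name` and `threshold` are illustrative defaults, not CLN settings.
    """
    busy = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, %cpu, full command line
        if len(fields) < 3:
            continue
        pid, cpu, args = fields
        if name in args and float(cpu) >= threshold:
            busy.append((int(pid), float(cpu)))
    return busy


if __name__ == "__main__":
    try:
        # One header-less line per process: pid, %cpu, command line.
        out = subprocess.run(
            ["ps", "-eo", "pid=,pcpu=,args="],
            stdout=subprocess.PIPE, text=True,
        ).stdout
    except OSError:
        out = ""  # ps is not available on this system
    for pid, cpu in busy_processes(out):
        print(f"PID {pid} is at {cpu}% CPU")
```

Running it in a loop (for example under watch) while reproducing the problem should show whether lightning_connectd stays pinned.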

There doesn't seem to be anything useful in debug.log to share.

CLN config includes:

experimental-dual-fund
experimental-splicing
experimental-offers
experimental-peer-storage
experimental-quiesce

On a different machine, I am able to reproduce this without those experimental config options.

kilrau commented 1 month ago

So looks like https://github.com/ElementsProject/lightning/pull/7365 didn't fix anything for you...

grubles commented 1 month ago

This is more of a medium-sized node and wasn't running into the CPU usage that PR addresses, so I'm not sure. Also the other machine I tested on has a single signet channel and was experiencing the issue described above.

kilrau commented 1 month ago

OK, something different then; we'll go and test #7365.

michael1011 commented 1 month ago

I can reproduce that problem on a fresh, new node:

  1. Create a new node with latest master
  2. Connect to some peers
  3. Watch connectd spike to 100% CPU

To make this easier to reproduce, I wrote a small script. Run lightning-cli listnodes > nodes.json and then this Python script to connect to some nodes, and you'll see connectd go wild:

#!/usr/bin/env python3
import json
import subprocess

# nodes.json is the output of `lightning-cli listnodes`
with open('nodes.json') as f:
    nodes = json.load(f)["nodes"]

print(f"Got {len(nodes)} nodes")

# Keep only nodes that advertise at least one address
with_address = []

for node in nodes:
    if "addresses" not in node or len(node["addresses"]) == 0:
        continue

    with_address.append(node)

print(f"{len(with_address)} with address")

# Build id@host:port connection strings for IPv4 addresses only
ipv4 = []

for node in with_address:
    for address in node["addresses"]:
        if address["type"] != "ipv4":
            continue

        ipv4.append(f"{node['nodeid']}@{address['address']}:{address['port']}")

print(f"{len(ipv4)} with IPV4 address")

for i, address in enumerate(ipv4):
    print(f"Connecting to {i+1}/{len(ipv4)}: {address}")
    # Give each connect attempt at most 10 seconds; an argument list
    # avoids passing gossip-derived strings through a shell
    res = subprocess.run(
        ["timeout", "10", "lightning-cli", "connect", address],
        stdout=subprocess.PIPE,
    ).stdout
    try:
        print(json.dumps(
            json.loads(res),
            indent=4,
        ))
    except json.JSONDecodeError:
        # No (or non-JSON) output, which is what a killed, timed-out call leaves
        print("Connect timed out")

Edit:

This is definitely a regression since v24.05. I created a new node with v24.05 and ran the script; it was just fine. Updated to master, ran it again and connectd jumped to 100% CPU before it even connected to the first peer.

hMsats commented 1 month ago

Can confirm the original post.

Channel: main node <-> test node

v24.05 <-> v24.05: no problems

v24.05 <-> master: same problems

When I return to v24.05, everything is fine again.

hMsats commented 1 month ago

I added the one-line change from the pull request to gossmap.c and it solved the issue for me!