RPC never becomes available after upgrade to v22.11

sethforprivacy commented 1 year ago

Issue and Steps to Reproduce

Upgrade to v22.11
Start lightningd (as a part of Docker Compose)
lightningd says its started and connected to bitcoind without issues
lightning-rpc is never available and refuses connections

`getinfo` output

sethforprivacy in 🌐 apps in apps on  master [!?] via  v16.18.1 took 1m46s ❯ docker exec -ti lightningd bash
root@e9b0fde18d67:/# lightning-cli getinfo
lightning-cli: Connecting to 'lightning-rpc': Connection refused
root@e9b0fde18d67:/#

debug logs

https://paste.sethforprivacy.com/?3e19275f8fcba674#8zLSy7EWEjsxaLV5CntSK93g6rJUEETvjL6JBjhN19sH

Config

# CLN general settings
alias=SethForPrivacy
rgb=F7484A
proxy=172.17.0.1:9050
bind-addr=0.0.0.0:9735
announce-addr=172.86.183.74:9735
addr=statictor:172.17.0.1:9051
network=bitcoin
experimental-dual-fund
experimental-offers
database-upgrade=true
autoclean-expiredinvoices-age=86400
autoclean-failedpays-age=86400
autoclean-failedforwards-age=86400
rpc-file=/root/.lightning/bitcoin/lightning-rpc

# Bitcoin RPC settings
bitcoin-rpcuser=REDACTED
bitcoin-rpcpassword=REDACTED
bitcoin-rpcconnect=REDACTED
bitcoin-rpcport=REDACTED

# Logging settings
log-level=debug

sethforprivacy commented 1 year ago

I can't downgrade now to v0.12.1 due to DB upgrade, so relatively urgent to find some solution to this as my node is down until some solution is found.

vincenzopalazzo commented 1 year ago

are you in a big node? and also an old node?

sethforprivacy commented 1 year ago

Nothing crazy about my node, and not that old. Have been running it since CLN v0.10.2 IIRC.

sethforprivacy commented 1 year ago

lightningd won't even create a lightning-rpc file in the given RPC path (or even try AFAICT):

rpc-file=/root/.lightning/bitcoin/lightning-rpc

sethforprivacy in 🌐 apps in apps on  master [!?] via  v16.18.1 ❯ ls -la .lightning/bitcoin
total 306448
drwx------ 2 sethforprivacy sethforprivacy      4096 Dec  2 15:15 .
drwx------ 4 sethforprivacy sethforprivacy      4096 Nov  5 18:57 ..
-rw-r--r-- 1 sethforprivacy sethforprivacy    143360 Nov 30 18:28 accounts.sqlite3
-rw-rw-r-- 1 sethforprivacy sethforprivacy        47 May 18  2022 backup.lock
-rw-r--r-- 1 root           root                 246 Nov  5 18:51 ca-key.pem
-rw-r--r-- 1 root           root                 572 Nov  5 18:51 ca.pem
-rw-r--r-- 1 root           root                 246 Nov  5 18:51 client-key.pem
-rw-r--r-- 1 root           root                 510 Nov  5 18:51 client.pem
-rw------- 1 root           root             7843616 Nov 30 18:11 crash.log.20221130181135
-r-------- 1 sethforprivacy sethforprivacy       735 Oct 21 16:24 emergency.recover
-rw------- 1 root           root            60748931 Dec  1 14:57 gossip_store
-rw------- 1 root           root            61559204 Nov 14 20:23 gossip_store.corrupt
-r-------- 1 sethforprivacy sethforprivacy        32 May 17  2022 hsm_secret
-r-------- 1 sethforprivacy sethforprivacy        32 May 17  2022 keys.clboss
-rw-r--r-- 1 sethforprivacy sethforprivacy 183439360 Dec  2 15:15 lightningd.sqlite3
-rw-r--r-- 1 root           root                 246 Nov  5 18:51 server-key.pem
-rw-r--r-- 1 root           root                 510 Nov  5 18:51 server.pem

Note that it previously still had the RPC file and was not working, but I deleted it to try and let lightningd recreate and noticed this second issue.

Testing different folder permissions now.

sethforprivacy commented 1 year ago

Tested and lightningd can successfully create/edit/manage files in the .lightning/bitcoin directory as expected.

grubles commented 1 year ago

Hm I wonder what the owner and permissions were for the lightning-rpc file that you mentioned was created at some point. Somehow you have root-owned files mixed in with user-owned files so maybe lightningd simply doesn't have the permissions to create lightning-rpc for some reason or another. If lightningd is in a container, how did the user-owned files end up there? Not a docker expert but I thought everything ran as root inside the container.

sethforprivacy commented 1 year ago

The files were pulled in from a non-Docker setup originally, thus some of the files being owned by another user.

I've already validated that lightningd can create files as and set permissions above, and the previous socket file was owned by root:root before I deleted it.

sethforprivacy commented 1 year ago

Trying to go back far enough to get logs (debug logs are a lot once things are connected..) but looks like it just started working out of nowhere ~3h after starting lightningd.

vincenzopalazzo commented 1 year ago

I think this is more a problem related to the Docker container instance, but I'm not an expert so maybe @cdecker has some more experience with this kind of stuff

rustyrussell commented 1 year ago

Wtf was it doing for 3 hours?

Random thought: is some plugin taking ages to start, and we're not correctly timing it out?

sethforprivacy commented 1 year ago

I'm not loading any custom plugins as you can see from the config, so not sure on that one...

Sadly by the time I noticed it had finally started I lost scrollback, will see if it was logging to disk as well and try to grab logs.

For at least the first couple of hours it just had the logs in the OP and feerate updates, nothing else.

rustyrussell commented 1 year ago

OK, I looked harder. For some reason, we took three hours to get the first block from bitcoind! I don't know what to say about that, TBH :(

vincenzopalazzo commented 1 year ago

OK, I looked harder. For some reason, we took three hours to get the first block from bitcoind! I don't know what to say about that, TBH :(

With docker this kind of delay was present also before, I usually clean up the docker image and recreate it from 0 (storing the .lightning in my local machine)

I have no idea how a system like docker works, but this does the trick for me. In addition I also found sometimes docker slow to answer to command like lightning-cli getinfo

cdecker commented 1 year ago

I'm a bit surprised lightningd didn't give up earlier and actually waited 3h for the first block to be retrieved. But let's investigate this a bit further. I usually start by making sure bitcoind and bitcoin-cli can work with the parameters passed to them. So a docker exec CID bitcoin-cli -rpcconnect=... getblockhash 0 can tell us that the connection to bitcoind from the container with CID as it's id. If that's already slow we found the culprit, if not it can be bcli or lightningd. We can exclude bcli by replacing it with something like trustedcoin, which relies on Blockchain explorers instead.

If that's still slow we have narrowed it down to something in lightningd, then we could take a look at syscalls using strace which can give us a rough idea on where we get stuck in lightningd

cdecker commented 1 year ago

Notice that very little here is docker specific, because I can't think of something that could have that impact (not a docker expert just a user)

sethforprivacy commented 1 year ago

No problems at all manually querying via bitcoin-cli:

sethforprivacy in 🌐 apps in apps on  master [!?] via  v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332    0.02s user 0.02s system 15% cpu 0.270 total
sethforprivacy in 🌐 apps in apps on  master [!?] via  v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332    0.01s user 0.03s system 16% cpu 0.236 total
sethforprivacy in 🌐 apps in apps on  master [!?] via  v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332    0.01s user 0.03s system 19% cpu 0.200 total
sethforprivacy in 🌐 apps in apps on  master [!?] via  v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332    0.02s user 0.01s system 15% cpu 0.218 total
sethforprivacy in 🌐 apps in apps on  master [!?] via  v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332    0.03s user 0.02s system 21% cpu 0.198 total

Note that I am using a remote bitcoind that is reachable via Tailscale, latency is ~40ms.

I have never had an issue remotely like this, either Dockerized or not before I updated to v22.11. lightningd has been responsive and started up quickly until this update, and now suddenly I get this massive 3h delay at startup.

rustyrussell commented 1 year ago

OK, so to be precise, we see this in logs:

lightningd  | 2022-12-01T17:21:14.088Z DEBUG   lightningd: io_break: plugin_config_cb
lightningd  | 2022-12-01T17:21:14.088Z DEBUG   lightningd: io_loop_with_timers: config_plugin
lightningd  | 2022-12-01T17:21:14.088Z DEBUG   lightningd: All Bitcoin plugin commands registered

Then nothing but complaints about feerates at least through 2022-12-01T17:29:27.863Z (8 minutes later).

In particular, once it's successfully done chain setup, it would print "io_break: maybe_completed_init".

chain setup (ignoring weird cases which would have logged) is:

Call getchaininfo
Call getrawblockbyheight (plugin calls getblockhash, then getrawblock).

Once that's returned, and the feerate is initialized (which logs show happens after 5 seconds), we expect completion (and that "io_break: maybe_completed_init" message).

But, if it's really bitcoind going out to lunch, you'd expect to see an UNUSUAL log message, like this:

"UNUSUAL plugin-bcli: bitcoin-cli: finished /usr/local/bin/bitcoin-cli getblock 000000000000000000013c11f46c0dfb6926bc1d6662575d24a226767325fbec 0 (18446744073709550918 ms)"

Do you have one of those (grep 'UNUSUAL plugin-bcli' in your logs may get something)?

sethforprivacy commented 1 year ago

@rustyrussell I only have one of those, but I don't have logs going back to the start of lightningd here:

lightningd  | 2022-12-05T09:27:49.697Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=... -stdinrpcpass estimatesmartfee 12 ECONOMICAL (130657 ms)

I'm restarting lightningd now to test and see what happens and be sure I grab logs as it happens.

sethforprivacy commented 1 year ago

On restart things functioned as normal:

https://paste.sethforprivacy.com/?efc9732f14ac50b3#8tarjt8zhZKESrMLnqsUzTzLWx3isbvnaw8TvH9FHxfk

No idea why previously it wouldn't start for ~3h on the dot and then would function perfectly, nothing has changed with bitcoind used here, it hasn't even been restarted.

NicolasDorier commented 1 year ago

I don't know if that's the same issue as here. I had a public node that I use to try lightningd update just before pushing those to all users of BTCPay Server docker deployment.

This node was offline for a big amount of time. (and running bitcoin pruned node) When I updated from 0.10.2 to 22.11, the database update ran, the node was running normally, no error in the logs but getinfo was showing "blockheight": 0, it was stuck there, and I expect it was caused because my node was offline for too long, and maybe couldn't sync as blocks were pruned.

So I deleted all of my node's data, then try again to run a 0.10.2, followed by the update. This time it worked.

I am unsure if my problem was related to this issue, just reporting in case.

cdecker commented 1 year ago

The blockheight: 0 thing is normal until we have caught up with the blockchain (gossipd uses the blockheight in getinfo to trigger some additional verification). So the problem here is that either bitcoind is taking a really long time to return the first block (long time no logs, just waiting before even the Server started line shows up), or we are syncing with the blockchain (logs saying Adding block show if --log-level=debug).

I'm tempted to say that @NicolasDorier ran into the latter, while @sethforprivacy seems to have triggered an issue where we silently lost the connection to bitcoind and it just took forever to realize that.

sethforprivacy commented 1 year ago

Only thing I have to add at this point was that I was running v0.23.0 of bitcoind at the time of the bug, but have updated today to v0.24.0 without issues.

I still haven't had the bug reoccur after several lightningd restarts to try and trigger it. Sadly a hard one to reproduce :(

rustyrussell commented 1 year ago

You can see from your logs that bitcoind took 130 seconds to respond at one point. That's weird....

sethforprivacy commented 1 year ago

It is running on a relatively low powered NAS (still amd64, but an i3 w/ 20GB RAM + standard hard drives, not SSD/NVMe).

rustyrussell commented 1 year ago

I'm closing this as unreproducible for now. Feel free to reopen if we get more information?

ElementsProject / lightning