I can't downgrade to v0.12.1 now due to the DB upgrade, so this is relatively urgent; my node is down until a solution is found.
Are you running a big node? And is it also an old node?
Nothing crazy about my node, and not that old. Have been running it since CLN v0.10.2 IIRC.
lightningd won't even create a lightning-rpc file in the given RPC path (or even try AFAICT):
rpc-file=/root/.lightning/bitcoin/lightning-rpc
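(As an aside, a quick way to confirm the socket really is absent, assuming the path from the config above; -S tests specifically for a unix socket:)
# hypothetical check; path taken from the rpc-file setting above
test -S /root/.lightning/bitcoin/lightning-rpc || echo "no lightning-rpc socket"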
sethforprivacy in 🌐 apps in apps on master [!?] via v16.18.1 ❯ ls -la .lightning/bitcoin
total 306448
drwx------ 2 sethforprivacy sethforprivacy 4096 Dec 2 15:15 .
drwx------ 4 sethforprivacy sethforprivacy 4096 Nov 5 18:57 ..
-rw-r--r-- 1 sethforprivacy sethforprivacy 143360 Nov 30 18:28 accounts.sqlite3
-rw-rw-r-- 1 sethforprivacy sethforprivacy 47 May 18 2022 backup.lock
-rw-r--r-- 1 root root 246 Nov 5 18:51 ca-key.pem
-rw-r--r-- 1 root root 572 Nov 5 18:51 ca.pem
-rw-r--r-- 1 root root 246 Nov 5 18:51 client-key.pem
-rw-r--r-- 1 root root 510 Nov 5 18:51 client.pem
-rw------- 1 root root 7843616 Nov 30 18:11 crash.log.20221130181135
-r-------- 1 sethforprivacy sethforprivacy 735 Oct 21 16:24 emergency.recover
-rw------- 1 root root 60748931 Dec 1 14:57 gossip_store
-rw------- 1 root root 61559204 Nov 14 20:23 gossip_store.corrupt
-r-------- 1 sethforprivacy sethforprivacy 32 May 17 2022 hsm_secret
-r-------- 1 sethforprivacy sethforprivacy 32 May 17 2022 keys.clboss
-rw-r--r-- 1 sethforprivacy sethforprivacy 183439360 Dec 2 15:15 lightningd.sqlite3
-rw-r--r-- 1 root root 246 Nov 5 18:51 server-key.pem
-rw-r--r-- 1 root root 510 Nov 5 18:51 server.pem
Note that the RPC file previously existed and things were still not working; I deleted it to try and let lightningd recreate it, and noticed this second issue.
Testing different folder permissions now.
Tested and lightningd can successfully create/edit/manage files in the .lightning/bitcoin
directory as expected.
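(Roughly a check along these lines, assuming the container is named lightningd and the datadir is mounted at /root/.lightning as in the config above:)
# create and delete a scratch file as whatever user the container runs lightningd as
docker exec lightningd touch /root/.lightning/bitcoin/permtest
docker exec lightningd rm /root/.lightning/bitcoin/permtest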
Hm, I wonder what the owner and permissions were for the lightning-rpc file that you mentioned was created at some point. Somehow you have root-owned files mixed in with user-owned files, so maybe lightningd simply doesn't have the permissions to create lightning-rpc for some reason or another. If lightningd is in a container, how did the user-owned files end up there? Not a docker expert, but I thought everything ran as root inside the container.
The files were pulled in from a non-Docker setup originally, hence some of the files being owned by another user. I've already validated above that lightningd can create files and set permissions, and the previous socket file was owned by root:root before I deleted it.
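(For what it's worth, the in-container user can be confirmed from the host: docker top lists a running container's processes together with their UIDs. Container name lightningd assumed.)
docker top lightningd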
Trying to go back far enough to get logs (debug logs are a lot once things are connected...), but it looks like it just started working out of nowhere ~3h after starting lightningd.
I think this is more a problem related to the Docker container instance, but I'm not an expert, so maybe @cdecker has more experience with this kind of thing.
Wtf was it doing for 3 hours?
Random thought: is some plugin taking ages to start, and we're not correctly timing it out?
I'm not loading any custom plugins as you can see from the config, so not sure on that one...
Sadly, by the time I noticed it had finally started, I had lost scrollback; I'll see if it was logging to disk as well and try to grab logs.
For at least the first couple of hours it just had the logs in the OP and feerate updates, nothing else.
OK, I looked harder. For some reason, we took three hours to get the first block from bitcoind! I don't know what to say about that, TBH :(
With Docker this kind of delay was present before as well; I usually clean up the Docker image and recreate it from scratch (keeping .lightning on my local machine).
I have no idea how a system like Docker works internally, but this does the trick for me. In addition, I have sometimes found Docker slow to answer commands like lightning-cli getinfo.
I'm a bit surprised lightningd didn't give up earlier and actually waited 3h for the first block to be retrieved. But let's investigate this a bit further. I usually start by making sure bitcoind and bitcoin-cli can work with the parameters passed to them: a docker exec CID bitcoin-cli -rpcconnect=... getblockhash 0 can tell us whether the connection to bitcoind from the container with CID as its ID works. If that's already slow, we've found the culprit; if not, it can be bcli or lightningd. We can exclude bcli by replacing it with something like trustedcoin, which relies on blockchain explorers instead. If that's still slow, we have narrowed it down to something in lightningd, and then we could take a look at syscalls using strace, which can give us a rough idea of where we get stuck in lightningd. A sketch of these steps follows below.
Notice that very little here is docker-specific, because I can't think of something that could have that impact (not a docker expert, just a user).
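(A rough sketch of those steps; CID, the RPC host and credentials, and the availability of strace and pidof inside the image are all assumptions:)
# Step 1: is the RPC connection from inside the container fast?
time docker exec CID bitcoin-cli -rpcconnect=<host> -rpcport=8332 -rpcuser=<user> -rpcpassword=<pass> getblockhash 0

# Step 2: if that was fast, attach strace to the running lightningd to see where it blocks
# (needs strace and pidof in the image, plus ptrace permission on the container)
docker exec CID strace -f -tt -p "$(docker exec CID pidof lightningd)"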
No problems at all manually querying via bitcoin-cli:
sethforprivacy in 🌐 apps in apps on master [!?] via v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 0.02s user 0.02s system 15% cpu 0.270 total
sethforprivacy in 🌐 apps in apps on master [!?] via v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 0.01s user 0.03s system 16% cpu 0.236 total
sethforprivacy in 🌐 apps in apps on master [!?] via v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 0.01s user 0.03s system 19% cpu 0.200 total
sethforprivacy in 🌐 apps in apps on master [!?] via v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 0.02s user 0.01s system 15% cpu 0.218 total
sethforprivacy in 🌐 apps in apps on master [!?] via v16.18.1 ❯ time docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=REDACTED -rpcpassword=REDACTED getblockhash 0
000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f
docker exec -ti lightningd bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 0.03s user 0.02s system 21% cpu 0.198 total
Note that I am using a remote bitcoind reachable via Tailscale; latency is ~40ms.
I have never had an issue remotely like this, either Dockerized or not, before I updated to v22.11. lightningd had been responsive and started up quickly until this update, and now suddenly I get this massive 3h delay at startup.
OK, so to be precise, we see this in logs:
lightningd | 2022-12-01T17:21:14.088Z DEBUG lightningd: io_break: plugin_config_cb
lightningd | 2022-12-01T17:21:14.088Z DEBUG lightningd: io_loop_with_timers: config_plugin
lightningd | 2022-12-01T17:21:14.088Z DEBUG lightningd: All Bitcoin plugin commands registered
Then nothing but complaints about feerates at least through 2022-12-01T17:29:27.863Z (8 minutes later).
In particular, once it's successfully done chain setup, it would print "io_break: maybe_completed_init".
chain setup (ignoring weird cases which would have logged) is:
Once that's returned, and the feerate is initialized (which logs show happens after 5 seconds), we expect completion (and that "io_break: maybe_completed_init" message).
But, if it's really bitcoind going out to lunch, you'd expect to see an UNUSUAL log message, like this:
"UNUSUAL plugin-bcli: bitcoin-cli: finished /usr/local/bin/bitcoin-cli getblock 000000000000000000013c11f46c0dfb6926bc1d6662575d24a226767325fbec 0 (18446744073709550918 ms)"
Do you have one of those? (Grepping for 'UNUSUAL plugin-bcli' in your logs may turn something up; see the example below.)
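(For example, if lightningd logs to the container's stdout; container name assumed:)
docker logs lightningd 2>&1 | grep 'UNUSUAL plugin-bcli'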
@rustyrussell I only have one of those, but I don't have logs going back to the start of lightningd here:
lightningd | 2022-12-05T09:27:49.697Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -rpcconnect=100.64.0.4 -rpcport=8332 -rpcuser=... -stdinrpcpass estimatesmartfee 12 ECONOMICAL (130657 ms)
I'm restarting lightningd now to test and see what happens and be sure I grab logs as it happens.
On restart things functioned as normal:
https://paste.sethforprivacy.com/?efc9732f14ac50b3#8tarjt8zhZKESrMLnqsUzTzLWx3isbvnaw8TvH9FHxfk
No idea why it previously wouldn't start for ~3h on the dot and then would function perfectly; nothing has changed with the bitcoind used here, and it hasn't even been restarted.
I don't know if that's the same issue as here. I had a public node that I use to try lightningd updates just before pushing them to all users of the BTCPay Server Docker deployment.
This node was offline for a long time (and running against a pruned bitcoind).
When I updated from 0.10.2 to 22.11, the database update ran and the node appeared to run normally, with no errors in the logs, but getinfo
was showing "blockheight": 0,
and it was stuck there. I expect this was because my node was offline for too long, and maybe it couldn't sync because the blocks had been pruned.
So I deleted all of my node's data, then tried again: running 0.10.2 first, followed by the update. This time it worked.
I am unsure if my problem was related to this issue; just reporting in case.
The blockheight: 0 thing is normal until we have caught up with the blockchain (gossipd uses the blockheight in getinfo to trigger some additional verification). So the problem here is that either bitcoind is taking a really long time to return the first block (a long time with no logs, just waiting, before even the Server started line shows up), or we are syncing with the blockchain (logs saying Adding block show up if --log-level=debug).
I'm tempted to say that @NicolasDorier ran into the latter, while @sethforprivacy seems to have triggered an issue where we silently lost the connection to bitcoind and it just took forever to realize that. A quick way to tell the two apart is sketched below.
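(A sketch, assuming the container is named lightningd and logs go to stdout; the log strings are the ones mentioned above:)
# many "Adding block" lines => we are syncing with the blockchain
docker logs lightningd 2>&1 | grep -c 'Adding block'
# no "Server started" line yet => still in early init, likely waiting on bitcoind
docker logs lightningd 2>&1 | grep 'Server started'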
The only thing I have to add at this point is that I was running bitcoind v23.0 at the time of the bug, but I have updated today to v24.0 without issues.
I still haven't had the bug reoccur after several lightningd restarts trying to trigger it. Sadly, a hard one to reproduce :(
You can see from your logs that bitcoind took 130 seconds to respond at one point. That's weird....
It is running on a relatively low-powered NAS (still amd64, but an i3 with 20 GB of RAM and standard hard drives, not SSD/NVMe).
I'm closing this as unreproducible for now. Feel free to reopen if we get more information?
Issue and Steps to Reproduce
getinfo output
debug logs:
https://paste.sethforprivacy.com/?3e19275f8fcba674#8zLSy7EWEjsxaLV5CntSK93g6rJUEETvjL6JBjhN19sH
Config