ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Abort due to RPC queue exhaustion in `bitcoind` #3905

Open ghost opened 4 years ago

ghost commented 4 years ago

Issue and Steps to Reproduce

C-lightning (v0.9.0-2-ge7d89cd) crashed while it was just running on my server and I wasn't doing anything. C-lightning ran fine again after a restart.

Here is the relevant crash log:

+353275.352180714 gossipdDEBUG: Ignoring redundant update for 541045x1888x0/1 (last 1596353997, now 1596613197)
+353275.352518922 gossipdDEBUG: Received channel_update for channel 563904x1769x0/0 now ACTIVE
+353275.352913631 gossipdDEBUG: Ignoring redundant update for 563904x1769x0/0 (last 1596353997, now 1596613197)
+353275.353255043 gossipdDEBUG: Received channel_update for channel 632842x2096x0/1 now ACTIVE
+353275.353595650 gossipdDEBUG: Ignoring redundant update for 632842x2096x0/1 (last 1596281411, now 1596612611)
+353275.353648295 gossipdDEBUG: Received channel_update for channel 558820x1513x0/1 now ACTIVE
+353275.353690830 gossipdDEBUG: Ignoring redundant update for 558820x1513x0/1 (last 1596598210, now 1596612607)
+353275.353731056 gossipdDEBUG: Received channel_update for channel 600602x390x0/1 now DISABLED
+353275.353771566 gossipdDEBUG: Ignoring redundant update for 600602x390x0/1 (last 1596267449, now 1596613049)
+353275.353811632 gossipdDEBUG: Received channel_update for channel 603212x99x0/1 now ACTIVE
+353275.353866916 gossipdDEBUG: Ignoring redundant update for 603212x99x0/1 (last 1596260309, now 1596613109)
+353275.478688276 plugin-bcliIO_IN: 7b226a736f6e727063223a22322e30222c226d6574686f64223a226c6f67222c22706172616d73223a7b226c6576656c223a227761726e222c226d657373616765223a22626974636f696e2d636c6920676574626c6f636b6861736820363432333035206578697465642077697468207374617475732031227d207d0a0a
+353275.478713571 plugin-bcliUNUSUAL: bitcoin-cli getblockhash 642305 exited with status 1
+353276.639984311 plugin-bcliINFO: Killing plugin: Plugin exited before completing handshake.
+353276.640058453 lightningdBROKEN: The Bitcoin backend died.
+353277.452120031 lightningdBROKEN: FATAL SIGNAL 6 (version v0.9.0-2-ge7d89cd)
+353277.452155928 lightningdBROKEN: backtrace: common/daemon.c:43 (send_backtrace) 0x55ed94e39fc7
+353277.452169622 lightningdBROKEN: backtrace: common/daemon.c:51 (crashdump) 0x55ed94e3a012
+353277.452180698 lightningdBROKEN: backtrace: (null):0 ((null)) 0x7f98b9393f1f
+353277.452192168 lightningdBROKEN: backtrace: (null):0 ((null)) 0x7f98b9393e97
+353277.452201490 lightningdBROKEN: backtrace: (null):0 ((null)) 0x7f98b9395800
+353277.452211656 lightningdBROKEN: backtrace: lightningd/log.c:820 (fatal) 0x55ed94e1aff0
+353277.452222138 lightningdBROKEN: backtrace: lightningd/bitcoind.c:42 (bitcoin_destructor) 0x55ed94e07ae1
+353277.452232451 lightningdBROKEN: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x55ed94e88abc
+353277.452242385 lightningdBROKEN: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x55ed94e88b20
+353277.452271164 lightningdBROKEN: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x55ed94e8907d
+353277.452280448 lightningdBROKEN: backtrace: lightningd/plugin.c:217 (plugin_kill) 0x55ed94e2f7d3
+353277.452289265 lightningdBROKEN: backtrace: lightningd/plugin.c:526 (plugin_conn_finish) 0x55ed94e2f7ed
+353277.452299098 lightningdBROKEN: backtrace: ccan/ccan/io/poll.c:244 (destroy_conn) 0x55ed94e80182
+353277.452308039 lightningdBROKEN: backtrace: ccan/ccan/io/poll.c:250 (destroy_conn_close_fd) 0x55ed94e801a4
+353277.452317034 lightningdBROKEN: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x55ed94e88abc
+353277.452325753 lightningdBROKEN: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x55ed94e88b20
+353277.452334406 lightningdBROKEN: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x55ed94e8907d
+353277.452363384 lightningdBROKEN: backtrace: ccan/ccan/io/io.c:450 (io_close) 0x55ed94e7edac
+353277.452393787 lightningdBROKEN: backtrace: ccan/ccan/io/poll.c:449 (io_loop) 0x55ed94e8076c
+353277.452403540 lightningdBROKEN: backtrace: lightningd/io_loop_with_timers.c:24 (io_loop_with_timers) 0x55ed94e158aa
+353277.452413967 lightningdBROKEN: backtrace: lightningd/lightningd.c:997 (main) 0x55ed94e18f4e
+353277.452422762 lightningdBROKEN: backtrace: (null):0 ((null)) 0x7f98b9376b96
+353277.452431638 lightningdBROKEN: backtrace: (null):0 ((null)) 0x55ed94e079d9
+353277.452439506 lightningdBROKEN: backtrace: (null):0 ((null)) 0xffffffffffffffff

getinfo output

"version": "v0.9.0-2-ge7d89cd",

ghost commented 4 years ago

@d4amenace No. Indeed, bitcoind was the first thing I checked and it was running fine, still holding around the maximum number of connections. My Electrum server was also still running.

vincenzopalazzo commented 4 years ago

@sumBTC when you restart the node, the problem no longer exists, right?

Is it possible to share a piece of the bitcoind log, if you have it? I'm curious to see how bitcoind answered the getblockhash request.

In my opinion, c-lightning received an unexpected answer from bitcoind; I say this because I see this log line in your original message:

+353275.478713571 plugin-bcliUNUSUAL: bitcoin-cli getblockhash 642305 exited with status 1

ghost commented 4 years ago

@vincenzopalazzo Hi palazzone, you can find my bitcoind debug.log here

EDIT: I looked at it myself and I see a lot of "WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting"

I don't know what it means.

Edit:

when you restart the node, the problem not longer exists right?

right

ZmnSCPxj commented 4 years ago

We should not reach the work queue depth, since we specifically throttle our requests to bitcoind to avoid that.

Unless you have multiple services talking to bitcoind, not just lightningd... do you?
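
(For illustration only, here is a minimal sketch of the throttling idea with made-up names and limits; it is not bcli's actual code. The point is just to cap the number of in-flight bitcoind requests and park the rest until a slot frees up.)

#include <stdio.h>

/* Illustrative sketch: never have more than MAX_PARALLEL bitcoind
 * requests in flight; queue the rest. */
#define MAX_PARALLEL 4
#define MAX_PENDING 64

static int in_flight;
static const char *pending[MAX_PENDING];
static int n_pending;

/* Either send the request now or park it until a slot frees up. */
static void submit_request(const char *req)
{
        if (in_flight < MAX_PARALLEL) {
                in_flight++;
                printf("sending now: %s\n", req);  /* would exec bitcoin-cli here */
        } else if (n_pending < MAX_PENDING) {
                pending[n_pending++] = req;
        }
}

/* Called when a request completes: free the slot and drain the backlog. */
static void request_done(void)
{
        in_flight--;
        if (n_pending > 0) {
                in_flight++;
                printf("sending queued: %s\n", pending[--n_pending]);
        }
}

int main(void)
{
        for (int i = 0; i < 6; i++)
                submit_request("getblockhash");
        request_done();  /* a slot frees, one queued request goes out */
        return 0;
}

So lightningd by itself should never put more than a handful of requests on bitcoind's queue; if the work queue still overflows, the extra load is presumably coming from everything else sharing the same bitcoind.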

ghost commented 4 years ago

I also run ElectrumX (a public Electrum server), another c-lightning (test) node (same version) that only has a channel to my main node, and Electrum Personal Server.

EDIT: my test c-lightning node didn't crash

vincenzopalazzo commented 4 years ago

From @ZmnSCPxj's answer and your log, I understand that your c-lightning on mainnet crashed because your setup with c-lightning and the ElectrumX server together behaved like a "zombie attack" on the network. In other words, too many requests were made to bitcoind, and bitcoind answered with the log line "WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting", i.e. "wait, I have some work to do before accepting other requests".

You can increase the size of the queue as bitcoind suggests in that log line.

P.S.: c-lightning on testnet didn't crash because the testnet queue and the mainnet queue are different.

Another P.S.: This is only my view.

ghost commented 4 years ago

@vincenzopalazzo The question, of course, is: should c-lightning crash for that reason?

BTW, my other node is a test node (for software updates and other uses), not a testnet node; it also runs on mainnet. Probably it wasn't making any requests.

ZmnSCPxj commented 4 years ago

Probably it wasn't making any requests.

It was. We have a loop every ~~10 seconds~~ 30 seconds or so that polls bitcoind for a new block, even if we have no channels or anything. With enough activity from your active node, the testing node, and ElectrumX, it could indeed periodically overload the queue depth.
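
(Roughly like this, as a sketch with assumed names rather than the real lightningd/bcli code:)

#include <stdio.h>
#include <unistd.h>

#define POLL_INTERVAL 30  /* seconds between polls for a new tip */

/* Placeholder: in reality this goes out through the bcli plugin as RPC calls. */
static void poll_bitcoind_for_new_block(void)
{
        printf("asking bitcoind for the current tip...\n");
}

int main(void)
{
        for (;;) {  /* runs even with zero channels */
                poll_bitcoind_for_new_block();
                sleep(POLL_INTERVAL);
        }
}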

That said... my understanding is that bcli should still not crash then, it should keep retrying. @darosior ?

darosior commented 4 years ago

We should definitely be clearer in the documentation about our use of bitcoind's RPC threads. It is quite picky and will refuse any concurrent request made above the -rpcthreads threshold (which defaults to 4), and sometimes silently, iirc.

That said... my understanding is that bcli should still not crash then, it should keep retrying. @darosior ?

It should, under a certain threshold (1 minute).

As the log line preceding the death of bcli tells us, we've surely been retrying for 1 minute; however, we crash before telling you so. (edit: reproduced, that's it)

My guess is that electrumx has been heavily loading your poor bitcoind :-( (edit: ok, it's definitely this given your debug.log ...).

@sumBTC (and anyone reading this): you should really increase the rpcthreads configuration of your bitcoind if you run any application other than lightningd on top of it. Regarding the crash, I'll look into it.
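
(As a rough example only, since the right numbers depend on what else is hitting your bitcoind, the relevant bitcoin.conf settings look like this; the defaults are 4 threads and a queue depth of 16.)

# bitcoin.conf: illustrative values, tune to your own setup
rpcthreads=8
rpcworkqueue=32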

ghost commented 4 years ago

@darosior I will. Thanks!

darosior commented 4 years ago

Ok, I've reproduced this locally, so to sum up:

ZmnSCPxj commented 4 years ago

@darosior nice! We might want to mention `-rpcthreads` in README.md:

If you have more than one `lightningd` instance using a single `bitcoind`, make sure to
increase `-rpcthreads` for your `bitcoind`.
Each `lightningd` instance consumes up to 4 RPC threads.
If you run other services on top of the same `bitcoind` as well, you will want to allocate
more `-rpcthreads` as well;
4 per service is a reasonable worst-case, but you should consult the documentation of
that service, or ask its developers.
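
For example, applying that 4-per-service worst case to the setup described earlier in this thread (my arithmetic, purely illustrative):

# two lightningd instances + ElectrumX + Electrum Personal Server
# 2*4 + 4 + 4 = 16 worst-case concurrent RPC requests
rpcthreads=16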

d4amenace commented 4 years ago

If you have more than one `lightningd` instance using a single `bitcoind`, make sure to
increase `-rpcthreads` for your `bitcoind`.
Each `lightningd` instance consumes up to 4 RPC threads.
If you run other services on top of the same `bitcoind` as well, you will want to allocate
more `-rpcthreads` as well;
4 per service is a reasonable worst-case, but you should consult the documentation of
that service, or ask its developers.

That's good to know! I actually have multiple lightningd's (20) running under a single bitcoind. Is there a maximum number of rpcthreads that a single bitcoind can handle?

ghost commented 4 years ago

@ZmnSCPxj The warning in the bitcoind debug.log suggests increasing it via -rpcworkqueue, while here you suggest increasing it via -rpcthreads. Maybe the difference should be explained, or maybe both should be set?

EDIT:

Found this in src/httpserver.h of Bitcoin Core (0.20.0):

static const int DEFAULT_HTTP_THREADS=4;
static const int DEFAULT_HTTP_WORKQUEUE=16;

I'm probably going to set the following in bitcoin.conf

rpcthreads=16
rpcworkqueue=64

so four times the defaults.

darosior commented 4 years ago

@d4amenace It depends on your machine; bitcoind will spawn as many std::thread threads as you tell it to.

@ZmnSCPxj: yeah, but I checked the FAQ since my last comment and there is already a section about it. Do you think it's still worth including in the README? Or maybe we should add a "please quickly skim through the FAQ before running" note :)

@sumBTC: bitcoind will spawn rpcthreads queues, each having a depth of rpcworkqueue. Regarding your config, I think you should also (and especially) check that electrumx (and the others) are not filling all the available space ("Oh! There is more space now, let's fill the queue with even more requests!"). Lightningd will never, ever use more than 4 threads.

ghost commented 4 years ago

@darosior Yeah, running electrumx is always challenging, especially during DDoS attacks. For example, electrumx gets at most 30% of my CPU (using cpulimit), or it would practically crash my server during DDoS attacks. Nevertheless, running a public Electrum server is important, also for Lightning, as many Lightning wallets need it. I will have a look into how electrumx handles threads.