Why not fold this into #3213? Make `bcli` (and all our other builtin plugins) a "crucial" plugin which, if it is not loaded and initialized properly, causes us to just die. Then, to replace any builtin plugin, we make it non-crucial, suppress loading it, and make a new crucial plugin, e.g. `custom-bcli`: `--non-crucial-plugin=bcli --plugin-suppress=bcli --crucial-plugin=custom-bcli`. This would work for all plugins, e.g. `pay`, `autoclean`, ...
A couple of weeks ago, I discovered that my Lightning node had not been interacting with the blockchain for over a day because `bcli` had failed. The only way I noticed was that I had a channel that had been in the `CHANNELD_AWAITING_LOCKIN` state for hours even though the peer was connected and the funding transaction was many blocks deep in the chain. When I examined the process tree, I noticed that `bcli` was in the defunct/zombie state. If this condition had continued for longer than the `their_to_self_delay` in any of my channels, my peers could have successfully cheated me. Obviously no good! So I appreciate the move toward making `bcli` a "crucial" plugin, without which `lightningd` will terminate.
Some logs would be very welcome, as we would die on the next request (every 30s at most) if the connection to `bcli` is dropped. So `bcli` was in the zombie state but still kept its connection open?
`bcli` had been in the zombie state (process exited but exit status not reaped by parent) for over 24 hours. I don't know why `lightningd` didn't notice, but the net effect was that nothing in my Lightning node that needs blockchain interaction (e.g., opening and closing channels) was making any progress. I assume that if a funding transaction had been spent during this time (i.e., a peer initiated a unilateral close), my node would not have noticed it.

If it ever happens again, I'll have logs to submit, as I'm now running `lightningd` with debug logging enabled.
It happened again. Here's an abbreviated process tree.
```
PID USER GROUP START TIME RSS NI NLWP COMMAND
1 root root Apr07 00:00:23 20 0 1 init [3]
2948 lightni+ lightni+ Apr07 11:23:10 613580 0 1 /usr/bin/lightningd --conf=/etc/lightning/lightningd.conf --lightning-dir=/var/lib/lightning --log-file=/var/log/lightningd.log --pid-file=/var/lib/lightning/lightningd.pid --rpc-file=/var/lib/lightning/lightningd.rpc --daemon
2949 lightni+ lightni+ Apr07 00:00:00 0 0 1 /usr/bin/../libexec/c-lightning/plugins/pay
2950 lightni+ lightni+ Apr07 00:17:22 0 0 1 [bcli] <defunct>
2951 lightni+ lightni+ Apr07 00:00:00 0 0 1 /usr/bin/../libexec/c-lightning/plugins/autoclean
2952 lightni+ lightni+ Apr07 00:00:00 0 0 1 /usr/bin/../libexec/c-lightning/plugins/fundchannel
2966 lightni+ lightni+ Apr07 01:08:12 83132 0 1 /usr/libexec/c-lightning/lightning_hsmd
2971 lightni+ lightni+ Apr07 00:20:30 1856 0 1 /usr/libexec/c-lightning/lightning_connectd
2972 lightni+ lightni+ Apr07 12:14:52 143584 0 1 /usr/libexec/c-lightning/lightning_gossipd
```

(I have hundreds of `lightning_channeld` and `lightning_openingd` processes, which I filtered out of the process tree shown above.)
Of note is the `bcli` process, which has gone defunct. The last block that `lightningd` knows about is 627799, which was mined 12 hours ago. My log shows when `lightningd` learned of this block:

```
2020-04-27T05:01:46.444Z DEBUG lightningd: Adding block 627799: 0000000000000000000bf07b961efe8779ccdd87454f2eeb20ef1baad92a2072
```

That's the last `Adding block` message my log contains. I do have many subsequent messages like this one:

```
2020-04-27T06:13:30.925Z DEBUG 032c4b954f0f171b694b5e8e8323589e54196b48cf2efc27692513a360cb11d76f-gossipd: Ignoring future channel_announcment for 627803x1839x1 (current block 627799)
```

There are no messages containing `bcli` after `Adding block 627799`.
My `bitcoind` process has been running uninterrupted the entire time and is still responding to RPC requests as I write this report. My kernel log shows that none of C-Lightning's processes were killed by the OOM killer. The OOM killer did kill some Chrome processes starting at 01:26, but that was well after block 627800 was mined at 01:17, and `bcli` crashed before block 627800, so I think that's unrelated.
I do see exactly two groups of memory-related messages like this:

```
2020-04-27T06:32:38.176Z UNUSUAL lightningd: subd lightning_channeld failed: Cannot allocate memory
2020-04-27T06:32:38.177Z UNUSUAL 03b1be68b8f564fe53f5456cf4bec901ba968fb72ed7291279ab0c87e0d22f1f49-chan#41206: Could not subdaemon channel: Cannot allocate memory
2020-04-27T06:32:38.177Z INFO 03b1be68b8f564fe53f5456cf4bec901ba968fb72ed7291279ab0c87e0d22f1f49-chan#41206: Peer transient failure in CHANNELD_NORMAL: Failed to subdaemon channel
```

So clearly my system was running low on available memory this morning. (I was compiling Chromium overnight, which is what triggered the OOM killer on my `chrome` processes. The build completed successfully.)
Is it possible that the `bcli` plugin process exited due to a failure to allocate memory but didn't log any message about it? In any case, `lightningd` seems completely unaware that `bcli` has exited.
I am going to restart my C-Lightning now, but I can grep my log for anything you want.
Could you grep for `bcli`, please?

However this happened, I think we should register `SIGCHLD` and die if `bcli` dies; I'll submit a patch.
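For illustration only, here is a generic POSIX sketch of what "register `SIGCHLD` and die" could look like; this is not the eventual patch, and names like `bcli_pid` are invented. The handler reaps exited children (so they don't linger as zombies) and exits hard if the critical plugin is among them, using only async-signal-safe calls.

```c
#include <signal.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t bcli_pid;	/* recorded when the plugin is spawned */

static void on_sigchld(int sig)
{
	int status;
	pid_t pid;
	(void)sig;

	/* Reap every exited child; die if the critical one is among them. */
	while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
		if (pid == bcli_pid) {
			static const char msg[] = "bcli died, exiting\n";
			(void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
			_exit(1);
		}
	}
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigchld;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	sigaction(SIGCHLD, &sa, NULL);

	/* ... spawn the plugin, record bcli_pid, run the main loop ... */
	return 0;
}
```

(The fix that actually landed took a different route, detecting the dropped connection rather than the child exit, but the goal is the same: fail loudly instead of leaving a zombie.)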
> Could you grep for `bcli`, please?
@darosior: Here's a grep of the log from midnight (well before the failure) until I restarted C-Lightning.
```
# grep -a '^2020-04-27T.*bcli' /var/log/lightningd.log | sed -n '0,/started/p'
2020-04-27T00:00:40.297Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627774 (23468 ms)
2020-04-27T00:17:00.390Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627775 (25481 ms)
2020-04-27T00:18:41.818Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627775 (11355 ms)
2020-04-27T00:23:13.474Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627776 (40261 ms)
2020-04-27T00:39:19.749Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627777 (15651 ms)
2020-04-27T00:41:41.165Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (11148 ms)
2020-04-27T00:43:28.371Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (17144 ms)
2020-04-27T00:44:21.677Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (23288 ms)
2020-04-27T00:48:59.666Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (14634 ms)
2020-04-27T00:57:55.708Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (13870 ms)
2020-04-27T01:03:26.656Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (22420 ms)
2020-04-27T01:10:37.584Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblock 0000000000000000000ca523cfa0b8690e06e78e9a02287c7910cdfdc4bda443 0 (13861 ms)
2020-04-27T01:44:23.676Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627783 (29478 ms)
2020-04-27T01:47:51.789Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627784 (14816 ms)
2020-04-27T01:47:55.669Z DEBUG plugin-bcli: Log pruned 241020 entries (mem 104877937 -> 44367821)
2020-04-27T03:23:18.143Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblock 000000000000000000020c4438cfdd411f6a68815047be31610ee063516fe755 0 (18891 ms)
2020-04-27T03:43:07.144Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627792 (29436 ms)
2020-04-27T04:31:45.853Z DEBUG plugin-bcli: Log pruned 250656 entries (mem 104910406 -> 49111736)
2020-04-27T04:35:21.728Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:21.847Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.016Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.029Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.040Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:46:58.423Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T18:08:58.657Z DEBUG plugin-manager: started(10206) /usr/bin/../libexec/c-lightning/plugins/bcli
```
> However this happened, I think we should register `SIGCHLD` and die if `bcli` dies; I'll submit a patch.
@darosior: How do you not get a `SIGPIPE` when you try to write to the `bcli` process, which has already terminated?
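For context, a minimal self-contained sketch of the underlying POSIX behaviour (this is not lightningd code): writing to a pipe whose reader has gone away raises `SIGPIPE` by default, but if the writer ignores `SIGPIPE` (a common choice for daemons; I am assuming, not asserting, that `lightningd` does this) the `write()` instead fails with `EPIPE`, and only when the next write is actually attempted.

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fds[2];

	if (pipe(fds) != 0)
		return 1;

	/* Daemons commonly ignore SIGPIPE; with the default handler the
	 * write below would instead kill this process with SIGPIPE. */
	signal(SIGPIPE, SIG_IGN);

	/* Simulate the reader (the plugin) having died: close the read end. */
	close(fds[0]);

	if (write(fds[1], "hello", 5) < 0)
		printf("write failed: %s\n", strerror(errno)); /* EPIPE */

	return 0;
}
```

So a dead reader only becomes visible when something is actually written to it, which would match `bcli`'s death going unnoticed while nothing was being sent.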
Thanks for the logs. That's what puzzles me: we should have tried to send a request to `bcli` and died long before (we poll the feerate every 30 seconds!).

Could you send me your whole logs? You can send them to me by mail if you don't want to reveal them to the world. If you prefer not to, could you `grep -i` for `feerate`?
> Could you send me your whole logs?
@darosior: Logs, compressed with xz, encrypted using GPG with a symmetric cipher (passphrase emailed to you), then encoded using base64 because GitHub doesn't allow attachments of arbitrary content type: lightningd-redacted.log.xz.gpg.base64.txt
Edit: Just realized I could probably have obtained your GPG public key and used that instead of a shared secret. Oops.
Ok, so we effectively seem to have stopped polling for feerates just before `bcli` died. My guess is we never restarted the polling timers because we never got a response from `bcli`. I still fail to understand how all the commands could have been written to `bcli` successfully, and why `bcli` got killed (logs pruned).
~~Can you try with the latest release and `log-level=io`? (Sorry to have no better recommendation..)~~ I think https://github.com/ElementsProject/lightning/pull/3675 fixes this, if you can try it or wait for the upcoming rc (`log-level=io` would still help though if we need to re-debug why `bcli` is failing).
> Why not fold this into #3213? Make `bcli` (and all our other builtin plugins) a "crucial" plugin which, if it is not loaded and initialized properly, we just die.

This is already the case for `bcli` (and I thought the other way around was true too, that we would die if `bcli` died, but apparently not!).
> Ok, so we effectively seem to have stopped polling for feerates just before `bcli` died. My guess is we never restarted the polling timers because we never got a response from `bcli`. I still fail to understand how all the commands could have been written to `bcli` successfully, and why `bcli` got killed (logs pruned).
What would have happened if this sequence of events had occurred?

1. `lightningd` writes the feerates command to `bcli`.
2. `bcli` reads the command from `lightningd`.
3. `bcli` attempts to fork a child process to exec `bitcoin-cli`, but the fork fails due to resource exhaustion.

Does `bcli` die in this case? Probably it should return an error to `lightningd` instead (see the sketch below).

If `lightningd` never got a response to its feerates command, would the polling timer fire again, or is the timer only reset when a response is read?
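To make step 3 concrete, here is a self-contained sketch of the decision point (this is not the actual `bcli` code, and `run_bitcoin_cli` is a made-up helper): the interesting choice is what to do when `fork()` itself fails under memory pressure, since exiting silently leaves `lightningd` waiting forever on a request that will never be answered.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 0 on success, -1 if the child could not be started or failed. */
static int run_bitcoin_cli(char *const argv[])
{
	pid_t pid = fork();
	int status;

	if (pid < 0) {
		/* Resource exhaustion (EAGAIN/ENOMEM): do NOT exit here.
		 * Surface the error to the caller so it can become a
		 * JSON-RPC error response or a retry, keeping the failure
		 * visible to lightningd. */
		fprintf(stderr, "fork failed: %s\n", strerror(errno));
		return -1;
	}
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* exec failed in the child */
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Whether the real plugin reported or swallowed a `fork()` failure under memory pressure is exactly the open question here.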
> ~~Can you try with the latest release and `log-level=io`? (Sorry to have no better recommendation..)~~ I think #3675 fixes this, if you can try it or wait for the upcoming rc (`log-level=io` would still help though if we need to re-debug why `bcli` is failing).
I'm not keen to restart my C-Lightning more often than necessary, as it is a high-volume production node and I'd prefer to keep service disruptions to a minimum. I also prefer not to run release candidates for the same reason. I am leery of increasing the `log-level` even further, as I am already generating gigabytes of logs in a short time at the `debug` level. Also, the failure mode is rare and hard to reproduce, as it seems to occur under resource exhaustion, so I might have to run for a very long time before it happens again. It might be more fruitful to examine the possible scenario I outlined above (failing fork).
> What would have happened if this sequence of events had occurred?

It dies, and `lightningd` would too at the next request it tries to send to `bcli`. The thing is, the timers to send the next polling request (namely getblock or estimatefees) are set in the "success" callback of the previous call, so I presume you ran into a deadlock. https://github.com/ElementsProject/lightning/pull/3675 fixes this by making `lightningd` crash as soon as the connection to `bcli` (or any other backend) is lost.
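To make that deadlock concrete, a toy, self-contained illustration of the pattern (not the actual lightningd timer code; every name here is invented): the timer is only re-armed from the success callback, so a backend that stops answering silently stops all future polls instead of failing loudly.

```c
#include <stdbool.h>
#include <stdio.h>

static bool backend_alive = true;   /* flip to false to simulate a dead bcli */
static bool timer_armed = true;     /* the initial poll is scheduled */

/* Success callback: the ONLY place the timer is re-armed. */
static void on_poll_success(int blockcount)
{
	printf("blockcount = %d, re-arming 30s timer\n", blockcount);
	timer_armed = true;
}

/* One polling attempt: ask the backend, invoke the callback on success. */
static void poll_backend(void)
{
	if (backend_alive)
		on_poll_success(627799);
	/* If the backend is dead, the callback simply never runs. */
}

int main(void)
{
	for (int tick = 0; tick < 5; tick++) {
		if (!timer_armed) {
			printf("tick %d: timer never re-armed, polling has stopped\n", tick);
			continue;
		}
		timer_armed = false;
		poll_backend();
		if (tick == 1)
			backend_alive = false;	/* backend dies after two polls */
	}
	return 0;
}
```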
> I'm not keen to restart my C-Lightning more often than necessary, as it is a high-volume production node and I'd prefer to keep service disruptions to a minimum.
I understand. I hope you won't have another resource exhaustion before next week :-)
> > What would have happened if this sequence of events had occurred?
>
> It dies, and `lightningd` would too at the next request it tries to send to `bcli`.
When would the next request occur if the polling timer has not been re-armed? Would it occur if `lightningd` needs to broadcast a transaction? How does `lightningd` learn of new blocks? Is that also performed by the same polling timer?
> I'm not keen to restart my C-Lightning more often than necessary

By the way, it'd be really cool if the log level could be changed on the fly. :wink:
> When would the next request occur if the polling timer has not been re-armed?

Never..

> Would it occur if `lightningd` needs to broadcast a transaction?

Ah, yes. I meant "never for the fees and blockchain updates".

> How does `lightningd` learn of new blocks?

By polling with `getblockcount`.

> Is that also performed by the same polling timer?

One timer for the fees, one for the topology (chain).

> By the way, it'd be really cool if the log level could be changed on the fly. :wink:

Yeah! I have personally wished for this on my nodes too :-)
> One timer for the fees, one for the topology (chain).

So I think in my case, both polling timers must have been affected, presumably because `bcli`'s calls to both `estimatesmartfee` and `getblockcount` failed to fork a `bitcoin-cli` child process.

Thanks for the insight.
Yeah, it's that strange perfect synchronicity that confused me. In addition, this happened two times in a row! Maybe we should stagger the timers' start times, but still: I didn't expect them to be synchronous after some time running.
@darosior: Maybe you should log (at the `DEBUG` level) when the timers fire.
Hmm, I thought about adding this to the PR, but it seemed overkill: we already log in the callback, and we won't see this behaviour again since we now die as soon as `bcli` dies.
I think we can close this; it was solved by https://github.com/ElementsProject/lightning/pull/3675.
#3565 introduces a timeout so that we don't wait forever for the Bitcoin backend's response to the `init` message. As an improvement, it would be nice to fail directly if the plugin dies.

It could be a good first issue? In that case, some hints: the Bitcoin backend plugin signals its readiness by completing the handshake with `lightningd`; this handshake is completed by responding to the `init` message. If the plugin dies for whatever reason (for example, `bcli` cannot connect to `bitcoind`), the connection will be dropped and `plugin_conn_finished` will be called: https://github.com/ElementsProject/lightning/blob/ac5cba22434c8133934b86f1dcf4f79956341b32/lightningd/plugin.c#L435-L444

It frees the plugin struct, so maybe an (untested) idea would be to attach a custom destructor to Bitcoin plugins? A rough sketch follows.
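An untested sketch of that idea, assuming lightningd's internals (`struct plugin`, `tal_add_destructor()`, `fatal()`); the field and helper names here are illustrative, not a finished patch:

```c
#include <ccan/tal/tal.h>
#include <lightningd/log.h>
#include <lightningd/plugin.h>

/* Destructor attached to the Bitcoin backend plugin: if its struct plugin
 * is ever freed (e.g. by plugin_conn_finished after the connection dropped),
 * lightningd dies loudly instead of silently losing its chain backend. */
static void destroy_bitcoin_plugin(struct plugin *plugin)
{
	fatal("Bitcoin backend plugin %s died, cannot continue", plugin->cmd);
}

/* Call this once the backend plugin has answered the init message, i.e.
 * once the handshake is complete and we consider it essential. */
static void backend_plugin_mark_important(struct plugin *plugin)
{
	tal_add_destructor(plugin, destroy_bitcoin_plugin);
}
```

A real patch would also need to skip or remove the destructor on a clean shutdown, otherwise stopping `lightningd` normally would trip it too.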