ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Improvement: fail fast if the Bitcoin backend fails #3571

Closed (darosior closed this 4 years ago)

darosior commented 4 years ago

#3565 introduces a timeout so that we don't wait forever for the Bitcoin backend's response to the init message.

As an improvement, it would be nice to fail immediately if the plugin dies.

This could be a good first issue. In that case, some hints: the Bitcoin backend plugin signals its readiness by completing the handshake with lightningd, which it does by responding to the init message.

If the plugin dies for whatever reason (for example bcli cannot connect to bitcoind), the connection will be dropped and plugin_conn_finished will be called: https://github.com/ElementsProject/lightning/blob/ac5cba22434c8133934b86f1dcf4f79956341b32/lightningd/plugin.c#L435-L444

That function frees the plugin struct, so an (untested) idea would be to attach a custom destructor to Bitcoin backend plugins.
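For illustration, a minimal (untested) sketch of that idea, using ccan/tal as bundled in the repository. The struct and names below are simplified stand-ins, not lightningd's actual types; the point is just that freeing the plugin struct, as plugin_conn_finished does when the connection drops, would run the destructor and let us fail fast.

```c
/* Illustrative sketch only: a tal destructor attached to the Bitcoin
 * backend's plugin struct.  When the struct is freed (which happens in
 * plugin_conn_finished() once the connection is lost), the destructor
 * runs and we can abort instead of limping on without a backend. */
#include <ccan/tal/tal.h>
#include <stdio.h>
#include <stdlib.h>

struct plugin {
	const char *cmd;		/* path of the plugin binary (simplified) */
};

static void destroy_bitcoin_plugin(struct plugin *p)
{
	fprintf(stderr, "Bitcoin backend plugin %s died, exiting\n", p->cmd);
	exit(1);			/* lightningd would call fatal() here */
}

int main(void)
{
	struct plugin *p = tal(NULL, struct plugin);
	p->cmd = "bcli";
	/* Only the Bitcoin backend plugin would get this destructor. */
	tal_add_destructor(p, destroy_bitcoin_plugin);

	tal_free(p);			/* stands in for plugin_conn_finished() */
	return 0;
}
```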

ZmnSCPxj commented 4 years ago

Why not fold this into #3213? Make bcli (and all our other builtin plugins) a "crucial" plugin: if it is not loaded and initialized properly, we just die.

Then, to replace any builtin plugin, we make it non-crucial, suppress loading it, and add a new crucial plugin, e.g. custom-bcli: --non-crucial-plugin=bcli --plugin-suppress=bcli --crucial-plugin=custom-bcli. This would work for all plugins, e.g. pay, autoclean, and so on.

whitslack commented 4 years ago

A couple of weeks ago, I discovered that my Lightning node had not been interacting with the blockchain for over a day because bcli had failed. The only way I noticed was that I had a channel that had been in CHANNELD_AWAITING_LOCKIN state for hours even though the peer was connected and the funding transaction was many blocks deep in the chain. When I examined the process tree, I noticed that bcli was in the defunct/zombie state. If this condition had continued for longer than my their_to_self_delay in any of my channels, my peers could have successfully cheated me. Obviously no good! So I appreciate the move toward making bcli a "crucial" plugin, without which lightningd will terminate.

darosior commented 4 years ago

Some logs would be very welcome, as we should die on the next request (every 30s at most) if the connection to bcli is dropped. So bcli was in a zombie state but still kept its connection open?

whitslack commented 4 years ago

bcli had been in zombie state (process exited but exit status not reaped by parent) for over 24 hours. I don't know why lightningd didn't notice, but the net effect was that nothing in my Lightning node that needs blockchain interaction (e.g., opening and closing channels) was making any progress. I assume if a funding transaction were to have been spent during this time (i.e., peer initiated unilateral close), my node would not have noticed it.

If it ever happens again, I'll have logs to submit, as I'm now running lightningd with debug logging enabled.

whitslack commented 4 years ago

It happened again. Here's an abbreviated process tree.

  PID USER     GROUP    START     TIME   RSS  NI NLWP COMMAND
    1 root     root     Apr07 00:00:23    20   0    1 init [3]
 2948 lightni+ lightni+ Apr07 11:23:10 613580  0    1   /usr/bin/lightningd --conf=/etc/lightning/lightningd.conf --lightning-dir=/var/lib/lightning --log-file=/var/log/lightningd.log --pid-file=/var/lib/lightning/lightningd.pid --rpc-file=/var/lib/lightning/lightningd.rpc --daemon
 2949 lightni+ lightni+ Apr07 00:00:00     0   0    1     /usr/bin/../libexec/c-lightning/plugins/pay
 2950 lightni+ lightni+ Apr07 00:17:22     0   0    1     [bcli] <defunct>
 2951 lightni+ lightni+ Apr07 00:00:00     0   0    1     /usr/bin/../libexec/c-lightning/plugins/autoclean
 2952 lightni+ lightni+ Apr07 00:00:00     0   0    1     /usr/bin/../libexec/c-lightning/plugins/fundchannel
 2966 lightni+ lightni+ Apr07 01:08:12 83132   0    1     /usr/libexec/c-lightning/lightning_hsmd
 2971 lightni+ lightni+ Apr07 00:20:30  1856   0    1     /usr/libexec/c-lightning/lightning_connectd
 2972 lightni+ lightni+ Apr07 12:14:52 143584  0    1     /usr/libexec/c-lightning/lightning_gossipd

(I have hundreds of lightning_channeld and lightning_openingd processes, which I filtered out of the process tree shown above.)

Of note is the bcli process, which has gone defunct. The last block that lightningd knows about is 627799, which was mined 12 hours ago. My log shows when lightningd learned of this block:

2020-04-27T05:01:46.444Z DEBUG lightningd: Adding block 627799: 0000000000000000000bf07b961efe8779ccdd87454f2eeb20ef1baad92a2072

That's the last Adding block message my log contains. I do have many subsequent messages like this one:

2020-04-27T06:13:30.925Z DEBUG 032c4b954f0f171b694b5e8e8323589e54196b48cf2efc27692513a360cb11d76f-gossipd: Ignoring future channel_announcment for 627803x1839x1 (current block 627799)

There are no messages containing bcli after Adding block 627799.

My bitcoind process has been running uninterrupted the entire time and is still responding to RPC requests as I write this report. My kernel log shows that none of C-Lightning's processes were killed by the OOM killer. The OOM killer did kill some Chrome processes starting at 01:26, but that was well after block 627800 was mined at 01:17, and bcli crashed before block 627800, so I think that's unrelated.

I do see exactly two groups of memory-related messages like this:

2020-04-27T06:32:38.176Z UNUSUAL lightningd: subd lightning_channeld failed: Cannot allocate memory
2020-04-27T06:32:38.177Z UNUSUAL 03b1be68b8f564fe53f5456cf4bec901ba968fb72ed7291279ab0c87e0d22f1f49-chan#41206: Could not subdaemon channel: Cannot allocate memory
2020-04-27T06:32:38.177Z INFO 03b1be68b8f564fe53f5456cf4bec901ba968fb72ed7291279ab0c87e0d22f1f49-chan#41206: Peer transient failure in CHANNELD_NORMAL: Failed to subdaemon channel

So clearly my system was running low on available memory this morning. (I was compiling Chromium overnight, which is what triggered the OOM killer on my chrome processes. The build completed successfully.)

Maybe it's possible that the bcli plugin process exited due to a failure to allocate memory but didn't log any message about it? In any case, lightningd seems completely unaware that bcli has exited.

I am going to restart my C-Lightning now, but I can grep my log for anything you want.

darosior commented 4 years ago

Could you grep for bcli, please?

darosior commented 4 years ago

However this happened, I think we should register SIGCHLD and die if bcli dies; I will submit a patch.
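As a reference point, here is a minimal standalone sketch of that idea (not the actual patch): reap children on SIGCHLD so none is left defunct, and exit if the child that died was the Bitcoin backend. The bcli_pid global is a hypothetical stand-in for however lightningd tracks the backend plugin's pid.

```c
/* Illustrative sketch: register SIGCHLD, reap exited children so none
 * stays defunct/zombie, and fail fast if the dead child was bcli. */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t bcli_pid;			/* hypothetical: set at plugin startup */

static void sigchld_handler(int sig)
{
	pid_t pid;
	int status;

	(void)sig;
	while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
		if (pid == bcli_pid)
			_exit(1);	/* Bitcoin backend died: fail fast */
	}
}

int main(void)
{
	struct sigaction sa = { .sa_handler = sigchld_handler };

	sigemptyset(&sa.sa_mask);
	sigaction(SIGCHLD, &sa, NULL);
	/* ... start plugins, record bcli_pid, run the main loop ... */
	return 0;
}
```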

whitslack commented 4 years ago

Could you grep for bcli, please?

@darosior: Here's a grep of the log from midnight (well before the failure) until I restarted C-Lightning.

# grep -a '^2020-04-27T.*bcli' /var/log/lightningd.log | sed -n '0,/started/p'
2020-04-27T00:00:40.297Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627774 (23468 ms)
2020-04-27T00:17:00.390Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627775 (25481 ms)
2020-04-27T00:18:41.818Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627775 (11355 ms)
2020-04-27T00:23:13.474Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627776 (40261 ms)
2020-04-27T00:39:19.749Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627777 (15651 ms)
2020-04-27T00:41:41.165Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (11148 ms)
2020-04-27T00:43:28.371Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (17144 ms)
2020-04-27T00:44:21.677Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627778 (23288 ms)
2020-04-27T00:48:59.666Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (14634 ms)
2020-04-27T00:57:55.708Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (13870 ms)
2020-04-27T01:03:26.656Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627779 (22420 ms)
2020-04-27T01:10:37.584Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblock 0000000000000000000ca523cfa0b8690e06e78e9a02287c7910cdfdc4bda443 0 (13861 ms)
2020-04-27T01:44:23.676Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627783 (29478 ms)
2020-04-27T01:47:51.789Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627784 (14816 ms)
2020-04-27T01:47:55.669Z DEBUG plugin-bcli: Log pruned 241020 entries (mem 104877937 -> 44367821)
2020-04-27T03:23:18.143Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblock 000000000000000000020c4438cfdd411f6a68815047be31610ee063516fe755 0 (18891 ms)
2020-04-27T03:43:07.144Z UNUSUAL plugin-bcli: bitcoin-cli: finished bitcoin-cli -datadir=/var/lib/bitcoind getblockhash 627792 (29436 ms)
2020-04-27T04:31:45.853Z DEBUG plugin-bcli: Log pruned 250656 entries (mem 104910406 -> 49111736)
2020-04-27T04:35:21.728Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:21.847Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.016Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.029Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:35:22.040Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T04:46:58.423Z DEBUG plugin-bcli: sendrawtx exit 0 (bitcoin-cli -datadir=/var/lib/bitcoind sendrawtransaction ##redacted##)
2020-04-27T18:08:58.657Z DEBUG plugin-manager: started(10206) /usr/bin/../libexec/c-lightning/plugins/bcli

whitslack commented 4 years ago

However this happened, I think we should register SIGCHLD and die if bcli dies; I will submit a patch.

@darosior: How do you not get a SIGPIPE when you try to write to the bcli process which has already terminated?
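(Whether lightningd ignores SIGPIPE isn't established in this thread, but here is a small standalone demo of the mechanism being asked about: writing to a pipe whose reader has exited raises SIGPIPE; if the writer ignores SIGPIPE, the write instead fails with EPIPE, which goes unnoticed if the return value isn't checked.)

```c
/* Standalone demo: write to a pipe after the reading process has exited.
 * With SIGPIPE ignored, write() returns -1 with errno == EPIPE instead of
 * killing the writer. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fds[2];
	pid_t pid;
	ssize_t n;

	if (pipe(fds) != 0)
		return 1;

	pid = fork();
	if (pid == 0) {			/* "plugin": exit without reading */
		close(fds[0]);
		close(fds[1]);
		_exit(0);
	}
	close(fds[0]);
	waitpid(pid, NULL, 0);		/* make sure the reader is gone */

	signal(SIGPIPE, SIG_IGN);	/* without this, the write kills us */
	n = write(fds[1], "request\n", 8);
	printf("write returned %zd (%s)\n", n, n < 0 ? strerror(errno) : "ok");
	return 0;
}
```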

darosior commented 4 years ago

Thanks for the logs. That's what puzzles me: we should have tried to send a request to bcli and died long before (we poll the feerate every 30 seconds!). Could you send me your whole logs? You can send them to me by mail if you don't want to reveal them to the world. If you'd prefer not to, could you grep -i for feerate?

whitslack commented 4 years ago

Could you send me your whole logs?

@darosior: Logs, compressed with xz, encrypted using GPG with a symmetric cipher (passphrase emailed to you), then encoded using base64 because GitHub doesn't allow attachments of arbitrary content type: lightningd-redacted.log.xz.gpg.base64.txt

Edit: Just realized I could probably have obtained your GPG public key and used that instead of a shared secret. Oops.

darosior commented 4 years ago

OK, so we do indeed seem to have stopped polling for feerates just before bcli died. My guess is we never restarted the polling timers because we never got a response from bcli. I still fail to understand how all the commands could have been written to bcli successfully, and why bcli got killed (logs pruned).

Can you try with the latest release and log-level=io? (Sorry to have no better recommendation.) I think https://github.com/ElementsProject/lightning/pull/3675 fixes this; you could try it or wait for the upcoming rc (log-level=io would still help, though, if we need to re-debug why bcli is failing).

darosior commented 4 years ago

Why not fold this into #3213? Make bcli (and all our other builtin plugins) a "crucial" plugin: if it is not loaded and initialized properly, we just die.

This is already the case for bcli (and I thought the reverse was true too, that we would also die if bcli died later, but apparently not!).

whitslack commented 4 years ago

OK, so we do indeed seem to have stopped polling for feerates just before bcli died. My guess is we never restarted the polling timers because we never got a response from bcli. I still fail to understand how all the commands could have been written to bcli successfully, and why bcli got killed (logs pruned).

What would have happened if this sequence of events had occurred?:

  1. lightningd writes feerates command to bcli.
  2. bcli reads command from lightningd.
  3. bcli attempts to fork a child process to exec bitcoin-cli, but fork fails due to resource exhaustion.

Does bcli die in this case? Probably it should return an error to lightningd instead.
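As a rough sketch of that suggestion (illustrative only, not bcli's actual code): if fork() fails under resource exhaustion, report the failure so the caller can turn it into an error response to lightningd instead of dying.

```c
/* Illustrative: run bitcoin-cli in a child, or return an error that the
 * caller can translate into a JSON-RPC error response to lightningd. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t run_bitcoin_cli_getblockcount(void)
{
	pid_t pid = fork();

	if (pid < 0) {
		/* EAGAIN/ENOMEM under memory pressure: report the failure
		 * instead of dying, so lightningd can decide to retry. */
		fprintf(stderr, "could not fork bitcoin-cli: %s\n",
			strerror(errno));
		return -1;
	}
	if (pid == 0) {
		execlp("bitcoin-cli", "bitcoin-cli", "getblockcount",
		       (char *)NULL);
		_exit(127);		/* exec failed */
	}
	return pid;			/* the parent reads and reaps elsewhere */
}

int main(void)
{
	pid_t pid = run_bitcoin_cli_getblockcount();

	if (pid < 0)
		return 1;		/* here: answer with an error instead */
	waitpid(pid, NULL, 0);
	return 0;
}
```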

If lightningd never got a response to its feerates command, would the polling timer fire again, or is the timer only reset when a response is read?

~Can you try with the latest release and log-level=io? (Sorry to have no better recommendation.)~ I think #3675 fixes this; you could try it or wait for the upcoming rc (log-level=io would still help, though, if we need to re-debug why bcli is failing).

I'm not keen to restart my C-Lightning more often than necessary, as it is a high-volume production node and I'd prefer to keep service disruptions to a minimum. I also prefer not to run release candidates for the same reason. I am leery of increasing the log-level even further, as I am already generating gigabytes of logs in a short time at the debug level. Also, the failure mode is rare and hard to reproduce, as it seems to occur under resource exhaustion, so I might have to run for a very long time before it happens again. It might be more fruitful to examine the possible scenario I outlined above (failing fork).

darosior commented 4 years ago

What would have happened if this sequence of events had occurred?:

It dies, and lightningd would too at the next request it tries to send to bcli. The thing is, the timers for the next polling requests (namely getblock and estimatefees) are set in the "success" callback of the previous call, so I presume you ran into a deadlock. https://github.com/ElementsProject/lightning/pull/3675 fixes this by making lightningd crash as soon as the connection to bcli (or any other backend) is lost.
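To make that concrete, here is a self-contained toy model of the pattern described above (not lightningd's actual code; all names are illustrative): the next poll is armed only from the success callback of the previous request, so a request that never gets an answer stops the polling loop for good.

```c
#include <stdbool.h>
#include <stdio.h>

static bool backend_alive = true;	/* flips to false to model bcli dying */
static bool poll_armed = true;		/* the initial poll is armed at startup */

static void feerate_success_cb(void)
{
	printf("got feerates, re-arming the 30s poll timer\n");
	poll_armed = true;		/* the ONLY place the timer is re-armed */
}

static void send_feerate_request(void)
{
	if (backend_alive)
		feerate_success_cb();	/* a response arrives, loop continues */
	else
		printf("request written to bcli, but no response will come\n");
	/* no response => feerate_success_cb() never runs => no new timer */
}

int main(void)
{
	for (int tick = 0; tick < 5; tick++) {
		if (tick == 2)
			backend_alive = false;	/* bcli goes defunct here */
		if (poll_armed) {
			poll_armed = false;
			send_feerate_request();
		} else {
			printf("tick %d: no poll armed, fee and chain updates stall\n",
			       tick);
		}
	}
	return 0;
}
```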

I'm not keen to restart my C-Lightning more often than necessary, as it is a high-volume production node and I'd prefer to keep service disruptions to a minimum.

I understand. I hope you won't have another resource exhaustion before next week :-)

whitslack commented 4 years ago

What would have happened if this sequence of events had occurred?:

It dies, and lightningd would too at the next request it tries to send to bcli.

When would the next request occur if the polling timer has not been re-armed? Would it occur if lightningd needs to broadcast a transaction? How does lightningd learn of new blocks? Is that also performed by the same polling timer?

whitslack commented 4 years ago

I'm not keen to restart my C-Lightning more often than necessary

By the way, it'd be really cool if the log level could be changed on the fly. :wink:

darosior commented 4 years ago

When would the next request occur if the polling timer has not been re-armed?

Never.

Would it occur if lightningd needs to broadcast a transaction?

Ah, yes. I meant "never for the fees and block chain updates"

How does lightningd learn of new blocks?

Polling with getblockcount

Is that also performed by the same polling timer?

One for the fees, one for the topology (chain)

By the way, it'd be really cool if the log level could be changed on the fly. :wink:

Yeah! I've personally wished for this with my own nodes too :-)

whitslack commented 4 years ago

One for the fees, one for the topology (chain)

So I think in my case, both polling timers must have been affected, presumably because bcli's calls to both estimatesmartfee and getblockcount failed to fork a bitcoin-cli child process.

Thanks for the insight.

darosior commented 4 years ago

Yeah, it's this strange, perfect synchronicity that confused me. In addition, it happened two times in a row! Maybe we should stagger the timers' start times, but still: I didn't expect them to be synchronized after some time running.

whitslack commented 4 years ago

@darosior: Maybe you should log (at the DEBUG level) when the timers fire.

darosior commented 4 years ago

Hmm, I thought about adding this to the PR, but it seemed overkill: we already log in the callback, and we won't see this behaviour again now that we die as soon as bcli dies.


darosior commented 4 years ago

I think we can close this; it was solved by https://github.com/ElementsProject/lightning/pull/3675.