ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.83k stars 899 forks source link

lightningd hanging, not opening RPC port, not syncing, reducing latest block by 15 #2687

Closed svewa closed 5 years ago

svewa commented 5 years ago

I have c-lightning running for over a year now, but some time maybe two month ago I noticed it stopped and would not properly start again. I tried updating to 0.7.0 and now to latest git, does not help.

config file:

log-level=io
network=bitcoin

starting lightningd:

ln@c-lightning:~$ lightning/lightningd/lightningd 
2019-05-28T22:08:01.236Z plugin-autoclean[OUT] 7b226a736f6e727063223a22322e30222c226964223a312c226d6574686f64223a226765746d616e6966657374222c22706172616d73223a7b7d7d0a0a
2019-05-28T22:08:01.236Z plugin-pay[OUT] 7b226a736f6e727063223a22322e30222c226964223a302c226d6574686f64223a226765746d616e6966657374222c22706172616d73223a7b7d7d0a0a
2019-05-28T22:08:01.297Z lightningd(3740): testing /home/ln/lightning/lightningd/lightning_channeld
2019-05-28T22:08:01.301Z lightningd(3740): testing /home/ln/lightning/lightningd/lightning_closingd
2019-05-28T22:08:01.305Z lightningd(3740): testing /home/ln/lightning/lightningd/lightning_connectd
2019-05-28T22:08:01.311Z lightningd(3740): testing /home/ln/lightning/lightningd/lightning_gossipd
2019-05-28T22:08:01.315Z lightningd(3740): testing /home/ln/lightning/lightningd/lightning_hsmd
2019-05-28T22:08:01.319Z lightningd(3740): testing /home/ln/lightning/lightningd/lightning_onchaind
2019-05-28T22:08:01.323Z lightningd(3740): testing /home/ln/lightning/lightningd/lightning_openingd
2019-05-28T22:08:01.475Z lightning_hsmd(3750): pid 3750, msgfd 17
2019-05-28T22:08:01.508Z lightning_connectd(3751): pid 3751, msgfd 20
2019-05-28T22:08:01.509Z lightning_hsmd(3750): Client: Received message 11 from client
2019-05-28T22:08:01.509Z lightning_hsmd(3750): Client: Received message 9 from client
2019-05-28T22:08:01.509Z lightning_hsmd(3750): new_client: 0
2019-05-28T22:08:01.725Z lightning_connectd(3751): Created IPv6 listener on port 9735
2019-05-28T22:08:01.725Z lightning_connectd(3751): Failed to connect 10 socket: Network is unreachable
2019-05-28T22:08:01.725Z lightning_connectd(3751): Created IPv4 listener on port 9735
2019-05-28T22:08:01.725Z lightning_connectd(3751): REPLY WIRE_CONNECTCTL_INIT_REPLY with 0 fds
2019-05-28T22:08:01.727Z lightning_gossipd(3752): pid 3752, msgfd 19
2019-05-28T22:08:01.762Z lightningd(3740): Loaded 0 HTLC signatures from DB
2019-05-28T22:08:01.766Z lightningd(3740): Loaded 0 HTLC signatures from DB
2019-05-28T22:08:01.772Z lightningd(3740): Loaded 0 HTLC signatures from DB
2019-05-28T22:08:01.779Z lightningd(3740): Loaded 0 HTLC signatures from DB
2019-05-28T22:08:01.785Z lightningd(3740): Loaded 0 HTLC signatures from DB
2019-05-28T22:08:01.789Z lightningd(3740): Loaded 0 HTLC signatures from DB
2019-05-28T22:08:01.792Z lightningd(3740): Loaded 6 channels from DB
2019-05-28T22:08:01.793Z lightningd(3740): Loading HTLCs for channel 14298
2019-05-28T22:08:01.794Z lightningd(3740): Restored 0 incoming and 0 outgoing HTLCS
2019-05-28T22:08:01.794Z lightningd(3740): Loading HTLCs for channel 15388
2019-05-28T22:08:01.795Z lightningd(3740): Restored 0 incoming and 0 outgoing HTLCS
2019-05-28T22:08:01.796Z lightningd(3740): Loading HTLCs for channel 15425
2019-05-28T22:08:01.797Z lightningd(3740): Restored 0 incoming and 0 outgoing HTLCS
2019-05-28T22:08:01.798Z lightningd(3740): Loading HTLCs for channel 18348
2019-05-28T22:08:01.799Z lightningd(3740): Restored 0 incoming and 0 outgoing HTLCS
2019-05-28T22:08:01.799Z lightningd(3740): Loading HTLCs for channel 21447
2019-05-28T22:08:01.800Z lightningd(3740): Restored 0 incoming and 0 outgoing HTLCS
2019-05-28T22:08:01.801Z lightningd(3740): Loading HTLCs for channel 23172
2019-05-28T22:08:01.801Z lightningd(3740): Restored 0 incoming and 0 outgoing HTLCS
2019-05-28T22:08:02.525Z lightning_hsmd(3750): Client: Received message 9 from client
2019-05-28T22:08:02.526Z lightning_hsmd(3750): new_client: 0
2019-05-28T22:08:02.526Z lightning_hsmd(3750): Client: Received message 10 from client
2019-05-28T22:08:02.527Z lightning_hsmd(3750): Client: Received message 10 from client
2019-05-28T22:08:02.527Z lightning_gossipd(3752): total store load time: 0 msec
2019-05-28T22:08:02.528Z lightning_hsmd(3750): Client: Received message 10 from client
2019-05-28T22:08:02.528Z lightning_gossipd(3752): gossip_store: Read 0/0/0/0 cannounce/cupdate/nannounce/cdelete from store in 1 bytes
2019-05-28T22:08:02.528Z lightning_hsmd(3750): Client: Received message 10 from client
2019-05-28T22:08:02.528Z lightning_hsmd(3750): Client: Received message 10 from client
2019-05-28T22:08:02.529Z lightning_hsmd(3750): Client: Received message 10 from client
2019-05-28T22:08:03.684Z lightningd(3740): bitcoin-cli estimatesmartfee 2 CONSERVATIVE exited with status 1
2019-05-28T22:08:07.783Z lightningd(3740): Adding block 556142: 000000000000000000173e47871cbacd921ab24034ba9de80e40af84d3d703d5
ZmnSCPxj commented 5 years ago

The last thing I am seeing on strace is that we are polling on fds 5, 7, 17, 20, and 19.

This poll is thus the main io-loop.

The issue is that topology should be scanning forward for new blocks continuously, but strace shows it has only scanned one block, then after reading it, topology updates the db, but suddenly we enter the main loop (which we should not, because, topology should have its own loop....?).

@cdecker there may be a race condition? The main io_loop blocks on poll, but the topology loop should be running?

ZmnSCPxj commented 5 years ago

I believe I have cracked this issue.

After starting up plugins and the core daemons (hsmd, gossipd, connectd), one of the things we do is setup_topology:

https://github.com/ElementsProject/lightning/blob/11412435ce0cae92205178db9a9f4b64d0d78830/lightningd/lightningd.c#L744-L747

As the comment says, it does an io_loop, which (in the absence of struct timers) will only exit if we execute a io_break.

Looking inside setup_topology, we see that io_loop is called without any struct timers (i.e. the NULL arguments):

https://github.com/ElementsProject/lightning/blob/11412435ce0cae92205178db9a9f4b64d0d78830/lightningd/chaintopology.c#L896-L897

Now, just prior to the above io_loop, we see that we initiate fee estimation:

https://github.com/ElementsProject/lightning/blob/11412435ce0cae92205178db9a9f4b64d0d78830/lightningd/chaintopology.c#L894

In particular, we only execute io_break if we get our first block and we get our first fee estimation. That's this code below:

https://github.com/ElementsProject/lightning/blob/11412435ce0cae92205178db9a9f4b64d0d78830/lightningd/chaintopology.c#L29-L38

And if we look at the logs:

2019-05-28T22:08:03.684Z lightningd(3740): bitcoin-cli estimatesmartfee 2 CONSERVATIVE exited with status 1

Looks like fee estimation from bitcoin-cli exited with status 1.

We search for this text exited with status and see:

https://github.com/ElementsProject/lightning/blob/11412435ce0cae92205178db9a9f4b64d0d78830/lightningd/bitcoind.c#L157-L159

And we scroll down a little and also see that, in case of such a failure, the bitcoind system will helpfully automatically retry the command 1 second later (in particular, it will not invoke the callback inside chaintopology yet):

https://github.com/ElementsProject/lightning/blob/11412435ce0cae92205178db9a9f4b64d0d78830/lightningd/bitcoind.c#L163-L165

Of course, it uses a timer..... which the setup_topology loop does not enable. Remember?

https://github.com/ElementsProject/lightning/blob/11412435ce0cae92205178db9a9f4b64d0d78830/lightningd/chaintopology.c#L896-L897

TLDR: If fee estimation completely fails, we retry 1 second later. Unfortunately, the initial setup_topology creates a io_loop that ignores timers. So we never retry fee estimation and just end up hanging.

@cdecker @rustyrussell it seems to me that the simplest solution is to write a io_loop_with_timers that handles timers, and merge the mainloop and the tiny io_loop in setup_topology to io_loop_with_timers. Is this safe? Why does the chaintopology io_loop ignore timers?

svewa commented 5 years ago

https://drop.geekbox.info/feeestimateerror.log

It seems the error was the whole time, that my bitcoind was 17.1 and did not support the 2nd paramter to the getsmartfeeestimate.... your patch made the verbose error message visible, but a check if the available bitcoind has the required features or minimum version would be helpful for sure, too.

ZmnSCPxj commented 5 years ago

@svewa okay, I shall open a separate new issue.

ZmnSCPxj commented 5 years ago

2697

darosior commented 5 years ago

It seems the error was the whole time, that my bitcoind was 17.1 and did not support the 2nd paramter to the getsmartfeeestimate.... your patch made the verbose error message visible, but a check if the available bitcoind has the required features or minimum version would be helpful for sure, too.

No, you get that because bitcoind could not estimatesmartfee (not because of a bad JSONRPC arg), I had this issue before I don't remember how I solved it. I'll search and keep you updated.

darosior commented 5 years ago

Did you change your datetime (had a powercut or something) recently ? What are the bitcoin logs on an estimatesmartfee ?

svewa commented 5 years ago

No, you get that because bitcoind could not estimatesmartfee (not because of a bad JSONRPC arg)

Yes, I did. Just see the output in https://drop.geekbox.info/feeestimateerror.log

2019-05-31T18:22:32.564Z lightningd(10275): bitcoin-cli estimatesmartfee 2 CONSERVATIVE exited with status 1

When you run this very command bitcoin-cli estimatesmartfee 2 CONSERVATIVE against bitcoind 0.17.1, you get the exact same error as I got. It's bitcoind telling you, that you gave the wrong parameters. If you run it against 0.18.0 it works, also the help text of bitcoind describing the estimatesmartfee reflects that.