ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.83k stars 897 forks source link

Connection from eclair lost #1587

Closed kallerosenbaum closed 6 years ago

kallerosenbaum commented 6 years ago

Issue and Steps to Reproduce

I have a channel between my eclair wallet and my c-lightning node. The connection seems to break every second and eclair reconnects. The channel status in Eclair toggles between OFFLINE and NORMAL and a the c-lightning log (of a newly started node) says:

2018-06-15T10:35:52.278Z lightningd(19671): Server started with public key 03b99d5aed3eb17ec8e68e9111756aaba989c00d747f05b59fb4905f7b2ec34068, alias feleena (color #e000e5) and lightningd v0.5.2-2016-11-21-2891-g4990b6b
2018-06-15T10:35:52.553Z lightningd(19671): No fee estimate for Immediate: basing on default fee rate
2018-06-15T10:35:52.553Z lightningd(19671): No fee estimate for Normal: basing on Immediate rate
2018-06-15T10:35:52.553Z lightningd(19671): No fee estimate for Slow: basing on Immediate rate
2018-06-15T10:35:52.553Z lightningd(19671): No fee estimate for Slow: basing on Normal rate
2018-06-15T10:35:53.671Z lightningd(19671): 039376f846cb4e137f3474baa7fbe74ec627745c7d5c54935b99fbb1b60a62c9b3 chan #4: Peer permanent failure in ONCHAIN: Funding transaction spent
2018-06-15T10:35:53.671Z lightningd(19671):      (tx 2a7c4114537264bdd3da23c523964eac97385969cfe10b73c90a5040b181ba2c)
2018-06-15T10:35:53.675Z lightningd(19671): 039376f846cb4e137f3474baa7fbe74ec627745c7d5c54935b99fbb1b60a62c9b3 chan #4: State changed from ONCHAIN to FUNDING_SPEND_SEEN
2018-06-15T10:35:53.883Z lightningd(19671): 039376f846cb4e137f3474baa7fbe74ec627745c7d5c54935b99fbb1b60a62c9b3 chan #4: State changed from FUNDING_SPEND_SEEN to ONCHAIN
2018-06-15T10:36:58.360Z lightningd(19671): lightning_channeld-03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Gossipd gave us bad send_gossip message 0bc80000
2018-06-15T10:36:58.360Z lightningd(19671): lightning_channeld-03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Peer connection lost
2018-06-15T10:36:58.360Z lightningd(19671): 03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Peer transient failure in CHANNELD_NORMAL: lightning_channeld: Owning subdaemon lightning_channeld died (62208)
2018-06-15T10:36:59.868Z lightningd(19671): lightning_channeld-03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Gossipd gave us bad send_gossip message 0bc80000
2018-06-15T10:36:59.868Z lightningd(19671): lightning_channeld-03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Peer connection lost
2018-06-15T10:36:59.868Z lightningd(19671): 03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Peer transient failure in CHANNELD_NORMAL: lightning_channeld: Owning subdaemon lightning_channeld died (62208)
2018-06-15T10:37:01.343Z lightningd(19671): lightning_channeld-03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Gossipd gave us bad send_gossip message 0bc80000
2018-06-15T10:37:01.343Z lightningd(19671): lightning_channeld-03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Peer connection lost
2018-06-15T10:37:01.343Z lightningd(19671): 03c6c4f17deda0593ac0b962cede0870488ae1c1f0d15891d88fd430a4a0f93e73 chan #2: Peer transient failure in CHANNELD_NORMAL: lightning_channeld: Owning subdaemon lightning_channeld died (62208)

... and so on indefinitely ...

Not sure if this is Eclair's, c-lightning's or my fault, but I report it on c-lightning because that's where I have some logging info.

I have some other channels open in c-lightning, and they all seem to work as expected. So I suspect it's an Eclair-clightning issue.

getinfo output

$ lcli getinfo
{
  "id": "03b99d5aed3eb17ec8e68e9111756aaba989c00d747f05b59fb4905f7b2ec34068", 
  "port": 9735, 
  "address": [
    {
      "type": "ipv4", 
      "address": "46.59.13.35", 
      "port": 9735
    }
  ], 
  "binding": [
    {
      "type": "ipv4", 
      "address": "0.0.0.0", 
      "port": 9735
    }
  ], 
  "version": "v0.5.2-2016-11-21-2891-g4990b6b", 
  "blockheight": 527554, 
  "network": "bitcoin"
}
cdecker commented 6 years ago

~That's really strange behavior. It could be an invalid gossip_ping message:~

https://github.com/ElementsProject/lightning/blob/f5ebf8e231bdf0890c123cac69ec9bd37cae6850/gossipd/gossip_wire.csv#L146-L150

~But it's missing the id which makes it invalid. And even if it were valid, it should not be forwarded to channeld...~

Edit: mixed up the message type, this is actually msgtype 3016, gossip_send_gossip. See below.

kallerosenbaum commented 6 years ago

So here's a tcpdump taken as sudo tcpdump -n -i eth0 -w dump.dat 'host c-5eea209f-74736162.cust.telenor.se and tcp port 9735', maybe it can be helpful? It covers a about 4 seconds worth of traffic.

dump.dat.gz

Maybe it's better to turn on debug log level in lightningd? Tips on debugging arguments to lightningd?

kallerosenbaum commented 6 years ago

Here's last 1000 lines of debug log (not taken simultaneously as the tcpdump above):

last1000.log.gz

cdecker commented 6 years ago

I'm afraid the transport layer is encrypted and we won't be able to decrypt it.

cdecker commented 6 years ago

I'm actually pretty sure this issue is on our side, not Eclair. That message should not be sent there imho

cdecker commented 6 years ago

@kallerosenbaum does this error persist across restarts? If yes, can you make a backup of the .lightning/gossip_store file, remove it and then restart the node?

cdecker commented 6 years ago

My error, it seems that this is msgtype=3016, a.k.a., gossip_send_gossip, attempting to send an empty gossip message.

Can you checkout and compile the issue-1587 branch which includes an assert that'll tell us how we got there?

kallerosenbaum commented 6 years ago

It persists across restarts. I did a series of tests, in each test except 1, I used the backed up gossip_store. :

  1. Removing gossip_store: Problem dissappears
  2. Adding back the old gossip_store: Problem reappears
  3. Building and restarting on branch issue-1587: Problem dissappears
  4. Restarting and adding back the old gossip_store again, just to make sure: Problem does not appear (logfile issue-1587-branch.log)
  5. Building and restarting on master: Problem does not appear
  6. Building and restarting on git checkout 9d721ecb994813a84c9c631deb23117678bbd0d9: Problem reappears

The problem seems to be fixed somewhere after 9d721ecb994813a84c9c631deb23117678bbd0d9.

I'm uploading the issue-1587-branch eventhough it didn't happen there: issue-1587-branch.log.gz

cdecker commented 6 years ago

Excellent work @kallerosenbaum. Still a bit curious about which commit accidentally fixed this issue. Could you send me a copy of your gossip_store? It stores only publicly available information, so that should not be a concern. With it we can bisect and pin down the exact commit.

kallerosenbaum commented 6 years ago

Sure: gossip_store_backup.gz

cdecker commented 6 years ago

Thanks, I'll try to figure out what went wrong and report back :+1:

cdecker commented 6 years ago

Hm, I still can't reproduce this, even with commit 9d721ecb994813a84c9c631deb23117678bbd0d9. But if I understood correctly master doesn't have that issue anymore, right?

cdecker commented 6 years ago

@rustyrussell pointed out that this was fixed in 518f8fc8d0231da6fe8396180b6d4be1a87ab746, cheers :wink:

kallerosenbaum commented 6 years ago

Ok, thanks for sorting it out.