ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Neverending stream of log messages "Owning subdaemon channeld died" #4698

Open fiatjaf opened 3 years ago

fiatjaf commented 3 years ago

Issue and Steps to Reproduce

[screenshot: 2021-08-05-082930_1014x942_scrot]

See also this video so you can feel my pain as the log messages keep going on and on while I watch:

[video attachment]

Basically the logs repeat these lines:

2021-08-05T11:31:19.893Z INFO    03144fcc73cea41a002b2865f98190ab90e4ff58a2ce24d3870f5079081e42922d-channeld-chan#8559: Peer connection lost
2021-08-05T11:31:19.894Z INFO    03144fcc73cea41a002b2865f98190ab90e4ff58a2ce24d3870f5079081e42922d-chan#8559: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
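To get a quick sense of which channels a flood like this is coming from, something along these lines works (a sketch only; lightningd.log stands in for wherever your log-file option actually writes):

grep 'Owning subdaemon channeld died' lightningd.log | awk '{print $3}' | sort | uniq -c | sort -rn

This prints a count per peer-chan# prefix, so a single misbehaving channel stands out immediately.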

I had more nodes in this situation, as explained on https://t.me/lightningd/6427. There is a related debugging session with Rusty around this: https://t.me/lightningd/6565.

At least one of the nodes that was in this state closed the channel unilaterally. Logs from their side: [screenshot]

getinfo output

{
   "id": "02c16cca44562b590dd279c942200bdccfd4f990c3a69fad620c10ef2f8228eaff",
   "alias": "@lntxbot",
   "color": "296683",
   "num_peers": 80,
   "num_pending_channels": 1,
   "num_active_channels": 71,
   "num_inactive_channels": 0,
   "address": [
      {
         "type": "ipv4",
         "address": "5.2.67.89",
         "port": 9735
      }
   ],
   "binding": [
      {
         "type": "ipv4",
         "address": "0.0.0.0",
         "port": 9735
      }
   ],
   "version": "v0.10.1rc1-3-g7bde0ea",
   "blockheight": 694304,
   "network": "bitcoin",
   "msatoshi_fees_collected": 35652632,
   "fees_collected_msat": "35652632msat",
   "lightning-dir": "/home/fiatjaf/.lightning1/bitcoin"
}
openoms commented 3 years ago

Ran into a similar issue today. Tried to restart, but it keeps going, repeating:

2021-09-30T21:51:45.211Z INFO    021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-chan#5: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (-1)
2021-09-30T21:51:48.315Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: FATAL SIGNAL 6 (version 0.10.1)
2021-09-30T21:51:48.315Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: common/daemon.c:44 (send_backtrace) 0x56513af576e3
2021-09-30T21:51:48.316Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: common/daemon.c:52 (crashdump) 0x56513af57735
2021-09-30T21:51:48.316Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7c1e20f
2021-09-30T21:51:48.316Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7c1e18b
2021-09-30T21:51:48.316Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7bfd858
2021-09-30T21:51:48.316Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7bfd728
2021-09-30T21:51:48.316Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7c0ef35
2021-09-30T21:51:48.317Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:409 (set_channel_type) 0x56513af4762c
2021-09-30T21:51:48.317Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:3109 (peer_reconnect) 0x56513af4c57c
2021-09-30T21:51:48.317Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:3826 (init_channel) 0x56513af4cdb3
2021-09-30T21:51:48.317Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:3894 (main) 0x56513af4cfa5
2021-09-30T21:51:48.318Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7bff0b2
2021-09-30T21:51:48.318Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x56513af4657d
2021-09-30T21:51:48.318Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0xffffffffffffffff
2021-09-30T21:51:48.318Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: FATAL SIGNAL (version 0.10.1)
2021-09-30T21:51:48.318Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: common/daemon.c:44 (send_backtrace) 0x56513af576e3
2021-09-30T21:51:48.319Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: common/status.c:214 (status_failed) 0x56513af63fca
2021-09-30T21:51:48.319Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: common/subdaemon.c:25 (status_backtrace_exit) 0x56513af640dc
2021-09-30T21:51:48.319Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: common/daemon.c:55 (crashdump) 0x56513af5773b
2021-09-30T21:51:48.319Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7c1e20f
2021-09-30T21:51:48.319Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7c1e18b
2021-09-30T21:51:48.319Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7bfd858
2021-09-30T21:51:48.320Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7bfd728
2021-09-30T21:51:48.320Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7c0ef35
2021-09-30T21:51:48.320Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:409 (set_channel_type) 0x56513af4762c
2021-09-30T21:51:48.320Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:3109 (peer_reconnect) 0x56513af4c57c
2021-09-30T21:51:48.320Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:3826 (init_channel) 0x56513af4cdb3
2021-09-30T21:51:48.321Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: channeld/channeld.c:3894 (main) 0x56513af4cfa5
2021-09-30T21:51:48.321Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x7f09e7bff0b2
2021-09-30T21:51:48.321Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0x56513af4657d
2021-09-30T21:51:48.321Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: backtrace: (null):0 ((null)) 0xffffffffffffffff
2021-09-30T21:51:48.321Z **BROKEN** 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: STATUS_FAIL_INTERNAL_ERROR: FATAL SIGNAL
2021-09-30T21:51:48.321Z UNUSUAL 021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-channeld-chan#5: Status closed, but waitpid 516096 says No child processes
2021-09-30T21:51:48.321Z INFO    021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828-chan#5: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (-1)

In the journalctl output there is another clue:

Sep 30 23:22:22 nadmin-BHYVE sh[601525]:         ???:0
Sep 30 23:22:25 nadmin-BHYVE sh[601624]: lightning_channeld: channeld/channeld.c:409: set_channel_type: Assertion `!channel->option_static_remotekey' failed.
Sep 30 23:22:25 nadmin-BHYVE sh[601624]: lightning_channeld: FATAL SIGNAL 6 (version 0.10.1)
Sep 30 23:22:25 nadmin-BHYVE sh[601624]: 0x55d1eeb0169b send_backtrace
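For reference, the relevant lines can be pulled straight out of the journal with something like this (assuming lightningd runs under a systemd unit named lightningd; adjust the unit name to your setup):

journalctl -u lightningd -o cat | grep -E 'Assertion|FATAL SIGNAL'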

The machine is an Ubuntu VM on FreeBSD using the bhyve hypervisor:

Linux nadmin-BHYVE 5.11.0-37-generic #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

getinfo:

{
   "id": "020fc7e62c9956a34bcdba309a4809ccfd7adcae4b8b644cac81b4dc58d43a2479",
   "alias": "LNroadz",
   "color": "020fc7",
   "num_peers": 5,
   "num_pending_channels": 0,
   "num_active_channels": 4,
   "num_inactive_channels": 0,
   "address": [
      {
         "type": "ipv4",
         "address": "127.0.0.1",
         "port": 9736
      },
      {
         "type": "torv3",
         "address": "dt2xgzg4wg5zksqlzxmtqx3yx4dnic7hlysvycfvunbrgoatfzsnosyd.onion",
         "port": 9736
      }
   ],
   "binding": [
      {
         "type": "ipv4",
         "address": "127.0.0.1",
         "port": 9736
      }
   ],
   "version": "0.10.1",
   "blockheight": 702931,
   "network": "bitcoin",
   "msatoshi_fees_collected": 4000,
   "fees_collected_msat": "4000msat",
   "lightning-dir": "/home/bitcoin/.lightning/bitcoin"
}

Note how getinfo shows 4 active channels, but one is clearly down according to listpeers:

 {
         "id": "021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828",
         "connected": false,
         "channels": [
            {
               "state": "CHANNELD_NORMAL",
And grepping listpeers for the connection status of each peer:

cl listpeers | grep connected -B1
         "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
         "connected": true,
--
               "status": [
                  "CHANNELD_NORMAL:Reconnected, and reestablished.",
--
         "id": "03a7c61c056023c804c6d63693345e00ed8e2b28c8d2e0c455964bfff31128df40",
         "connected": true,
--
               "status": [
                  "CHANNELD_NORMAL:Reconnected, and reestablished.",
--
         "id": "021f98b9898720f8633c93faf0aa54ab399d277464e502d1111b233c2cf4064828",
         "connected": false,
--
         "id": "0253029d39ca97c4d271af8926b4f3fec37a3aa87b5e538329845f2fc76146fa28",
         "connected": true,
--
               "status": [
                  "CHANNELD_NORMAL:Reconnected, and reestablished.",

Also, I cannot switch off this flood of **BROKEN** messages from the log:

$ lightningd --help | grep log
--log-level <arg>                                 log level (io, debug, info, unusual, broken) [:prefix] (default: info)
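Going by that help text, raising the level cuts the INFO noise, and the optional :prefix form scopes a level to a single peer/channel prefix; note that broken is the highest severity listed, so **BROKEN** lines will still appear at any setting (a sketch only; behaviour may differ by version, see lightningd-config):

# drop INFO lines globally (BROKEN still shows, as it is the top severity)
lightningd --log-level=unusual
# or scope a level to one peer prefix, per the <level>[:prefix] syntax above
lightningd --log-level=debug:021f98b9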

Logs are 260 MB after 6 hours; I have set up logrotate.
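For reference, a minimal logrotate rule along these lines keeps that in check (the log path is an assumption based on the lightning-dir above; point it at your actual log-file location):

sudo tee /etc/logrotate.d/lightningd <<'EOF'
/home/bitcoin/.lightning/bitcoin/lightningd.log {
    daily
    rotate 7
    compress
    copytruncate
}
EOF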

UPDATE: building the latest master got rid of the BROKEN messages and journal errors, so it is now the same as for fiatjaf. Endlessly repeating:

Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died

and the channel disconnected despite being shown as active in getinfo.

Will wait for a comment here before attempting to close.

UPDATE 2: a couple of hours later another peer of this node reported the same issue. By that time the channel with the originally problematic peer had reconnected, but now there was the same error flood with this new one.

The operator restarted, and the issue now seems to be gone. All channels are active (getinfo) and connected (listpeers).

The common factor is that both peers are c-lightning nodes with the dual-funding option active, so the channels are DUALOPENED. Only one channel was actually funded from two sources.

Curious if @fiatjaf still has the error? Could you tell if that channel was using dual funding too?

fiatjaf commented 2 years ago

Sorry, I didn't see you asking me. I still have the error, but I think it's unrelated to yours. My error doesn't cause crashes, and it is happening on basically all channels (not sure it's all, but a ton of channels). And they're not dual-funded.

jairunet commented 1 year ago

Hello all,

I got here via a Google search for the issue I am seeing. After opening two additional channels on a brand-new Embassy + Core Lightning installation, I started noticing that the channels become inactive intermittently. Looking at the Core Lightning logs, I found output similar to what everyone else is showing above:

2023-01-30T07:17:51+09:00 2023-01-29T22:17:51.275Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:18:03+09:00 2023-01-29T22:18:03.271Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:18:05+09:00 2023-01-29T22:18:05.914Z INFO 030cab0dead43dfa95173d6585401c21407c7ab3df326756cd5144c85effbfb110-chan#2: Peer transient failure in CHANNELD_NORMAL: Disconnected
2023-01-30T07:18:14+09:00 2023-01-29T22:18:14.732Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:18:16+09:00 2023-01-29T22:18:16.150Z INFO 0308fd0d35fe851aeb67e0a328f2b8cc33d464beeb521f126b00dfeaf94f769bca-chan#3: Peer transient failure in CHANNELD_NORMAL: Disconnected
2023-01-30T07:18:25+09:00 2023-01-29T22:18:25.141Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:18:28+09:00 2023-01-29T22:18:28.030Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:18:29+09:00 2023-01-29T22:18:29.813Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:19:25+09:00 2023-01-29T22:19:25.217Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:21:41+09:00 2023-01-29T22:21:41.805Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0
2023-01-30T07:23:59+09:00 2023-01-29T22:23:59.373Z INFO 030d7a2ec0aae5d52efc433df8903ace2137db96833f3b966f795f02562cd879ec-chan#1: Peer transient failure in CHANNELD_NORMAL: Disconnected
2023-01-30T07:24:01+09:00 2023-01-29T22:24:01.613Z UNUSUAL plugin-plugin.js: Client disconnected, total clients: 0

I am running Embassy + Core Lightning on a VM with 8GB of RAM and 4 virtual CPUs.

cdecker commented 1 year ago

There have been a couple of issues we've addressed in the upcoming release. Specifically, I think this may be related to lnd being unable to handle warning messages that are not tied to a specific channel. These were sent out in response to outdated gossip being forwarded to us, causing lnd to get upset and disconnect, only for us to reconnect, ask for gossip, get an unparseable gossip message a couple of seconds later, send out a warning, and cause lnd to get upset at us again.

See https://github.com/ElementsProject/lightning/pull/5925 and https://github.com/ElementsProject/lightning/pull/5897

(Or it might be me seeing a new hammer and deciding everything's a nail)

jairunet commented 1 year ago

My apologies, but I am not sure I understand what the solution is. Should I stand by for the next release of Core Lightning?

Just to avoid ambiguity, I am using Core Lightning and I believe the peers I am connected to are using LND. I also believe we are all connected via Tor and not clearnet.

Sincerely,

jairunet commented 1 year ago

I would also say that the problem is not so much the logs but what is really happening to the channels: the RTL application keeps showing the channels as inactive every few minutes. I am not sure what the peer is seeing regarding the channels, but on my end the channels constantly flip between active and inactive.
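One crude way to correlate what RTL shows with the node's own view is to poll listpeers on an interval and log the number of connected peers (a sketch, assuming lightning-cli is on PATH):

while true; do date; lightning-cli listpeers | grep -c '"connected": true'; sleep 30; done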

Sincerely,

cdecker commented 1 year ago

To test out the resolution you'd have to either backport the two patches or run master until we get the release cut. The latter option would actually help us verify that the patches work and that the release is stable (participating in the RC process).
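For anyone wanting to try the latter, a rough sketch of building and running master (prerequisites and paths vary; doc/INSTALL.md in the repo is the authoritative guide):

git clone https://github.com/ElementsProject/lightning.git
cd lightning
./configure && make
sudo make install
# then restart lightningd so the new binaries and subdaemons are picked up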

The issue as described is simple: CLN sends a warning in response to the outdated gossip lnd forwards to us, but lnd does not understand the warning and closes the connection, resulting in the log flood and the flip-flopping of the channel. The patches are a workaround, since the real issue is on lnd's side (not understanding the warning): we simply no longer send warnings to them.