ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.85k stars 901 forks source link

UNUSUAL plugin-autoclean: Killing plugin: timed out before replying to init #5736

Closed grubles closed 1 year ago

grubles commented 1 year ago

CLN at 02f9c2df249c342a282e47179e643471cdf9cd77

2022-11-24T16:06:16.657Z UNUSUAL plugin-autoclean: Killing plugin: timed out before replying to init                                
2022-11-24T16:06:16.658Z **BROKEN** plugin-autoclean: Plugin marked as important, shutting down lightningd!                         
2022-11-24T16:06:16.658Z DEBUG   lightningd: io_break: lightningd_exit                                                              
2022-11-24T16:06:16.659Z DEBUG   lightningd: io_loop_with_timers: main

My HDDs are too slow?

whitslack commented 1 year ago

I get this all the time too. I have to try several times to get CLN to start up successfully. I suspect there is some flawed logic where lightningd sends off an init message to all the plugins, then does a bunch of startup work itself that can take a long time, then services the plugin init timeout timer before getting around to reading the replies from the plugins that have been sitting in the pipe buffers for a cyber-eternity at this point. Really it shouldn't be timing out any plugins if there are any messages pending to be read from them.

vincenzopalazzo commented 1 year ago

Let me try with a master version, right now!

rustyrussell commented 1 year ago

I get this all the time too. I have to try several times to get CLN to start up successfully. I suspect there is some flawed logic where lightningd sends off an init message to all the plugins, then does a bunch of startup work itself that can take a long time, then services the plugin init timeout timer before getting around to reading the replies from the plugins that have been sitting in the pipe buffers for a cyber-eternity at this point. Really it shouldn't be timing out any plugins if there are any messages pending to be read from them.

This theory is more believable than autoclean taking more then a fraction of a second to start. Unless some other plugin is monopolizing lightningd on large nodes. My node may not be large enough to see it though.

rustyrussell commented 1 year ago

OK, this changes us to wait for plugins to init, to test @whitslack's theory:

diff --git a/lightningd/plugin.c b/lightningd/plugin.c
index e0dff8188..f3c3affa0 100644
--- a/lightningd/plugin.c
+++ b/lightningd/plugin.c
@@ -180,6 +180,9 @@ static void check_plugins_initted(struct plugins *plugins)
    for (size_t i = 0; i < tal_count(plugin_cmds); i++)
        plugin_cmd_all_complete(plugins, plugin_cmds[i]);
    tal_free(plugin_cmds);
+
+   if (plugins->startup)
+       io_break(plugins);
 }

 struct command_result *plugin_register_all_complete(struct lightningd *ld,
@@ -1893,6 +1896,11 @@ void plugins_config(struct plugins *plugins)
            plugin_config(p);
    }

+   /* Wait for them to configure, before continuing: large
+    * nodes can take a while to startup! */
+   if (plugins->startup)
+       io_loop_with_timers(plugins->ld);
+
    plugins->startup = false;
 }
vincenzopalazzo commented 1 year ago

I had another monkey diff just because I'm curious what is the timing of complete the init

diff --git a/plugins/autoclean.c b/plugins/autoclean.c
index 81e50352a..cd8d34f34 100644
--- a/plugins/autoclean.c
+++ b/plugins/autoclean.c
@@ -563,13 +563,16 @@ static const char *init(struct plugin *p,
                        cycle_seconds = deprecated_cycle_seconds;
        }

+       plugin_log(p, LOG_INFORM, "set up timer for clean up");
        cleantimer = plugin_timer(p, time_from_sec(cycle_seconds), do_clean_timer, NULL);

+       plugin_log(p, LOG_INFORM, "restore information from datastore!");
        for (enum subsystem i = 0; i < NUM_SUBSYSTEM; i++) {
                rpc_scan_datastore_str(plugin, datastore_path(tmpctx, i, "num"),
                                       JSON_SCAN(json_to_u64, &total_cleaned[i]));
        }

+       plugin_log(p, LOG_INFORM, "well done, cleanup scheduled. Bye!");
        /* Optimization FTW! */
        rpc_enable_batching(p);
        return NULL;

this is my log output

➜  lightning git:(master) ✗ ./lightningd/lightningd --disable-plugin bcli --plugin=/home/vincent/Github/clightning4j/btcli4j/btcli4j-gen.sh --database-upgrade=true                  
2022-11-25T11:34:12.483Z INFO    plugin-manager: /home/vincent/Github/lightning/plugins/bcli: disabled via disable-plugin
2022-11-25T11:34:18.044Z INFO    lightningd: --------------------------------------------------
2022-11-25T11:34:18.044Z INFO    lightningd: Server started with public key 031ba17032cedd2fdde9610e21b2b776b9e209d570a0dc0066c247696ab12aa225, alias LATENTWATCH-c2-4-gece7784-modded (color #031ba1) and lightningd v22.11rc2-4-gece7784-modded
2022-11-25T11:34:18.056Z INFO    plugin-autoclean: set up timer for clean up
2022-11-25T11:34:18.057Z INFO    plugin-autoclean: restore information from datastore!
2022-11-25T11:34:18.057Z INFO    plugin-autoclean: well done, cleanup scheduled. Bye!
grubles commented 1 year ago

It may be the case CLBOSS is monopolizing lightningd. My setup is BTRFS on HDDs so it's inherently slow to begin with, so maybe it's a mix of a slow machine + CLBOSS.

whitslack commented 1 year ago

My setup is PostgreSQL on F2FS on a pair of SSDs in RAID1, definitely not a slow setup. For what it's worth, I don't think I've seen the init timeout since I solved my issue that was causing very slow startups. Still, the point remains that lightningd shouldn't be blaming plugins for not replying in time when lightningd itself is taking longer than the timeout to notice their replies.

rustyrussell commented 1 year ago

Yes, I've been thinking about this: we need a new kind of timeout, one that counts idle time only....