Open robmarcer opened 4 months ago
I have checked my spam.
This instance is regularly failing the heartbeat check during its normal operation.
At 20/07/2024 11:24:56
, it failed enough heartbeat checks in a row for the launcher to decide the instance was hung - so attempted to restart it.
It then hit the context issue so failed to restart. The context error causes the instance to fail in a way the launcher doesn't handle properly - so it didn't reattempt to restart. The context error is tracked in https://github.com/FlowFuse/flowfuse/issues/4230
The sending of email alerts is based on the audit-log events generated by the launcher. The question here is why did the hang restart at 11:24:56
not get logged as an audit event.
20/07/2024 10:14:48 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:28:41 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:28:48 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:42:33 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:42:41 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:56:33 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:56:41 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:59:11 [error] {"length":213,"name":"error","severity":"ERROR","code":"23505","detail":"Key (\"DashboardVersion\", \"Installs\")=(db1, 1129) already exists.","schema":"public","table":"Dashboards","constraint":"unique","file":"nbtinsert.c","line":"670","routine":"_bt_check_unique"}
20/07/2024 10:59:11 [error] {"length":212,"name":"error","severity":"ERROR","code":"23505","detail":"Key (\"DashboardVersion\", \"Installs\")=(db2, 149) already exists.","schema":"public","table":"Dashboards","constraint":"unique","file":"nbtinsert.c","line":"670","routine":"_bt_check_unique"}
20/07/2024 11:10:41 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:10:48 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:41 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:48 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:56 [system] Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:56 [system] Node-RED hang detected.
20/07/2024 11:24:56 [system] Stopping Node-RED
20/07/2024 11:24:56 [info] Stopping flows
20/07/2024 11:24:56 [info] Stopped flows
20/07/2024 11:24:57 [system] Node-RED exited rc=0 signal=null
20/07/2024 11:24:57 [system] Node-RED unexpectedly stopped after: 69976s
20/07/2024 11:24:57 [system] Starting Node-RED
20/07/2024 11:24:59 [info] Welcome to Node-RED
===================
20/07/2024 11:24:59 [info] Node-RED version: v3.0.2
20/07/2024 11:24:59 [info] Node.js version: v16.20.1
20/07/2024 11:24:59 [info] Linux 5.10.205-195.804.amzn2.aarch64 arm64 LE
20/07/2024 11:25:01 [info] Loading palette nodes
20/07/2024 11:25:02 [info] FlowForge HTTP Authentication Plugin loaded
20/07/2024 11:25:02 [info] FlowForge Team Library Plugin loaded
20/07/2024 11:25:02 [] FlowForge Light Theme Plugin loaded
20/07/2024 11:25:02 [] FlowForge Dark Theme Plugin loaded
20/07/2024 11:25:04 [system] Node-RED health check failed: RequestError: connect ECONNREFUSED 127.0.0.1:1880 (http://127.0.0.1:1880/)
20/07/2024 11:25:05 [info] Dashboard version 3.5.0 started at /ui
20/07/2024 11:25:06 [info] Settings file : /data/settings.js
20/07/2024 11:25:06 [info] Context store : 'memory' [module=memory]
20/07/2024 11:25:06 [info] Context store : 'persistent' [module=custom]
20/07/2024 11:25:09 [error] Failed to start server:
20/07/2024 11:25:09 [error] Error: Error loading context store: TimeoutError: Timeout awaiting 'request' for 3000ms
at /usr/src/node-red/node_modules/@node-red/runtime/lib/nodes/context/index.js:180:15
at processTicksAndRejections (node:internal/process/task_queues:96:5)
20/07/2024 11:25:09 [system] Node-RED exited rc=0 signal=null
22/07/2024 15:23:18 [system] Loading project settings
22/07/2024 15:23:19 [system] Target state is 'running'
22/07/2024 15:23:19 [system] Starting Node-RED
There wouldn't be an alert for a hang detected, because it would try an restart (as it did).
We only raise "crashed" events if the exit code is not 0
which it is in all these cases as even with the uncaught exception it has a return code of 0
.
Not sure what to do with this Issue.
We know why the email wasn't sent (return code 0). My gut feel is a Node-RED PR to have it return -1 if exiting on an uncaught exception?
Current Behavior
This instance has crashed and did not recover on 20/07/24 17:25:09 (UTC + 6) - https://app.flowfuse.com/instance/3f3be290-6f82-4883-9372-d54ff81a2b0c/logs
I have FF configured to send me alerts when this happens - https://app.flowfuse.com/instance/3f3be290-6f82-4883-9372-d54ff81a2b0c/settings/alerts
I didn't get an alert - https://d.pr/i/Qr6qVe
Expected Behavior
I should have received an email to alert me that the instance was down.
Steps To Reproduce
I don't have a process to manually crash this instance. If we need to do that for testing that is OK as long as the instance is brought back online within 5 minutes.
Environment
Have you provided an initial effort estimate for this issue?
I have provided an initial effort estimate