FlowFuse / flowfuse

Build bespoke, flexible, and resilient manufacturing low-code applications with FlowFuse and Node-RED
https://flowfuse.com
Other
263 stars 63 forks source link

Alert email was not sent when NR instance crashed #4238

Open robmarcer opened 1 month ago

robmarcer commented 1 month ago

Current Behavior

This instance has crashed and did not recover on 20/07/24 17:25:09 (UTC + 6) - https://app.flowfuse.com/instance/3f3be290-6f82-4883-9372-d54ff81a2b0c/logs

I have FF configured to send me alerts when this happens - https://app.flowfuse.com/instance/3f3be290-6f82-4883-9372-d54ff81a2b0c/settings/alerts

I didn't get an alert - https://d.pr/i/Qr6qVe

Expected Behavior

I should have received an email to alert me that the instance was down.

Steps To Reproduce

I don't have a process to manually crash this instance. If we need to do that for testing that is OK as long as the instance is brought back online within 5 minutes.

Environment

Have you provided an initial effort estimate for this issue?

I have provided an initial effort estimate

robmarcer commented 1 month ago

I have checked my spam.

knolleary commented 1 month ago

This instance is regularly failing the heartbeat check during its normal operation.

At 20/07/2024 11:24:56, it failed enough heartbeat checks in a row for the launcher to decide the instance was hung - so attempted to restart it.

It then hit the context issue so failed to restart. The context error causes the instance to fail in a way the launcher doesn't handle properly - so it didn't reattempt to restart. The context error is tracked in https://github.com/FlowFuse/flowfuse/issues/4230

The sending of email alerts is based on the audit-log events generated by the launcher. The question here is why did the hang restart at 11:24:56 not get logged as an audit event.

20/07/2024 10:14:48  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:28:41  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:28:48  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:42:33  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:42:41  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:56:33  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:56:41  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 10:59:11  [error]   {"length":213,"name":"error","severity":"ERROR","code":"23505","detail":"Key (\"DashboardVersion\", \"Installs\")=(db1, 1129) already exists.","schema":"public","table":"Dashboards","constraint":"unique","file":"nbtinsert.c","line":"670","routine":"_bt_check_unique"}
20/07/2024 10:59:11  [error]   {"length":212,"name":"error","severity":"ERROR","code":"23505","detail":"Key (\"DashboardVersion\", \"Installs\")=(db2, 149) already exists.","schema":"public","table":"Dashboards","constraint":"unique","file":"nbtinsert.c","line":"670","routine":"_bt_check_unique"}
20/07/2024 11:10:41  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:10:48  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:41  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:48  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:56  [system]  Node-RED health check failed: TimeoutError: Timeout awaiting 'request' for 6999ms (http://127.0.0.1:1880/)
20/07/2024 11:24:56  [system]  Node-RED hang detected.
20/07/2024 11:24:56  [system]  Stopping Node-RED
20/07/2024 11:24:56  [info]    Stopping flows
20/07/2024 11:24:56  [info]    Stopped flows
20/07/2024 11:24:57  [system]  Node-RED exited rc=0 signal=null
20/07/2024 11:24:57  [system]  Node-RED unexpectedly stopped after: 69976s
20/07/2024 11:24:57  [system]  Starting Node-RED
20/07/2024 11:24:59  [info]    Welcome to Node-RED
===================

20/07/2024 11:24:59  [info]    Node-RED version: v3.0.2
20/07/2024 11:24:59  [info]    Node.js  version: v16.20.1
20/07/2024 11:24:59  [info]    Linux 5.10.205-195.804.amzn2.aarch64 arm64 LE
20/07/2024 11:25:01  [info]    Loading palette nodes
20/07/2024 11:25:02  [info]    FlowForge HTTP Authentication Plugin loaded
20/07/2024 11:25:02  [info]    FlowForge Team Library Plugin loaded
20/07/2024 11:25:02  []        FlowForge Light Theme Plugin loaded
20/07/2024 11:25:02  []        FlowForge Dark Theme Plugin loaded
20/07/2024 11:25:04  [system]  Node-RED health check failed: RequestError: connect ECONNREFUSED 127.0.0.1:1880 (http://127.0.0.1:1880/)
20/07/2024 11:25:05  [info]    Dashboard version 3.5.0 started at /ui
20/07/2024 11:25:06  [info]    Settings file  : /data/settings.js
20/07/2024 11:25:06  [info]    Context store  : 'memory' [module=memory]
20/07/2024 11:25:06  [info]    Context store  : 'persistent' [module=custom]
20/07/2024 11:25:09  [error]   Failed to start server:
20/07/2024 11:25:09  [error]   Error: Error loading context store: TimeoutError: Timeout awaiting 'request' for 3000ms
    at /usr/src/node-red/node_modules/@node-red/runtime/lib/nodes/context/index.js:180:15
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
20/07/2024 11:25:09  [system]  Node-RED exited rc=0 signal=null
22/07/2024 15:23:18  [system]  Loading project settings
22/07/2024 15:23:19  [system]  Target state is 'running'
22/07/2024 15:23:19  [system]  Starting Node-RED
hardillb commented 1 month ago

There wouldn't be an alert for a hang detected, because it would try an restart (as it did).

We only raise "crashed" events if the exit code is not 0 which it is in all these cases as even with the uncaught exception it has a return code of 0.

https://github.com/FlowFuse/nr-launcher/blob/08b4ad0ce251b0edd06eca160e994d3bdd0512af/lib/launcher.js#L581-L603