Closed MarianRaphael closed 1 year ago
I'm increasing the sizing for this. The requirement here is all about presenting the information to the user. But first we have to do the work to be able to monitor and capture that information in any meaningful way for each of the drivers - localfs/docker/k8s. The solutions will be different for each driver.
Discussion during product sync: priority on Kubernetes
Having thought about this some more and talked to @ppawlowski, I have decided the best option may be to add a Prometheus endpoint to each instance. We can then either get the launcher to scrape it and make the data available to the forge app, or have the forge app do it directly. (I like the idea of the launcher doing it.)
This will work for all platforms, not just k8s.
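For illustration, a minimal sketch of the launcher-side scrape could look like the following. This assumes the instance serves the metrics on its admin HTTP root at the default port and that the launcher runs on Node 18+ (so global fetch is available); the URL, path, and interval are assumptions, not the actual nr-launcher configuration.
const METRICS_URL = 'http://127.0.0.1:1880/metrics' // assumed instance address and path
const SCRAPE_INTERVAL = 5000 // ms - illustrative only

async function scrapeMetrics () {
    try {
        const response = await fetch(METRICS_URL)
        if (!response.ok) {
            return null
        }
        // prom-client serves metrics in the Prometheus text exposition format
        return await response.text()
    } catch (err) {
        // the instance may not be up yet - skip this sample
        return null
    }
}

setInterval(async () => {
    const metrics = await scrapeMetrics()
    if (metrics) {
        // parse the values and/or forward them to the forge app here
    }
}, SCRAPE_INTERVAL)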
The following NR plugin code is a PoC of what would be needed.
const client = require('prom-client')

module.exports = (RED) => {
    RED.plugins.registerPlugin('flowfuse-nr-metrics', {
        settings: {
            '*': { exportable: true }
        },
        onadd: function () {
            // Collect the standard Node.js process metrics (CPU, memory, event loop)
            const collectDefaultMetrics = client.collectDefaultMetrics
            const Registry = client.Registry
            const register = new Registry()
            collectDefaultMetrics({ register })
            // Expose them on the admin HTTP endpoint in Prometheus text format
            RED.httpAdmin.get('/metrics', async function (req, res) {
                const metrics = await register.metrics()
                res.send(metrics)
            })
        }
    })
}
I propose adding this to the nr-launcher.
We can also add our own custom metrics to this.
We may want to pick a different URL than /metrics.
This should give us memory usage and a start on CPU metrics (but we may need to scale them to match the stack limits).
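As a sketch of what a custom metric could look like alongside the default collectors - the metric name and the way its value is sourced are illustrative assumptions, not a final design:
const client = require('prom-client')
const register = new client.Registry()

// Standard process metrics, as in the PoC above
client.collectDefaultMetrics({ register })

// Hypothetical custom gauge - the name and value source are assumptions
const flowCount = new client.Gauge({
    name: 'nodered_deployed_flows',
    help: 'Number of flows currently deployed in the instance',
    registers: [register]
})
flowCount.set(0) // a real implementation would update this from the runtime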
OK, so I've exposed the CPU and memory constraints to the nr-launcher, but we need to decide when to trigger alerts.
I suggest that if the instance is above 75% of the limit for more than X samples, it should add a message to the audit log.
The question is what X should be (samples are currently taken every 5 seconds).
Also, when should we reset and resend the alert?
Do we only send it once, or do we resend if the usage drops below the threshold for Y samples and then goes back up? I don't want to flood the audit log.
@knolleary @MarianRaphael comments?
I'm suggesting averaging CPU and memory over the last minute (12 samples) for the trigger, and reporting once until that average falls back under the threshold.
The trigger point is 75% of the limits.
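A rough sketch of that trigger logic, assuming 5-second samples and a known limit from the stack; addToAuditLog and the event name are placeholders, not the actual launcher API:
const SAMPLE_WINDOW = 12   // last minute at 5 second samples
const TRIGGER_RATIO = 0.75 // 75% of the stack limit

const samples = []
let alertActive = false

function onSample (usedBytes, limitBytes) {
    samples.push(usedBytes / limitBytes)
    if (samples.length > SAMPLE_WINDOW) {
        samples.shift()
    }
    if (samples.length < SAMPLE_WINDOW) {
        return // not enough history yet
    }
    const average = samples.reduce((a, b) => a + b, 0) / samples.length
    if (average >= TRIGGER_RATIO && !alertActive) {
        alertActive = true
        // placeholder for however the launcher actually records audit events
        addToAuditLog('resource.memory', { average })
    } else if (average < TRIGGER_RATIO && alertActive) {
        // average has fallen back under the threshold - arm the alert again
        alertActive = false
    }
}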
5 seconds is quite frequent IMO; we should increase this to 15-30 seconds. Regarding alerting, I would approach it the way most monitoring systems do - use an evaluation period. If CPU/memory stays above the warning/critical level over a 5-minute period, send an alert, and re-send every 15 minutes. I would use the same approach for the cool-down period - if the resource is below the threshold level for 5 minutes, send a cool-down notification.
One common scenario is that during the installation of npm packages, the resources aren't sufficient. My fear with a 5-minute period is that no info is triggered before the instance crashes. Would it be possible to add an additional rule for 100% utilization that triggers immediately?
Node-RED shells out to npm to install modules; this version only monitors the Node-RED process itself (as that is what works across all platforms).
At the moment, nothing is going to catch the case where the separate npm process (or one of its possibly many children [python/make/gcc]) consumes all the memory, as the pod will just get killed when it hits the limits.
Verified on Staging
This item is now complete.
Description
Add a feature to the Audit Log that displays alerts when the resources (CPU, memory) of a Node-RED instance reach 80%, 90%, and 100% utilization. This feature is aimed at helping users decide whether a resource upgrade is needed and at identifying the root cause of instance crashes.
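For illustration only, the 80/90/100% bands could map onto audit log levels roughly like this; the function and event names are assumptions, not the shipped API:
// Hypothetical mapping of utilisation (0..1) to an audit log alert level
function alertLevel (utilisation) {
    if (utilisation >= 1.0) return 'resource.critical' // 100%
    if (utilisation >= 0.9) return 'resource.warning'  // 90%
    if (utilisation >= 0.8) return 'resource.info'     // 80%
    return null // below 80% - no alert
}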
Epic
https://github.com/flowforge/flowforge/issues/223
User Story
As a FlowFuse User, I want to be informed about the resource utilization of each Node-RED instance so that I can make informed decisions about upgrades and troubleshoot issues proactively.
Acceptance Criteria
Have you provided an initial effort estimate for this issue?
I have provided an initial effort estimate