Closed MarianRaphael closed 1 year ago
I'm increasing the sizing for this. The requirement here is all about presenting the information to the user. But first we have to do the work to be able to monitor and capture that information in any meaningful way for each of the drivers - localfs/docker/k8s. The solutions will be different for each driver.
Discussion during product sync: priority on Kubernetes
Having thought about this some more and talked to @ppawlowski, I have decided the best option may be to add a Prometheus endpoint to each instance. We can then either get the launcher to scrape it and make the data available to the forge app, or have the forge app do it directly. (I like the idea of the launcher doing it.)
This will work for all platforms, not just k8s.
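For illustration, a minimal sketch of the launcher-side scrape could look like the following. This assumes the instance serves the metrics on its admin HTTP root at the default port and that the launcher runs on Node 18+ (so global fetch is available); the URL, path, and interval are assumptions, not the actual nr-launcher configuration.
const METRICS_URL = 'http://127.0.0.1:1880/metrics' // assumed instance address and path
const SCRAPE_INTERVAL = 5000 // ms - illustrative only

async function scrapeMetrics () {
    try {
        const response = await fetch(METRICS_URL)
        if (!response.ok) {
            return null
        }
        // prom-client serves metrics in the Prometheus text exposition format
        return await response.text()
    } catch (err) {
        // the instance may not be up yet - skip this sample
        return null
    }
}

setInterval(async () => {
    const metrics = await scrapeMetrics()
    if (metrics) {
        // parse the values and/or forward them to the forge app here
    }
}, SCRAPE_INTERVAL)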
The following NR plugin code is a PoC of what would be needed.
const client = require('prom-client')

module.exports = (RED) => {
    RED.plugins.registerPlugin('flowfuse-nr-metrics', {
        settings: {
            '*': { exportable: true }
        },
        onadd: function () {
            // Collect the standard Node.js process metrics (CPU, memory, event loop)
            const collectDefaultMetrics = client.collectDefaultMetrics
            const Registry = client.Registry
            const register = new Registry()
            collectDefaultMetrics({ register })
            // Expose them on the admin HTTP endpoint in Prometheus text format
            RED.httpAdmin.get('/metrics', async function (req, res) {
                const metrics = await register.metrics()
                res.send(metrics)
            })
        }
    })
}
I propose adding this to the nr-launcher.
We can also add our own custom metrics to this.
We may want to pick a different URL than /metrics.
This should give us memory usage and a start on CPU metrics (but we may need to scale them to match the stack limits).
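As a sketch of what a custom metric could look like alongside the default collectors - the metric name and the way its value is sourced are illustrative assumptions, not a final design:
const client = require('prom-client')
const register = new client.Registry()

// Standard process metrics, as in the PoC above
client.collectDefaultMetrics({ register })

// Hypothetical custom gauge - the name and value source are assumptions
const flowCount = new client.Gauge({
    name: 'nodered_deployed_flows',
    help: 'Number of flows currently deployed in the instance',
    registers: [register]
})
flowCount.set(0) // a real implementation would update this from the runtime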
OK, so I've exposed the CPU and memory constraints to the nr-launcher, but we need to decide when to trigger alerts.
I suggest that if the instance is above 75% of the limit for more than X samples, it should add a message to the audit log.
The question is what X should be (samples are currently taken every 5 seconds).
Also, when should we reset and resend the alert?
Do we only send it once, or do we resend if the usage drops below the threshold for Y samples and then goes back up? I don't want to flood the audit log.
@knolleary @MarianRaphael comments?
I'm suggesting averaging CPU and memory over the last minute (12 samples) for the trigger, and reporting once until that average falls back under the threshold.
The trigger point is 75% of the limits.
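A rough sketch of that trigger logic, assuming 5-second samples and a known limit from the stack; addToAuditLog and the event name are placeholders, not the actual launcher API:
const SAMPLE_WINDOW = 12   // last minute at 5 second samples
const TRIGGER_RATIO = 0.75 // 75% of the stack limit

const samples = []
let alertActive = false

function onSample (usedBytes, limitBytes) {
    samples.push(usedBytes / limitBytes)
    if (samples.length > SAMPLE_WINDOW) {
        samples.shift()
    }
    if (samples.length < SAMPLE_WINDOW) {
        return // not enough history yet
    }
    const average = samples.reduce((a, b) => a + b, 0) / samples.length
    if (average >= TRIGGER_RATIO && !alertActive) {
        alertActive = true
        // placeholder for however the launcher actually records audit events
        addToAuditLog('resource.memory', { average })
    } else if (average < TRIGGER_RATIO && alertActive) {
        // average has fallen back under the threshold - arm the alert again
        alertActive = false
    }
}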
5 seconds is quite frequent IMO; we should increase this to 15-30 seconds. Regarding alerting, I would approach it the way most monitoring systems do - use an evaluation period. If CPU/memory stays above the warning/critical level over a 5-minute period, send an alert, and re-send every 15 minutes. I would use the same approach for the cool-down period - if the resource is below the threshold level for 5 minutes, send a cool-down notification.
One common scenario is that during the installation of npm packages, the resources aren't sufficient. My fear with a 5-minute period is that no info is triggered before the instance crashes. Would it be possible to add an additional rule for 100% utilization that triggers immediately?
Node-RED shells out to npm to install modules; this version only monitors the Node-RED process itself (as that is what works across all platforms).
At the moment, nothing is going to catch the case where the separate npm process (or one of its possibly many children [python/make/gcc]) consumes all the memory, as the pod will just get killed when it hits the limits.
Verified on Staging
This item is now complete.
Description
Add a feature to the Audit Log that displays alerts when the resources (CPU, memory) of a Node-RED instance reach 80%, 90%, and 100% utilization. This feature is aimed at helping users decide whether a resource upgrade is needed and at identifying the root cause of instance crashes.
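For illustration only, the 80/90/100% bands could map onto audit log levels roughly like this; the function and event names are assumptions, not the shipped API:
// Hypothetical mapping of utilisation (0..1) to an audit log alert level
function alertLevel (utilisation) {
    if (utilisation >= 1.0) return 'resource.critical' // 100%
    if (utilisation >= 0.9) return 'resource.warning'  // 90%
    if (utilisation >= 0.8) return 'resource.info'     // 80%
    return null // below 80% - no alert
}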
Epic
https://github.com/flowforge/flowforge/issues/223
User Story
As a FlowFuse User, I want to be informed about the resource utilization of each Node-RED instance so that I can make informed decisions about upgrades and troubleshoot issues proactively.
Acceptance Criteria
Have you provided an initial effort estimate for this issue?
I have provided an initial effort estimate