Open 0xB10C opened 8 months ago
Automatically detecting spy nodes is a possibility too. Here, the anomaly is that they only listen to data from us but never send us new transactions (or blocks). This is a bit more involved, as it requires us to keep track of the state of what we send to a peer and what they send us.
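To make the bookkeeping concrete, here is a minimal sketch of the per-peer state this would need. Everything in it is an assumption for illustration: the `PeerStats` fields, the `is_spy_candidate` helper, and the thresholds are hypothetical and do not correspond to an existing API or to tuned values.

```python
from dataclasses import dataclass


@dataclass
class PeerStats:
    """Per-peer counters (hypothetical; not an existing API)."""
    connected_seconds: float = 0.0
    announcements_sent: int = 0      # tx/block announcements we sent to the peer
    announcements_received: int = 0  # new tx/block announcements the peer sent us


def is_spy_candidate(stats: PeerStats,
                     min_connected_seconds: float = 3600.0,
                     min_sent: int = 100) -> bool:
    """Flag peers that only listen: we sent plenty, they sent nothing."""
    if stats.connected_seconds < min_connected_seconds:
        return False  # too early to judge this peer
    if stats.announcements_sent < min_sent:
        return False  # we gave the peer too little to react to
    return stats.announcements_received == 0


# Made-up example data: (connected seconds, sent, received)
peers = {
    "peer-a": PeerStats(7200, 500, 320),
    "peer-b": PeerStats(7200, 480, 0),
    "peer-c": PeerStats(60, 5, 0),
}
suspects = [name for name, s in peers.items() if is_spy_candidate(s)]
print(suspects)
```

A peer is only judged once it has been connected long enough and we have sent it enough announcements, so short-lived connections are not flagged by accident.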
Detecting the anomalies as described in https://arxiv.org/pdf/2108.00815v1 should be possible too.
Indeed, let's see if those metrics can be used to identify the anomaly.
We should also monitor outbound connections. We expect to always have a minimum of 11 connections. If we have fewer for a longer timeframe or a large drop of outbound connections across multiple nodes at the same time, it's probably an anomaly.
Indeed we can have alerts on that too!
Having alerts on individual nodes as well as overall could be a better idea, because then we'll know which nodes are experiencing anomalies. Any thoughts on that?
Yes, sounds good!
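The outbound-connection check discussed above, both per node and across the fleet, could be sketched roughly like this. The function names, the sample layout, and the thresholds (`min_consecutive`, `min_drop`, `min_nodes`) are assumptions for illustration; only the minimum of 11 connections comes from the discussion.

```python
MIN_OUTBOUND = 11  # expected minimum from the discussion above


def nodes_below_minimum(samples, min_outbound=MIN_OUTBOUND, min_consecutive=3):
    """Return nodes whose last `min_consecutive` samples are all below the minimum.

    `samples` maps a node name to a list of outbound connection counts,
    oldest first, taken at a fixed scrape interval.
    """
    flagged = []
    for node, counts in samples.items():
        recent = counts[-min_consecutive:]
        if len(recent) == min_consecutive and all(c < min_outbound for c in recent):
            flagged.append(node)
    return sorted(flagged)


def fleet_wide_drop(samples, min_drop=3, min_nodes=2):
    """Return True if at least `min_nodes` nodes lost `min_drop` or more
    outbound connections between their last two samples."""
    dropped = [
        node for node, counts in samples.items()
        if len(counts) >= 2 and counts[-2] - counts[-1] >= min_drop
    ]
    return len(dropped) >= min_nodes


# Made-up example data
samples = {
    "node-1": [11, 11, 11, 11],
    "node-2": [11, 9, 8, 8],    # below the minimum for a longer timeframe
    "node-3": [11, 11, 11, 6],  # sudden drop
    "node-4": [11, 11, 11, 7],  # sudden drop at the same time
}
print(nodes_below_minimum(samples))
print(fleet_wide_drop(samples))
```

The per-node check tells us which node is affected, while the fleet-wide check catches simultaneous drops that have not yet lasted long enough to trip the per-node rule.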
I came across this blog post, "How to use Prometheus to efficiently detect anomalies at scale" (based on this talk: https://www.youtube.com/watch?v=BTAba-Vq3xE). This looks interesting and is something I want to try out.
They published Prometheus recording rules here: https://github.com/grafana/promql-anomaly-detection
The current Grafana dashboards show the raw numbers from Prometheus (via the metrics tool). Anomaly detection and alerting are not yet implemented. For example:
Here, an anomaly could be a sudden drop in inbound peers connected to one or more nodes, as in https://b10c.me/observations/05-inbound-connection-flooder-down/. To detect this, a z-score could be used: if its absolute value is above a certain threshold, send an alert.
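As a rough illustration of the z-score idea, the sketch below compares the latest inbound peer count against a baseline window. The helper names, the threshold of 3, and the sample numbers are assumptions, not values taken from the dashboards.

```python
import math
from statistics import mean, pstdev


def z_score(history, current):
    """Z-score of `current` relative to the baseline window `history`."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat baseline: any deviation at all is infinitely unusual.
        return 0.0 if current == mu else math.copysign(math.inf, current - mu)
    return (current - mu) / sigma


def is_anomalous(history, current, threshold=3.0):
    """Alert on drops and spikes alike by comparing the absolute z-score."""
    return abs(z_score(history, current)) > threshold


# Made-up inbound peer counts for one node, oldest first
history = [100, 102, 98, 101, 99, 100, 103, 97]
z = z_score(history, 40)  # sudden drop to 40 inbound peers
print(round(z, 2), is_anomalous(history, 40), is_anomalous(history, 101))
```

A drop produces a negative z-score, which is why the comparison uses the absolute value rather than the raw score.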
Here, a spike in outbound (and inbound) address messages across all nodes could indicate an anomaly. Again, a z-score could be used. Maybe there are other approaches worth exploring for detecting anomalies.
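One such alternative, named here only as a possibility and not something proposed in this issue, is the modified z-score based on the median and the median absolute deviation (MAD), which is less distorted by earlier outliers in the baseline. The sketch below applies it to address message counts summed across all nodes. The data layout, the helper name, and the cut-off of 3.5 are assumptions.

```python
from statistics import median


def modified_z_score(history, current):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        return 0.0  # baseline has no spread; this sketch cannot judge it
    return 0.6745 * (current - med) / mad


# Made-up address messages per interval and node, oldest first
per_node = {
    "node-1": [50, 55, 47, 52, 49, 300],
    "node-2": [48, 50, 53, 46, 52, 280],
    "node-3": [51, 49, 50, 54, 45, 310],
}

# Sum across nodes per interval to get a fleet-wide series
totals = [sum(values) for values in zip(*per_node.values())]
baseline, latest = totals[:-1], totals[-1]

score = modified_z_score(baseline, latest)
print(totals, round(score, 2), score > 3.5)
```

Summing across nodes first means the alert fires on a fleet-wide spike; running the same function on each node's series would give the per-node view.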
This issue can be used for discussion and brainstorming.