VeriNexus / verinexus-speedtest


Develop the Up/down Status sub-system #2

Open VeriNexus opened 1 week ago

VeriNexus commented 6 days ago

Specification

1.  Monitoring Overview
•   Track the up/down status of up to 1000 Raspberry Pi nodes in near real-time, with a resolution of 2 seconds.
•   Ensure the current status of each node is known within 120 seconds, with the ability to adjust this timer dynamically.
•   Maintain minimal impact on network bandwidth and database storage, using lightweight data transmission and efficient caching strategies.
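The 120-second rule above can be sketched as a small classification function. This is a minimal illustration, not the project's implementation: the function name `node_status`, the constant `DEFAULT_TIMEOUT`, and the extra "unknown" state for nodes that have never reported are all assumptions introduced here.

```python
from datetime import datetime, timedelta, timezone

# 120 seconds comes from the spec; it is meant to be adjustable at runtime.
DEFAULT_TIMEOUT = timedelta(seconds=120)


def node_status(last_heartbeat, now, timeout=DEFAULT_TIMEOUT):
    """Return "up", "down", or "unknown" for one node.

    `last_heartbeat` and `now` are timezone-aware UTC datetimes.
    """
    if last_heartbeat is None:
        # Assumption: a node that has never reported is neither up nor down.
        return "unknown"
    return "up" if now - last_heartbeat <= timeout else "down"


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(node_status(now - timedelta(seconds=4), now))    # up
print(node_status(now - timedelta(seconds=121), now))  # down
```

Passing `timeout` as a parameter keeps the threshold adjustable without changing the function itself.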
2.  Data Storage and Handling
•   Use InfluxDB v1.x for storing aggregated data.
•   Cache status data locally on each Pi in case of network disruptions, and ensure data can be backfilled or calculated correctly if the Pi reboots.
•   Record timestamps in UTC and use NTP to synchronize all Pi nodes, ensuring consistent time across all devices.
•   Ensure settings for timers and resolution intervals are stored in the database for easy updates.
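One possible shape for the storage and caching points above is sketched below. It formats a heartbeat as an InfluxDB 1.x line-protocol string and buffers unsent lines locally. The measurement name `node_status`, the tag `node`, the field `up`, and the class `LocalCache` are hypothetical names chosen for illustration. The timestamp is in whole seconds, which assumes the write request is made with second precision. Node IDs are assumed to contain no spaces or commas, which would otherwise need escaping.

```python
from collections import deque


def heartbeat_line(node_id, up, ts_seconds):
    """Format one heartbeat as InfluxDB line protocol (second precision)."""
    # Assumed schema: measurement "node_status", tag "node", integer field "up".
    return f"node_status,node={node_id} up={1 if up else 0}i {ts_seconds}"


class LocalCache:
    """Bounded in-memory buffer for points that could not be sent yet."""

    def __init__(self, max_points=10_000):
        # When the buffer is full, the oldest points are dropped first.
        self._points = deque(maxlen=max_points)

    def add(self, line):
        self._points.append(line)

    def flush(self, send):
        """Send buffered lines oldest-first; stop at the first failure.

        `send` is any callable that takes one line and returns True on success.
        Returns the number of lines sent.
        """
        sent = 0
        while self._points:
            if not send(self._points[0]):
                break
            self._points.popleft()
            sent += 1
        return sent


print(heartbeat_line("pi-001", True, 1704110400))
# node_status,node=pi-001 up=1i 1704110400
```

This buffer lives in memory only, so it would not survive a reboot. Meeting the reboot requirement would need the same queue persisted to disk, which is left out of this sketch.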
3.  System Resilience and Health Monitoring
•   Implement a mechanism to detect if the monitoring script has stalled or crashed and attempt to recover or send an alert.
•   Provide the ability to mark nodes as “suspended” (maintenance mode), with a database flag that the script will check before sending status updates.
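A minimal sketch of both resilience points, under stated assumptions: the monitoring loop calls `tick()` each time it completes a cycle, and a separate supervisor polls `stalled()`. The names `Watchdog` and `should_send` are hypothetical, and the set of suspended node IDs stands in for the database flag described above.

```python
import time


class Watchdog:
    """Detects a stalled loop by tracking time since the last tick."""

    def __init__(self, max_silence, clock=time.monotonic):
        self._max_silence = max_silence  # seconds allowed between ticks
        self._clock = clock              # injectable so the logic can be tested
        self._last_tick = clock()

    def tick(self):
        """Called by the monitoring loop after each completed cycle."""
        self._last_tick = self._clock()

    def stalled(self):
        """True if no tick has arrived within `max_silence` seconds."""
        return self._clock() - self._last_tick > self._max_silence


def should_send(node_id, suspended_nodes):
    """Skip status updates for nodes flagged as suspended (maintenance mode)."""
    return node_id not in suspended_nodes
```

A monotonic clock is used so that NTP adjustments to wall-clock time cannot trigger a false stall.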
4.  Alerting System
•   Design an abstraction layer for alerts, with current basic functionality (like email) and future support for integration with systems like PagerDuty and Slack.
•   Trigger a down alert when no heartbeat has been received from a node within 120 seconds, and make this threshold adjustable.
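The abstraction layer described above could look roughly like the following sketch. Every class name here (`AlertChannel`, `InMemoryChannel`, `AlertDispatcher`) is hypothetical. An email, PagerDuty, or Slack channel would be another subclass of `AlertChannel`; none of those integrations is implemented here.

```python
from abc import ABC, abstractmethod


class AlertChannel(ABC):
    """Interface that every alert backend implements."""

    @abstractmethod
    def send(self, subject, body):
        """Deliver one alert."""


class InMemoryChannel(AlertChannel):
    """Stand-in channel that records alerts instead of delivering them."""

    def __init__(self):
        self.sent = []

    def send(self, subject, body):
        self.sent.append((subject, body))


class AlertDispatcher:
    """Fans one alert out to every configured channel."""

    def __init__(self, channels):
        self._channels = list(channels)

    def node_down(self, node_id, seconds_silent):
        subject = f"Node {node_id} is down"
        body = f"No heartbeat received for {seconds_silent} seconds."
        for channel in self._channels:
            channel.send(subject, body)


channel = InMemoryChannel()
AlertDispatcher([channel]).node_down("pi-007", 120)
print(channel.sent[0][0])  # Node pi-007 is down
```

Because the dispatcher depends only on the `send` interface, adding a new backend later does not require changes to the code that detects down nodes.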
5.  Grafana Integration and Visualization
•   Use Grafana to display the current up/down status of nodes, using intuitive panels such as Singlestat or status panels.
•   Create a time-series visualization of up/down events, similar to the SamKnows display, that shows when outages occurred.
•   Include options for users to select and analyze data over different time periods dynamically.
6.  Performance Considerations
•   Ensure the script is lightweight to avoid overloading the Raspberry Pi nodes, which have limited resources: each is a Raspberry Pi 4B with 1 GB RAM and a 16 GB SD card, and it also runs other tasks.
•   Use efficient, non-blocking programming techniques to maintain performance on the Pis.
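As one illustration of the non-blocking approach, the sketch below uses Python's `asyncio` to send a heartbeat at a fixed interval without tying up a thread while waiting. The function `heartbeat_loop` and its parameters are assumptions made for this example. `send` is any coroutine function supplied by the caller, and `fake_send` stands in for a real network call.

```python
import asyncio


async def heartbeat_loop(send, interval, stop):
    """Call `send()` every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        await send()
        try:
            # Sleep for one interval, but wake immediately if `stop` is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # interval elapsed normally; send the next heartbeat


async def demo():
    sent = []
    stop = asyncio.Event()

    async def fake_send():
        # Stand-in for a real network call; stops the loop after three beats.
        sent.append("beat")
        if len(sent) == 3:
            stop.set()

    await heartbeat_loop(fake_send, interval=0.01, stop=stop)
    return len(sent)


print(asyncio.run(demo()))  # 3
```

Waiting on the stop event rather than calling `asyncio.sleep` lets the loop shut down promptly instead of finishing a full interval first.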
7.  Future-Proofing
•   Allow for easy integration with future alerting systems like PagerDuty or Slack.
•   Provide an intuitive configuration setup for scaling or modifying monitoring parameters.
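A configuration setup along these lines might be sketched as a small validated settings object. The class `MonitorSettings`, its field names, and the helper `settings_from_row` are hypothetical. The defaults of 2 and 120 seconds are taken from the specification, and the row is assumed to arrive as a plain dictionary of stored values.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MonitorSettings:
    heartbeat_interval_s: int = 2   # resolution from the spec
    down_timeout_s: int = 120       # down threshold from the spec

    def __post_init__(self):
        if self.heartbeat_interval_s <= 0:
            raise ValueError("heartbeat_interval_s must be positive")
        if self.down_timeout_s < self.heartbeat_interval_s:
            raise ValueError("down_timeout_s must be >= heartbeat_interval_s")


def settings_from_row(row):
    """Build settings from a dict of stored values, ignoring unknown keys."""
    known = {f.name for f in fields(MonitorSettings)}
    return MonitorSettings(**{k: int(v) for k, v in row.items() if k in known})


settings = settings_from_row({"down_timeout_s": "300", "extra": "ignored"})
print(settings.down_timeout_s)        # 300
print(settings.heartbeat_interval_s)  # 2
```

Validating on construction means a bad value in the database is rejected when settings are loaded rather than surfacing later as incorrect up/down results.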