1. Monitoring Overview
• Track the up/down status of up to 1000 Raspberry Pi nodes in near real-time, with a resolution of 2 seconds.
• Ensure the current status of each node is known within 120 seconds, with the ability to adjust this timeout dynamically.
• Maintain minimal impact on network bandwidth and database storage, using lightweight data transmission and efficient caching strategies.
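To make the bandwidth goal concrete, the following is a minimal sketch of a fixed-size binary heartbeat. The 7-byte layout (node ID, UTC epoch seconds, status flag) and all function names are assumptions made for illustration; the requirements above do not prescribe a wire format.

```python
import struct
from typing import Tuple

# Hypothetical wire format (an assumption, not part of the requirements):
# uint16 node ID, uint32 UTC epoch seconds, uint8 status (1 = up, 0 = down).
# "!" selects network byte order with no padding, so the size is 2 + 4 + 1 = 7 bytes.
HEARTBEAT = struct.Struct("!HIB")


def encode_heartbeat(node_id: int, epoch_s: int, up: bool) -> bytes:
    """Pack one heartbeat into a fixed-size payload."""
    return HEARTBEAT.pack(node_id, epoch_s, 1 if up else 0)


def decode_heartbeat(payload: bytes) -> Tuple[int, int, bool]:
    """Unpack a payload produced by encode_heartbeat."""
    node_id, epoch_s, status = HEARTBEAT.unpack(payload)
    return node_id, epoch_s, bool(status)


def payload_bytes_per_second(nodes: int, interval_s: float) -> float:
    """Aggregate payload rate, excluding any transport or protocol overhead."""
    return nodes * HEARTBEAT.size / interval_s
```

Under these assumptions, 1000 nodes reporting every 2 seconds produce 3500 payload bytes per second in total. Transport headers would add to that figure, so it is best read as a lower bound.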
2. Data Storage and Handling
• Use InfluxDB v1.x for storing aggregated data.
• Cache status data locally on each Pi during network disruptions, so that it can be backfilled or reconstructed correctly once the connection returns or after the Pi reboots.
• Record timestamps in UTC and use NTP to synchronize all Pi nodes, ensuring consistent time across all devices.
• Store timer and resolution-interval settings in the database so they can be updated easily.
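The caching and timestamp requirements above could be sketched as follows: a small SQLite-backed queue on the Pi holds status points, formatted as InfluxDB line protocol, until they have been written upstream. The measurement name `node_status`, the `node` tag, the `up` field, and the `LocalCache` class are hypothetical names chosen for this example. Timestamps are in whole seconds, which assumes writes are sent with second precision.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Tuple


def utc_epoch_seconds() -> int:
    """Current time as UTC epoch seconds (assumes the clock is NTP-synchronized)."""
    return int(datetime.now(timezone.utc).timestamp())


def to_line_protocol(node: str, up: bool, epoch_s: int) -> str:
    """Format one point as: measurement,tag_set field_set timestamp.

    The trailing "i" marks an integer field. Assumes the node name contains
    no spaces, commas, or equals signs, which would need escaping.
    """
    return f"node_status,node={node} up={1 if up else 0}i {epoch_s}"


class LocalCache:
    """Hypothetical on-disk queue that survives reboots when given a file path."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, line TEXT NOT NULL)"
        )
        self.db.commit()

    def add(self, line: str) -> None:
        """Queue one line for later upload."""
        self.db.execute("INSERT INTO pending (line) VALUES (?)", (line,))
        self.db.commit()

    def peek(self, limit: int = 500) -> List[Tuple[int, str]]:
        """Return the oldest queued lines without removing them."""
        cur = self.db.execute(
            "SELECT id, line FROM pending ORDER BY id LIMIT ?", (limit,)
        )
        return cur.fetchall()

    def ack(self, last_id: int) -> None:
        """Delete lines up to last_id once the upstream write has succeeded."""
        self.db.execute("DELETE FROM pending WHERE id <= ?", (last_id,))
        self.db.commit()
```

Separating `peek` from `ack` means a failed upload leaves the queue untouched, so the same points are retried after the network recovers. Committing on every insert is simple but adds SD card writes; batching commits would be a reasonable variation.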
3. System Resilience and Health Monitoring
• Implement a mechanism to detect if the monitoring script has stalled or crashed and attempt to recover or send an alert.
• Provide the ability to mark nodes as “suspended” (maintenance mode), with a database flag that the script will check before sending status updates.
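One possible shape for the stall detection and the maintenance flag is sketched below. The `Watchdog` class, the `should_report` helper, and the injectable clock are illustrative assumptions; how the suspended flag is read from the database is left out.

```python
import time
from typing import Callable


class Watchdog:
    """Tracks the last sign of life from the monitoring loop."""

    def __init__(
        self,
        max_silence_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_silence_s = max_silence_s
        # A monotonic clock is used by default so that NTP adjustments to
        # wall-clock time cannot trigger or mask a stall.
        self.clock = clock
        self.last_beat = clock()

    def beat(self) -> None:
        """Called by the monitoring loop on every successful iteration."""
        self.last_beat = self.clock()

    def is_stalled(self) -> bool:
        """True if the loop has been silent for longer than max_silence_s."""
        return self.clock() - self.last_beat > self.max_silence_s


def should_report(node_suspended: bool) -> bool:
    """Skip status updates while the node's suspended flag is set."""
    return not node_suspended
```

A supervising thread or process would poll `is_stalled()` and then restart the script or raise an alert. Passing the clock in as a parameter keeps the logic testable without real delays.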
4. Alerting System
• Design an abstraction layer for alerts that provides basic channels (such as email) now and supports future integration with systems such as PagerDuty and Slack.
• A node should be considered down, and an alert triggered, if no heartbeat is received from it within 120 seconds; this timeout should be adjustable.
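The alert abstraction and the heartbeat timeout described above might look roughly like this. `AlertChannel`, `InMemoryChannel`, and `AlertDispatcher` are hypothetical names; a real email, PagerDuty, or Slack channel would implement the same `send` method. Alerting only once per outage is an assumption of this sketch.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence, Set, Tuple


class AlertChannel(ABC):
    """Interface that each alert backend (email, PagerDuty, Slack) would implement."""

    @abstractmethod
    def send(self, subject: str, body: str) -> None:
        ...


class InMemoryChannel(AlertChannel):
    """Stand-in channel that records alerts instead of delivering them."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, subject: str, body: str) -> None:
        self.sent.append((subject, body))


class AlertDispatcher:
    """Marks nodes down after a silence timeout and notifies every channel."""

    def __init__(self, channels: Sequence[AlertChannel], timeout_s: float = 120.0) -> None:
        self.channels = list(channels)
        self.timeout_s = timeout_s  # adjustable at runtime
        self._alerted: Set[str] = set()

    def check(self, last_seen: Dict[str, float], now: float) -> List[str]:
        """Return nodes that newly crossed the timeout, alerting once per outage."""
        newly_down: List[str] = []
        for node, seen in last_seen.items():
            if now - seen > self.timeout_s:
                if node not in self._alerted:
                    self._alerted.add(node)
                    newly_down.append(node)
                    for channel in self.channels:
                        channel.send(
                            f"{node} is down",
                            f"No heartbeat for {now - seen:.0f} s",
                        )
            else:
                # A fresh heartbeat clears the state so a later outage alerts again.
                self._alerted.discard(node)
        return newly_down
```

Because the dispatcher only depends on the `AlertChannel` interface, adding a new integration later should not require changes to the detection logic.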
5. Grafana Integration and Visualization
• Use Grafana to display the current up/down status of nodes, using intuitive panels such as single-stat or status panels.
• Create a time-series visualization of up/down events, similar to the SamKnows display, that shows when outages occurred.
• Include options for users to select and analyze data over different time periods dynamically.
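For the outage timeline, it can help to reduce raw up/down samples to a list of outage intervals before or alongside plotting. The sketch below shows one way to do that; the function name and the sample format (time-sorted `(epoch_s, up)` pairs) are assumptions, and the same reduction could instead be done in the dashboard's query layer.

```python
from typing import Iterable, List, Optional, Tuple


def outage_intervals(
    samples: Iterable[Tuple[int, bool]],
    end_ts: int,
) -> List[Tuple[int, int]]:
    """Collapse time-sorted (epoch_s, up) samples into (start, end) outages.

    An outage still open after the last sample is closed at end_ts, which
    would typically be the end of the selected time range.
    """
    outages: List[Tuple[int, int]] = []
    down_since: Optional[int] = None
    for ts, up in samples:
        if not up and down_since is None:
            down_since = ts  # transition up -> down
        elif up and down_since is not None:
            outages.append((down_since, ts))  # transition down -> up
            down_since = None
    if down_since is not None:
        outages.append((down_since, end_ts))
    return outages
```

For example, samples that go down at t=2, recover at t=6, and go down again at t=8 yield two intervals when the range ends at t=10: `(2, 6)` and `(8, 10)`.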
6. Performance Considerations
• Ensure the script is lightweight to avoid overloading the Raspberry Pi nodes, which have limited resources (Raspberry Pi 4B units with 1 GB RAM and 16 GB SD cards) and are also running other tasks.
• Use efficient, non-blocking programming techniques to maintain performance on the Pis.
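As one illustration of a non-blocking approach, the loop below uses `asyncio` so that a slow or unreachable network cannot stall the script: each send is bounded by a timeout and failures are counted rather than raised. The `send` callable, the `iterations` parameter (used here so the example terminates), and the default timeout are assumptions of this sketch.

```python
import asyncio
from typing import Awaitable, Callable


async def heartbeat_loop(
    send: Callable[[], Awaitable[None]],
    interval_s: float,
    iterations: int,
    send_timeout_s: float = 1.0,
) -> int:
    """Send a heartbeat every interval_s seconds; return the number of failed sends."""
    failures = 0
    for _ in range(iterations):
        try:
            # Bound each send so a hung connection cannot block the loop.
            await asyncio.wait_for(send(), timeout=send_timeout_s)
        except (asyncio.TimeoutError, OSError):
            failures += 1  # a real script might cache the point for backfill here
        # Sleeping yields control, letting other coroutines run in the meantime.
        await asyncio.sleep(interval_s)
    return failures
```

It would be started with something like `asyncio.run(heartbeat_loop(send_fn, 2.0, iterations=10))`, where `send_fn` is a placeholder for the actual transmit coroutine. Because the sleep follows the send, the effective period is slightly longer than `interval_s`; scheduling against a monotonic deadline would remove that drift.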
7. Future-Proofing
• Allow for easy integration with future alerting systems like PagerDuty or Slack.
• Provide an intuitive configuration setup for scaling or modifying monitoring parameters.
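A simple way to keep monitoring parameters easy to change is a single validated configuration object populated from stored settings. The sketch below assumes settings arrive as string key/value pairs; the field names and the `config_from_settings` helper are hypothetical, while the defaults mirror the figures stated earlier (2-second resolution, 120-second timeout, 1000 nodes).

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MonitorConfig:
    resolution_s: float = 2.0      # heartbeat resolution
    down_timeout_s: float = 120.0  # silence before a node is considered down
    max_nodes: int = 1000          # expected upper bound on monitored nodes

    def __post_init__(self) -> None:
        # Reject settings that cannot produce meaningful monitoring.
        if self.resolution_s <= 0:
            raise ValueError("resolution_s must be positive")
        if self.down_timeout_s < self.resolution_s:
            raise ValueError("down_timeout_s must be at least resolution_s")
        if self.max_nodes <= 0:
            raise ValueError("max_nodes must be positive")


# Converters for each known setting; unknown keys are ignored.
_CASTS = {"resolution_s": float, "down_timeout_s": float, "max_nodes": int}


def config_from_settings(settings: Mapping[str, str]) -> MonitorConfig:
    """Build a config from string settings, falling back to defaults."""
    values = {
        key: cast(settings[key]) for key, cast in _CASTS.items() if key in settings
    }
    return MonitorConfig(**values)
```

For instance, `config_from_settings({"down_timeout_s": "60"})` would shorten the down timeout to 60 seconds while leaving the other defaults in place, and an invalid value such as a zero resolution raises `ValueError` instead of being silently applied.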
Specification