oscgonfer commented 1 year ago

This issue opens the discussion about health metrics for a device. We currently see recurring problems once devices are deployed, most commonly connectivity issues and hardware faults. Users need an easier debugging process, which we can provide through metrics and analytics computed from the data, plus ad-hoc metrics reported by the physical device.

For now we are addressing this offline, with custom requests to the API, but down the line the process should be integrated into the platform for easier debugging.

To start, we suggest adding a health property to the device in which we can collect various metrics, some calculated on the physical device and some on the platform side. Current proposals:

Platform checks

Total number of points stored versus the maximum possible (based on device creation and last update). For this to work, we would need to either assume a default interval (1 minute for all sensors except PM) or have a way to retrieve it from the physical device, via hardware_info for instance.
Delta between reading time and reception time: this can indicate connectivity issues, but it only makes sense if the publication interval is known. As far as we understand, ingestion time is not currently stored, so it's probably not worth including.
Missing sensors (a combined platform and device check): this could highlight particular hardware issues if a sensor (or all the sensors) disappears.
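As a rough illustration of the first check, the ratio of stored points to the maximum possible could be computed along these lines. This is only a sketch: the function names are hypothetical, and the 1-minute default is the assumed interval mentioned above, not a value read from the device.

```python
from datetime import datetime, timedelta


def expected_points(created_at: datetime, last_update: datetime,
                    interval: timedelta = timedelta(minutes=1)) -> int:
    """Maximum number of readings one sensor could have stored in the period."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    elapsed = last_update - created_at
    if elapsed < timedelta(0):
        raise ValueError("last_update precedes created_at")
    # One reading at creation time plus one per full interval elapsed.
    return elapsed // interval + 1


def completeness(stored_points: int, created_at: datetime, last_update: datetime,
                 interval: timedelta = timedelta(minutes=1)) -> float:
    """Fraction of the expected readings actually stored, capped at 1.0."""
    expected = expected_points(created_at, last_update, interval)
    return min(stored_points / expected, 1.0)
```

A device that reports a different interval through hardware_info would simply pass it as `interval` instead of relying on the default.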

Firmware checks

This could be sent on a /device/<token>/health MQTT topic and ingested into the health table for later use. It could be sent ad hoc or on boot:

Missing sensors, as above
Connectivity timeouts
SD card issues
Too-frequent resets
Last reset reason (available via firmware)
Reason for WARNING state of the device
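On the platform side, handling a message from that topic could look roughly like the sketch below. It assumes the payload is a JSON object; the function name and the example field name (`reset_cause`) are hypothetical, since the actual report format is still to be defined.

```python
import json
import re

# Topic layout taken from the proposal: /device/<token>/health
HEALTH_TOPIC_RE = re.compile(r"^/?device/(?P<token>[^/]+)/health$")


def parse_health_message(topic: str, payload: bytes) -> dict:
    """Extract the device token and the JSON report from a health message."""
    match = HEALTH_TOPIC_RE.match(topic)
    if match is None:
        raise ValueError(f"not a health topic: {topic!r}")
    report = json.loads(payload.decode("utf-8"))
    if not isinstance(report, dict):
        raise ValueError("health payload must be a JSON object")
    # The report is kept as-is so the firmware can add fields freely.
    return {"token": match["token"], "report": report}
```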

@pral2a @vicobarberan please provide inputs to build it progressively.

oscgonfer commented 1 year ago

Adding to this topic, one possibility would be to implement simple device metrics, as already suggested here https://github.com/fablabbcn/smartcitizen-api/issues/100#issuecomment-446579722, for those checks that can be done in the platform.

A proposal could be to add a health table linked to the device which would contain:

health:
    # calculated by the platform on device data ingestion
    total_data_points:  # total number of data points
    data_gaps:          # % of data gaps over the whole period, based on the sample interval (to retrieve from hardware info?)
    missing_sensors:    # list of sensors that have been present but no longer are
    # filled from a health topic on MQTT; stored as JSON directly to allow flexibility
    hardware_report:    # JSON sent directly from the hardware
Data gaps / completeness

To be done at ingestion time by ahoy or a similar library. The kit's firmware will post its reading and publication intervals on boot or on config change (TBC), on a /device/<token>/config topic that would fill a per-device config table.

This could also provide a metric for the variability of the posting interval, and raise a flag for a sensor that is not posting data regularly.
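To make the two metrics concrete, here is a minimal sketch that derives a gap fraction and a variability figure from a sorted list of reading timestamps. The function name, the 1.5x tolerance, and the use of the coefficient of variation as the variability metric are all assumptions for illustration, not something already decided.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def gap_stats(timestamps, interval, tolerance=1.5):
    """Return (gap_fraction, cv) for a sorted list of reading timestamps.

    gap_fraction: share of expected readings that are missing.
    cv: coefficient of variation of the time between consecutive readings.
    """
    if len(timestamps) < 2:
        return 0.0, 0.0
    deltas = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    step = interval.total_seconds()
    # A delta well above the interval hides roughly delta/step - 1 readings.
    missing = sum(round(d / step) - 1 for d in deltas if d > tolerance * step)
    gap_fraction = missing / (len(timestamps) + missing)
    avg = mean(deltas)
    cv = pstdev(deltas) / avg if avg > 0 else 0.0
    return gap_fraction, cv
```

A perfectly regular series gives `(0.0, 0.0)`; a flag could be raised when either value exceeds a threshold chosen per device.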

Missing sensors

The kit's firmware will send data normally, and the platform needs to know what to expect. This is currently done through blueprints (kits), but we would like to change that, as discussed in https://github.com/fablabbcn/smartcitizen-api/issues/241. The user would be shown a list of sensors, during onboarding or on the kit edit page (device edit), and could select which sensors to expect and whether a notification should be sent when one of them has not been received after a certain threshold (related to the reading/publication intervals above).

The user could select notifications on this page, and misbehaving sensors could be flagged in the front end:

[image]
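The threshold check described in this section could be sketched as follows. The function name, the `last_seen` mapping, and the default of three missed publications are hypothetical; the real threshold would come from the user's settings and the reported intervals.

```python
from datetime import datetime, timedelta


def missing_sensors(expected, last_seen, now, publish_interval, max_missed=3):
    """Return the expected sensors that have been silent for too long.

    expected: sensor names the user selected as expected.
    last_seen: mapping of sensor name to the time of its last reading.
    """
    threshold = publish_interval * max_missed
    missing = []
    for sensor in expected:
        seen = last_seen.get(sensor)
        # Never seen, or silent for longer than the allowed threshold.
        if seen is None or now - seen > threshold:
            missing.append(sensor)
    return sorted(missing)
```

Sensors returned here are the ones that would trigger a notification, if the user enabled it for them.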

Hardware report

The kit's firmware would post at least these new sensors:

WiFi RSSI: 'String'
rcause: 'String'
sd-card status: 'String'

These shouldn't be shown in the frontend, to avoid confusion, but would support health diagnosis.
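Before storing such a report, the platform could check that the three string fields are present. This is a sketch only: the JSON key names below are assumptions, as the proposal lists the readings but not how they would be keyed.

```python
import json

# Key names are hypothetical; the proposal only names the three readings.
REQUIRED_FIELDS = ("wifi_rssi", "rcause", "sd_card_status")


def validate_hardware_report(raw: str) -> dict:
    """Parse a hardware report and check that the required string fields exist."""
    report = json.loads(raw)
    if not isinstance(report, dict):
        raise ValueError("hardware report must be a JSON object")
    problems = [key for key in REQUIRED_FIELDS
                if not isinstance(report.get(key), str)]
    if problems:
        raise ValueError("missing or non-string fields: " + ", ".join(problems))
    # Extra fields are allowed so the firmware can extend the report later.
    return report
```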

oscgonfer commented 1 year ago

Summary of action points for now:

[ ] Check availability of simple metrics that can be gathered in the RoR application @timcowlishaw
[ ] Assess what needs to be done externally and think of an architecture for triggering that (RPC?)
[ ] Use the current hardware_info table and MQTT topic for prototyping metrics coming directly from the hardware

oscgonfer commented 9 months ago

fablabbcn / smartcitizen-api

Health metrics for device #238

288