NebraLtd / hm-diag

Helium Miner Diagnostics
https://nebra.io/hnt
MIT License

Upload watchdog events #395

Closed: KevinWassermann94 closed this issue 1 year ago

KevinWassermann94 commented 2 years ago

@KevinWassermann94 commented on Wed Jun 22 2022

The mechanism for copying data into BigQuery should be extensible so that the data is also copied to the Dashboard. Some options: Cloud Storage -> Functions -> PubSub -> BigQuery/Dashboard, OR Storage -> Functions -> BigQuery -> Dashboard, OR Storage -> Function -> Dashboard and separately to BigQuery.
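The first option (Storage -> Function -> PubSub -> BigQuery/Dashboard) can be sketched as a handler that fans a storage notification out to one Pub/Sub-style message per downstream consumer. Everything here is hypothetical (topic names, message shape, the injected `publish` callback standing in for a real Pub/Sub publisher client); it only illustrates the fan-out shape, not the actual implementation.

```python
import json

# Hypothetical downstream topics; real names would come from config.
TOPICS = ("bigquery-ingest", "dashboard-ingest")

def handle_storage_event(event: dict, publish) -> int:
    """Fan a Cloud Storage object-finalize notification out to every
    downstream topic. `publish(topic, payload)` stands in for the real
    publisher client. Returns the number of messages sent."""
    payload = json.dumps({
        "bucket": event["bucket"],
        "object": event["name"],
        "generation": event.get("generation"),
    }).encode("utf-8")
    for topic in TOPICS:
        publish(topic, payload)
    return len(TOPICS)
```

Injecting `publish` keeps the routing logic testable without any cloud dependencies; the real Cloud Function would pass a thin wrapper around the Pub/Sub client.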

Acceptance Criteria:

marvinmarnold commented 2 years ago

Note that @NebraLtd/cloud-team is currently planning to implement diagnostics streaming into the Dashboard as part of https://github.com/NebraLtd/hm-dashboard/issues/1282. Be sure to keep an eye on that and make sure the approach here is consistent (although you still don't need to actually implement streaming as part of this ticket).

pritamghanghas commented 1 year ago

Sample entries from BigQuery: network_watchdog_events.csv

marvinmarnold commented 1 year ago

@pritamghanghas can you make sure you initialize logging, like this: https://github.com/NebraLtd/product-management/blob/20911395b377a86813e7b18e56484ab250ff22e9/analysis/chargify-read-site-stats/main.py#L17
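For reference, a minimal stdlib sketch of the kind of logging initialization being asked for (the linked main.py may additionally attach a Google Cloud Logging handler; the logger name here is hypothetical):

```python
import logging

def init_logging(level: int = logging.INFO) -> logging.Logger:
    """Minimal stdlib logging setup: timestamped records at INFO level.
    A cloud deployment would typically add a Cloud Logging handler too."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # Hypothetical logger name for this script.
    return logging.getLogger("hm-diag.watchdog-upload")

logger = init_logging()
logger.info("logging initialized")
```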

kashifpk commented 1 year ago

@marvinmarnold @KevinWassermann94 @pritamghanghas - once this is done, let's create a follow-up ticket so we can also stream this data through our broker to the dashboard.

KevinWassermann94 commented 1 year ago

@kashifpk This will be the follow up: https://github.com/NebraLtd/hm-dashboard/issues/1338

pritamghanghas commented 1 year ago

@KevinWassermann94 @marvinmarnold I was trying to get the 1-hour heartbeat done, but I have some doubts. These events were supposed to be stored in case of a failure to upload due to network issues or otherwise. How do we plan to use them for uptime if they always arrive?

Currently merged code: network state changes are published and the network is checked every hour.

KevinWassermann94 commented 1 year ago

@pritamghanghas The idea is to have constant monitoring of all parameters, not just in case of a failure. We want to use this data to be able to do closer troubleshooting.

As per the AC, events are required to be uploaded every hour and queued up if the connection is unavailable.
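The "upload hourly, queue while offline" behaviour can be sketched as a small in-memory FIFO that drains whenever connectivity returns. Names and structure here are hypothetical, not the actual hm-diag code; `send` and `is_online` are injected so the sketch is testable without a network.

```python
from collections import deque

class EventUploader:
    """Queue watchdog events while offline; flush them in order once
    the network is back."""

    def __init__(self, send, is_online):
        self._send = send          # callable(event) -> uploads one event
        self._is_online = is_online  # callable() -> bool
        self._queue = deque()

    def record(self, event: dict) -> None:
        """Enqueue an event and immediately try to drain the queue."""
        self._queue.append(event)
        self.flush()

    def flush(self) -> int:
        """Send queued events in FIFO order while online; return count sent."""
        sent = 0
        while self._queue and self._is_online():
            self._send(self._queue.popleft())
            sent += 1
        return sent
```

A real implementation would likely persist the queue to disk so events survive the reboots the watchdog itself triggers.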

pritamghanghas commented 1 year ago

Brief implementation details and a tail of the logs verifying the behaviour are recorded here.

In order to test this feature, disconnect the Ethernet and WiFi from your unit and leave it running for 10 hours. We expect a restart of the network manager after 1 hour, a reboot after 3 hours, and a forced reboot after 9 hours if the system has not rebooted. The event upload logs can be found and verified in the diagnostics container logs, as shown in the document. The events log should show the following sequence being uploaded when the network comes back:

"event_type": "NETWORK_DISCONNECTED", "action_type": "ACTION_NM_RESTART"
"event_type": "NETWORK_DISCONNECTED", "action_type": "ACTION_SYSTEM_REBOOT"
"event_type": "NETWORK_DISCONNECTED", "action_type": "ACTION_SYSTEM_REBOOT_FORCED" (only if the soft reboot is not working, which is an unlikely scenario)
"event_type": "NETWORK_INTERNET_CONNECTED", "action_type": "ACTION_NONE"
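The escalation schedule in that test plan (network-manager restart at 1 hour, reboot at 3 hours, forced reboot at 9 hours) can be expressed as a simple threshold lookup. This is an illustration of the documented timings, not the actual hm-diag watchdog code:

```python
def watchdog_action(hours_disconnected: float) -> str:
    """Map continuous downtime to the escalating recovery action
    described in the test plan above. Thresholds are checked from
    most to least severe."""
    if hours_disconnected >= 9:
        return "ACTION_SYSTEM_REBOOT_FORCED"
    if hours_disconnected >= 3:
        return "ACTION_SYSTEM_REBOOT"
    if hours_disconnected >= 1:
        return "ACTION_NM_RESTART"
    return "ACTION_NONE"
```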

marvinmarnold commented 1 year ago

@pritamghanghas were heartbeats implemented ultimately? Will we continue to get events even if the miner remains online?

marvinmarnold commented 1 year ago

Here are two queries that seem to provide useful info based on this sample data:

-- Avg and max daily uptime across hotspots
SELECT
  day,
  AVG(max_uptime) AS avg_uptime,
  MAX(max_uptime) AS max_uptime
FROM (
  SELECT
    day,
    serial,
    MAX(uptime_hours) AS max_uptime
  FROM (
    SELECT
      DATE(generated_ts) AS day,
      serial,
      uptime_hours
    FROM `nebra-production.hotspot_events_data.events`
    WHERE event_type = 'HEARTBEAT'
  )
  GROUP BY serial, day
)
GROUP BY day

[Screenshot: query results, 2022-08-02 2:14 PM]

-- Num state changes by day
SELECT
  event_type,
  day,
  COUNT(*) AS num_events,
  AVG(uptime_hours) AS avg_uptime
FROM (
  SELECT
    DATE(generated_ts) AS day,
    event_type,
    uptime_hours
  FROM `nebra-production.hotspot_events_data.events`
  WHERE event_type != 'HEARTBEAT'
)
GROUP BY event_type, day

[Screenshot: query results, 2022-08-02 2:11 PM]

KevinWassermann94 commented 1 year ago

The region_override seems to be empty in all logs. I'm currently verifying the failed containers' state.

Could you clarify what packet_errors means? E.g. if packet_errors = 446, what does that number imply? It seems like it is retained over the runtime?

pritamghanghas commented 1 year ago

I checked. The helium_testnet fleet doesn't use region_override; it will be non-empty only if it is set. packet_errors is the total number of WiFi/Ethernet errors reported by the kernel. Yes, this number keeps accumulating until a reboot.
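On Linux the kernel exposes these per-interface error counters under /sys/class/net/&lt;iface&gt;/statistics/. A hedged sketch of how such a total could be computed (illustrative only; how hm-diag actually gathers the number may differ), matching the accumulate-until-reboot behaviour since the sysfs counters themselves reset only on reboot:

```python
from pathlib import Path

def total_packet_errors(sysfs_root: str = "/sys/class/net") -> int:
    """Sum rx_errors + tx_errors across all network interfaces, reading
    the kernel's per-interface counters from sysfs. Counters that cannot
    be read or parsed are skipped."""
    total = 0
    for counter in ("rx_errors", "tx_errors"):
        for path in Path(sysfs_root).glob(f"*/statistics/{counter}"):
            try:
                total += int(path.read_text().strip())
            except (OSError, ValueError):
                continue
    return total
```

The `sysfs_root` parameter exists only so the function can be pointed at a fake directory tree in tests.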

KevinWassermann94 commented 1 year ago

The devices have a local region_override they receive from the Helium miner. It's the frequency plan they operate in.

KevinWassermann94 commented 1 year ago

Failed containers are showing properly. I am happy to merge and fix the region_override separately.