Closed KevinWassermann94 closed 1 year ago
Note that @NebraLtd/cloud-team is currently planning to implement diagnostics streaming into the Dashboard as part of https://github.com/NebraLtd/hm-dashboard/issues/1282. Be sure to keep an eye on that and make sure the approach here is consistent (although you still don't need to actually implement streaming as part of this ticket).
Sample entries from BigQuery: network_watchdog_events.csv
@pritamghanghas can you make sure you initialize logging, like this: https://github.com/NebraLtd/product-management/blob/20911395b377a86813e7b18e56484ab250ff22e9/analysis/chargify-read-site-stats/main.py#L17
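A minimal sketch of the kind of logging initialization being asked for, modeled loosely on the linked main.py (the exact format string and level used there may differ):

```python
import logging


def init_logging(level=logging.DEBUG):
    """Configure root logging once, early in main(), and return a logger.

    The format and level here are illustrative assumptions, not the exact
    values from the linked file.
    """
    logging.basicConfig(
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        level=level,
    )
    return logging.getLogger(__name__)
```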
@marvinmarnold @KevinWassermann94 @pritamghanghas - once this is done, let's create a follow-up ticket so we can also stream this data through our broker to the dashboard.
@kashifpk This will be the follow up: https://github.com/NebraLtd/hm-dashboard/issues/1338
@KevinWassermann94 @marvinmarnold I was trying to get the 1-hour heartbeat done, but I have some doubts. These events were supposed to be stored in case of a failure to upload due to network issues or otherwise. How do we plan to use them for uptime if they always arrive?
Currently merged code: network state changes are published, and the network is checked every hour.
@pritamghanghas The idea is to have constant monitoring of all parameters, not just in case of a failure. We want to use this data to do closer troubleshooting.
As per the AC, events are required to be uploaded every hour and queued up if the connection is unavailable.
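The queue-then-upload behavior described in the AC can be sketched as follows; this is an illustrative outline, not the shipped implementation, and `EventQueue`/`upload` are hypothetical names:

```python
import collections
import json


class EventQueue:
    """Queue events locally; flush them in order once connectivity returns."""

    def __init__(self):
        self._pending = collections.deque()

    def record(self, event: dict):
        self._pending.append(event)

    def flush(self, upload, online: bool) -> int:
        """Try to upload queued events oldest-first.

        `upload` is a hypothetical callable returning True on success.
        On failure (or while offline) the remaining events stay queued.
        """
        sent = 0
        while online and self._pending:
            if not upload(json.dumps(self._pending[0])):
                break
            self._pending.popleft()
            sent += 1
        return sent
```

Events recorded while the connection is down are simply retained and sent on the next successful flush.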
Brief implementation details, and a tail of the logs verifying them, are recorded here.
To test this feature, disconnect the Ethernet and Wi-Fi from your unit and leave it running for 10 hours. We expect a restart of the network manager after 1 hour, a reboot after 3 hours, and, if the system is not rebooting, a forced reboot after 9 hours. The event upload logs can be found and verified in the diagnostics container logs, as shown in the document. The event log should show the following sequence being uploaded when the network comes back:
- "event_type": "NETWORK_DISCONNECTED", "action_type": "ACTION_NM_RESTART"
- "event_type": "NETWORK_DISCONNECTED", "action_type": "ACTION_SYSTEM_REBOOT"
- "event_type": "NETWORK_DISCONNECTED", "action_type": "ACTION_SYSTEM_REBOOT_FORCED" (only if the soft reboot is not working, which is an unlikely scenario)
- "event_type": "NETWORK_INTERNET_CONNECTED", "action_type": "ACTION_NONE"
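The escalation schedule above (network-manager restart at 1 hour, reboot at 3, forced reboot at 9) can be summarized as a small decision function; the thresholds and action names mirror the test description, but this is a sketch, not the actual watchdog code:

```python
# Assumed thresholds, taken from the test plan in this thread.
NM_RESTART_HOURS = 1
REBOOT_HOURS = 3
FORCE_REBOOT_HOURS = 9


def choose_action(hours_disconnected: float) -> str:
    """Return the escalating recovery action for a given outage duration."""
    if hours_disconnected >= FORCE_REBOOT_HOURS:
        return "ACTION_SYSTEM_REBOOT_FORCED"
    if hours_disconnected >= REBOOT_HOURS:
        return "ACTION_SYSTEM_REBOOT"
    if hours_disconnected >= NM_RESTART_HOURS:
        return "ACTION_NM_RESTART"
    return "ACTION_NONE"
```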
@pritamghanghas were heartbeats ultimately implemented? Will we continue to get events even if the miner remains online?
Here are two queries that seem to provide useful info based on this sample data:
-- Avg and max uptime by day
SELECT
day,
avg(max_uptime) as avg_uptime,
max(max_uptime) as max_uptime
FROM (
SELECT
day,
serial,
max(uptime_hours) as max_uptime
FROM (
SELECT
date(generated_ts) as day,
serial,
uptime_hours
FROM `nebra-production.hotspot_events_data.events`
where event_type = 'HEARTBEAT'
)
group by serial, day
)
group by day
-- Num state changes by day
SELECT
event_type,
day,
count(*) as num_events,
avg(uptime_hours) as avg_uptime
FROM (
SELECT
date(generated_ts) as day,
event_type,
uptime_hours
FROM `nebra-production.hotspot_events_data.events`
where event_type != 'HEARTBEAT'
)
group by event_type, day
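For quick local checks against the network_watchdog_events.csv sample, the first query's aggregation can be mirrored in plain Python. The row fields (`generated_ts`, `serial`, `uptime_hours`, `event_type`) are assumed to match the query above:

```python
from collections import defaultdict
from statistics import mean


def daily_uptime(rows):
    """Per day: each serial's max HEARTBEAT uptime, then avg/max across serials.

    `rows` is an iterable of dicts, e.g. from csv.DictReader over the
    sample CSV attached to this thread.
    """
    per_serial = defaultdict(dict)  # day -> {serial: max uptime}
    for row in rows:
        if row["event_type"] != "HEARTBEAT":
            continue
        day = row["generated_ts"][:10]  # date part of an ISO timestamp
        up = float(row["uptime_hours"])
        per_serial[day][row["serial"]] = max(
            per_serial[day].get(row["serial"], 0.0), up
        )
    return {
        day: {"avg_uptime": mean(v.values()), "max_uptime": max(v.values())}
        for day, v in per_serial.items()
    }
```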
The region_override field seems to be empty in all logs. I'm currently verifying the failed containers' state.
Could you clarify what packet_errors means? E.g. if packet_errors = 446, what does that number imply? It seems like it is retained over runtime?
I checked: the helium_testnet fleet doesn't use region_override; it will be non-empty only if it is explicitly set. packet_errors is the total number of Wi-Fi/Ethernet errors reported by the kernel, and yes, this number keeps accumulating until a reboot.
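Since packet_errors is cumulative until reboot, a per-interval error count can be derived by differencing consecutive readings and treating any decrease as a counter reset after a reboot. This is a hypothetical helper for analysis, not part of the shipped code:

```python
def packet_error_deltas(samples):
    """Convert cumulative packet_errors readings into per-interval counts.

    A reading lower than its predecessor is assumed to mean the counter
    reset (device rebooted), so that reading itself is the interval's count.
    The first sample is likewise taken at face value.
    """
    deltas = []
    prev = None
    for value in samples:
        if prev is None or value < prev:
            deltas.append(value)  # first sample or post-reboot reset
        else:
            deltas.append(value - prev)
        prev = value
    return deltas
```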
The devices have a local region_override they receive from the Helium miner. It's the frequency plan (region) they operate in.
Failed containers are showing properly. I am happy to merge and fix the region_override separately.
@KevinWassermann94 commented on Wed Jun 22 2022
The mechanism for copying data into BigQuery should be extensible to also copy the data to the Dashboard. Some options:
- Cloud Storage -> Functions -> Pub/Sub -> BigQuery/Dashboard
- Storage -> Functions -> BigQuery -> Dashboard
- Storage -> Function -> Dashboard, and separately to BigQuery
Acceptance Criteria: