Azure / azure-iot-pcs-remote-monitoring-dotnet

Azure IoT .NET solution for Remote Monitoring
MIT License

Alerting and graphing break because the RM infrastructure does not respect iothub-creation-time-utc #149

Closed JetstreamRoySprowl closed 5 years ago

JetstreamRoySprowl commented 5 years ago

All of the telemetry reporting pieces in the remote monitoring system fail to respect the iothub-creation-time-utc value, and insist on assigning message times based only on when they are received by the Hub. That is a great default value if iothub-creation-time-utc has not been supplied, but it is a serious bug otherwise.
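The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the Remote Monitoring code: `effective_timestamp`, the `properties` dict, and the sample dates are hypothetical names and values, and it assumes the property arrives as an ISO 8601 string.

```python
from datetime import datetime, timezone

CREATION_TIME_KEY = "iothub-creation-time-utc"


def effective_timestamp(properties: dict, enqueued_time: datetime) -> datetime:
    """Return the device-supplied creation time if present, else the enqueue time.

    `properties` is a hypothetical dict of message properties; the real
    pipeline's message shape is not shown in this thread.
    """
    raw = properties.get(CREATION_TIME_KEY)
    if not raw:
        # Property not supplied: the enqueue time is the sensible default.
        return enqueued_time
    try:
        # Normalize a trailing "Z", which fromisoformat() rejects before Python 3.11.
        parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        # Unparseable value: fall back rather than drop the message.
        return enqueued_time
    if parsed.tzinfo is None:
        # Assume UTC when the device sends no offset (an assumption of this sketch).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


# Illustrative values only.
enqueued = datetime(2019, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
with_creation = effective_timestamp({CREATION_TIME_KEY: "2019-01-01T06:00:00Z"}, enqueued)
without_creation = effective_timestamp({}, enqueued)
```

Here `with_creation` carries the device's 06:00 timestamp, while `without_creation` keeps the enqueue time, which is the behavior I would expect from every consumer of the telemetry.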

The problem shows up in systems which may find themselves periodically unable to communicate. If such a system is out of communication for 6 hours, it will deliver all its telemetry in a burst. Then both the remote monitoring dashboard and Time Series Insights will display that 6 hours' worth of messages as if they all happened within a few seconds of each other.

Alerting is also corrupted under these conditions because of the incorrectly assessed times.
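To make the effect on windowed rules concrete, here is a rough sketch that buckets the same backlog into 5-minute tumbling windows twice, once by enqueue time and once by creation time. The window size, the reading interval, and the dates are made up for illustration, and this is plain Python rather than the actual alerting query.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)  # hypothetical window size
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its tumbling window."""
    return ts - ((ts - EPOCH) % WINDOW)


def count_per_window(timestamps):
    """Count how many events land in each window."""
    return Counter(window_start(ts) for ts in timestamps)


# Hypothetical outage: one reading every 10 minutes for 6 hours (36 readings),
# all delivered within a few seconds once the link comes back.
outage_start = datetime(2019, 1, 1, 6, 0, tzinfo=timezone.utc)
reconnect = datetime(2019, 1, 1, 12, 0, tzinfo=timezone.utc)
created = [outage_start + timedelta(minutes=10 * i) for i in range(36)]
enqueued = [reconnect + timedelta(milliseconds=100 * i) for i in range(36)]

by_enqueue = count_per_window(enqueued)   # one window holding all 36 readings
by_creation = count_per_window(created)   # 36 windows holding one reading each
```

Keyed on enqueue time, any rule that counts or averages readings per window sees six hours of data as a single spike; keyed on creation time, the same readings spread out as they actually occurred.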

Edit: I forgot, this also breaks the semantics of querying messages by time.

molly-moen commented 5 years ago

@JetstreamRoySprowl We are investigating the alerting portion of this. For graphing, you can configure your TSI instance to use iothub-creation-time-utc by updating the event source's timestamp property name (by default it is the event enqueue time). Follow these instructions to do that: https://docs.microsoft.com/en-us/azure/time-series-insights/time-series-insights-how-to-add-an-event-source-iothub#add-a-new-event-source (instead of adding a new event source, click on the existing IoT hub).

JetstreamRoySprowl commented 5 years ago

Thanks, @molly-moen. We're not expecting to use TSI much, and are modifying the ARM WebUI to fit our needs. I got the graphing there to work by sending time in as a message parameter and modifying the web ui's telemetryModel.js, since that was the fastest way to get things working end-to-end. I've also got a way to dance around the query errors, so I'm not blocked.

molly-moen commented 5 years ago

For ASA, you would need to change this line in the query:

    FROM DeviceTelemetry T PARTITION BY PartitionId TIMESTAMP BY T.EventEnqueuedUtcTime

to use the appropriate time variable. Since we cannot guarantee that people will set the creation time, we likely will not update the query. For TSI, we also cannot update the generic deployment, as we don't know whether people will set this value for their devices.