eclipse-hono / hono

Eclipse Hono™ Project
https://eclipse.dev/hono
Eclipse Public License 2.0

Problem with grafana chart for the 'Hono message details' in case of adapter crash. #1434

Open meshkatul-anwer-bosch opened 5 years ago

meshkatul-anwer-bosch commented 5 years ago

The 'Hono message details' Grafana chart shows wrong load data when an adapter pod crashes. The scenario is easy to reproduce by killing an adapter pod with '--force --grace-period=0'. The chart keeps showing the message rate that the dead pod had before it went down.

mbaeuerle commented 4 years ago

To give more background on this issue: [image] The telemetry load on this HTTP adapter is constantly 50 msg/s, but the chart sometimes shows a multiple of this amount. This could be caused by a calculation that wrongly includes data from pods that are already dead. The query looks like this:

sum(irate(hono_messages_received_seconds_count{status="forwarded",component_name="hono-http",type="telemetry"}[$__range]))

mbaeuerle commented 4 years ago

An update on this: it does not seem to make sense to use the whole displayed time window with [$__range]. A range this large means that much older values are also taken into account in the calculation. Using a range of [1m] fixes the issue. See also https://www.robustperception.io/irate-graphs-are-better-graphs:

If irate only looks at the last two points, why do we pass it a much longer period than that? The answer is that you want to limit how far back it'll look to find those two points, as you don't want to inadvertently use data from hours ago.

The Hono example dashboards often also use [$__range]. Maybe it makes sense to check which of them should be changed to use a fixed value like 1m.
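
To illustrate why the range matters, here is a minimal Python sketch of the effect. It is not Prometheus code. It is a simplified model that assumes irate takes the last two samples of each series inside the lookback window and that a series with fewer than two samples in the window drops out of the sum. The 15 s scrape interval, the timestamps and the names `dead_pod`, `live_pod` and `summed_irate` are made up for illustration; only the 50 msg/s load comes from this issue.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Simplified irate: per-second rate from the last two samples in the window."""
    in_window = [s for s in samples if now - window <= s[0] <= now]
    if len(in_window) < 2:
        return None  # not enough samples: the series yields no result
    (t0, v0), (t1, v1) = in_window[-2], in_window[-1]
    if v1 < v0:
        # Counter reset: treat the last value as the increase since the reset.
        return v1 / (t1 - t0)
    return (v1 - v0) / (t1 - t0)


def summed_irate(series: List[List[Sample]], now: float, window: float) -> float:
    """Model of sum(irate(...[window])) across all series."""
    rates = [irate(s, now, window) for s in series]
    return sum(r for r in rates if r is not None)


# Hypothetical data: 50 msg/s, scraped every 15 s.
# The first pod is killed after t=300; its replacement reports from t=315 on.
dead_pod = [(float(t), 50.0 * t) for t in range(0, 301, 15)]
live_pod = [(float(t), 50.0 * (t - 315)) for t in range(315, 901, 15)]
series = [dead_pod, live_pod]

# Window covering the whole dashboard range (like [$__range] with 15 minutes):
# the dead pod's last two samples are still inside the window.
print(summed_irate(series, now=900.0, window=900.0))  # 100.0

# Short fixed window (like [1m]): the dead pod has no samples left in it.
print(summed_irate(series, now=900.0, window=60.0))  # 50.0
```

In this model the long window reports twice the real load because the dead pod's last rate is still counted, while the one-minute window reports only the live pod. This matches the observation above that replacing [$__range] with [1m] fixes the chart.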