ava-innersource / Liquid-Application-Framework-1.0-deprecated

Liquid is a framework to speed up the development of microservices
MIT License

AppInsights.HealthCheck pollutes the log? #188

Closed · bruno-brant closed this issue 4 years ago

bruno-brant commented 4 years ago

The log is polluted by the health check: the probe itself tracks a telemetry event on every call...

https://github.com/Avanade/Liquid-Application-Framework/blob/fc9edb95da074b98df7a75479633471f230b776d/src/Liquid.OnAzure/Telemetry/AppInsights.cs#L114

bruno-brant commented 4 years ago

Cool fact: since it isn't async, it's the only health check so far that actually works. In this case, though, that also creates a problem.

guilhermeluizsp commented 4 years ago

Wow... I doubt this health check even works correctly.

Events generated by the TelemetryClient are sent to the registered telemetry channel, which by default pushes them onto an internal queue for batch processing and retry logic. The only exceptions that can occur at the time this method (TrackEvent) is called are those thrown by the telemetry initializers or the processor pipeline.

Looking at this code, I assume the developer wanted to check whether the connection to the Application Insights backend service is still alive, but that's certainly not how TrackEvent works.
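
To illustrate, here is a minimal sketch of the pattern (the method name, return type and property names are assumptions for illustration, not the actual Liquid source linked above): wrapping TrackEvent in a try/catch can only catch initializer or processor exceptions, never a transport failure.

```csharp
using System.Collections.Generic;
using Microsoft.ApplicationInsights;

public class AppInsightsHealthCheckSketch
{
    private readonly TelemetryClient _telemetryClient;

    public AppInsightsHealthCheckSketch(TelemetryClient telemetryClient)
    {
        _telemetryClient = telemetryClient;
    }

    // Hypothetical shape of the check: TrackEvent only enqueues the event on the
    // channel's in-memory buffer and returns immediately, so the catch block only
    // sees exceptions thrown by telemetry initializers or processors -
    // never a failure to reach the Application Insights backend.
    public bool CheckHealth()
    {
        try
        {
            _telemetryClient.TrackEvent("HealthCheck",
                new Dictionary<string, string> { { "probe", "liveness" } });
            return true;  // almost always reached, even when the backend is unreachable
        }
        catch
        {
            return false; // initializer/processor failures only
        }
    }
}
```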

Perhaps we could use event listeners to catch failures and even generate better contextual information (did the SDK fail to send the message? Why? Is it because the internal buffer is at capacity or growing too fast? Is it too big? Are we retrying too much? Maybe it's a network issue, and so on...).
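
A rough sketch of the event-listener idea, assuming the SDK's self-diagnostic EventSource names start with "Microsoft-ApplicationInsights-" (that prefix and the severity filter are assumptions to be verified against the SDK version in use):

```csharp
using System.Diagnostics.Tracing;
using System.Threading;

// Listens to the Application Insights SDK's own diagnostic events so that
// transmission failures, full buffers and retries can feed a health signal.
public sealed class AppInsightsDiagnosticsListener : EventListener
{
    private int _recentFailures;

    public int RecentFailures => Volatile.Read(ref _recentFailures);

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Assumption: the SDK's self-diagnostic sources share this name prefix.
        if (eventSource.Name.StartsWith("Microsoft-ApplicationInsights-"))
        {
            EnableEvents(eventSource, EventLevel.Warning);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Warning or worse from the channel usually means a failed transmission,
        // a full buffer or excessive retries - the contextual information the
        // current health check never sees.
        if (eventData.Level <= EventLevel.Warning)
        {
            Interlocked.Increment(ref _recentFailures);
        }
    }
}
```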

bruno-brant commented 4 years ago

Another important thing to note: do I really want to restart my pod because telemetry is down? Is it really mission critical? Imagine that an infrastructure change mistakenly left AppInsights disabled or something... would that necessarily have to stop the processing of every microservice in the organization?
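
One possible direction, sketched with the standard Microsoft.Extensions.Diagnostics.HealthChecks abstraction rather than Liquid's own (purely illustrative, and reusing the listener sketched above): report telemetry trouble as Degraded instead of Unhealthy, so a liveness probe has no reason to restart the pod over a non-critical dependency.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Illustrative only: telemetry failures degrade the service's reported health
// but never mark it Unhealthy, keeping the orchestrator from killing the pod.
public class TelemetryHealthCheck : IHealthCheck
{
    private readonly AppInsightsDiagnosticsListener _listener;

    public TelemetryHealthCheck(AppInsightsDiagnosticsListener listener)
    {
        _listener = listener;
    }

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        return Task.FromResult(_listener.RecentFailures > 0
            ? HealthCheckResult.Degraded("Application Insights is reporting transmission problems.")
            : HealthCheckResult.Healthy());
    }
}
```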

bruno-brant commented 4 years ago

> Perhaps we could use event listeners to catch failures and even generate better contextual information (did the SDK fail to send the message? Why? Is it because the internal buffer is at capacity or growing too fast? Is it too big? Are we retrying too much? Maybe it's a network issue, and so on...).

I think we could apply this kind of health checking to every service: create counters that are monitored individually per service, with thresholds that, if exceeded, would result in a negative health status (see the sketch below).

However, we must take the cascading of failures into consideration (see #194).
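
A rough sketch of the counter-plus-threshold idea (names and the threshold are illustrative, not part of Liquid):

```csharp
using System.Threading;

// Sketch: a per-service failure counter with a threshold that, once exceeded,
// flips the service's health to negative.
public class ServiceFailureCounter
{
    private readonly int _threshold;
    private int _failures;

    public ServiceFailureCounter(int threshold) => _threshold = threshold;

    public void RecordFailure() => Interlocked.Increment(ref _failures);

    public void RecordSuccess() => Interlocked.Exchange(ref _failures, 0);

    // Health only turns negative after repeated failures, which also dampens the
    // kind of cascading restarts discussed in #194.
    public bool IsHealthy => Volatile.Read(ref _failures) <= _threshold;
}
```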