elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Telemetry] Custom logger appender? #89839

Open · afharo opened this issue 3 years ago

afharo commented 3 years ago

Follow up to #89588.

We are constantly switching warn/error logs in the usage-collection/telemetry plugins to debug to avoid creating noise for our end-users. It's an easy way to silence those errors. However, it feels to me like debug-level errors make it harder to identify any issues we may introduce (they are not noisy enough).

How about we configure the loggers for the UsageCollection & Telemetry plugins to be silent when in production mode, but properly warn when in dev mode?
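
A minimal sketch of what that could look like inside one of these plugins, assuming the usual Kibana plugin setup shape (the import path is indicative, and `reportCollectionFailure` is an illustrative helper rather than the actual implementation):

```ts
import type { PluginInitializerContext, Logger } from 'src/core/server';

export class UsageCollectionPlugin {
  private readonly logger: Logger;
  private readonly isDev: boolean;

  constructor(initContext: PluginInitializerContext) {
    this.logger = initContext.logger.get();
    // `env.mode.dev` is true when Kibana runs in development mode.
    this.isDev = initContext.env.mode.dev;
  }

  // Illustrative helper: recoverable collection failures are surfaced as
  // warnings while developing, but stay at debug (effectively silent) for
  // end-users running in production.
  private reportCollectionFailure(error: Error) {
    if (this.isDev) {
      this.logger.warn(`Usage collection failed: ${error.message}`);
    } else {
      this.logger.debug(`Usage collection failed: ${error.message}`);
    }
  }
}
```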

elasticmachine commented 3 years ago

Pinging @elastic/kibana-core (Team:Core)

Bamieh commented 3 years ago

+1 on this. We can add an explicit flag to the configs and default it to `true` when `env.dev` is set.
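
A hedged sketch of what such a flag could look like with `@kbn/config-schema`; the `logErrors` key name is hypothetical, and it assumes the `dev` validation context that Kibana exposes to config schemas:

```ts
import { schema, TypeOf } from '@kbn/config-schema';

export const configSchema = schema.object({
  // Hypothetical flag: whether collection problems are written to the logs.
  // The default depends on whether Kibana runs in dev mode.
  logErrors: schema.conditional(
    schema.contextRef('dev'),
    true,
    schema.boolean({ defaultValue: true }), // dev: log by default
    schema.boolean({ defaultValue: false }) // prod: silent by default
  ),
});

export type TelemetryConfigType = TypeOf<typeof configSchema>;
```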

rudolf commented 3 years ago

+1

I think we should aim to explicitly ignore expected errors though. E.g. reading a saved object can time out when there's network instability; if we know this call will be repeated at some point in the future, we should just always ignore this error and not print any logs. Not even a developer needs to see that it happened: it's part of the expected behaviour of the plugin.

Then, for the scenarios we're not sure we're handling correctly, or that we don't know if we can recover from, we can log in development.
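
A rough sketch of that split, with a hypothetical `isExpectedTransientError` classifier and `readTelemetrySettings` helper (the real checks would depend on the error types the collectors actually see):

```ts
import type { Logger, SavedObjectsClientContract } from 'src/core/server';

// Hypothetical classifier for errors that are part of normal operation,
// e.g. a timeout while reading a saved object during network instability.
const isExpectedTransientError = (error: unknown): boolean =>
  error instanceof Error && /timeout|unavailable/i.test(error.message);

export async function readTelemetrySettings(
  soClient: SavedObjectsClientContract,
  logger: Logger,
  isDev: boolean
) {
  try {
    return await soClient.get('telemetry', 'telemetry');
  } catch (error) {
    if (isExpectedTransientError(error)) {
      // Expected and retried on the next collection cycle: no log at all,
      // not even for developers.
      return undefined;
    }
    if (isDev) {
      // We're not sure we handle this correctly, so surface it in dev only.
      logger.warn(`Unexpected error reading telemetry settings: ${error}`);
    }
    return undefined;
  }
}
```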

LeeDr commented 3 years ago

-1 wait, why would we have network instability when running Elasticsearch and Kibana on the same host?

Bamieh commented 3 years ago

@LeeDr Elasticsearch can be non-responsive for multiple reasons even on the same host: out of memory/disk space, the node is restarting, the Kibana index is locked or misconfigured, etc.

Our telemetry collectors are usually the first to scream when Kibana is unable to reach ES for any reason. This usually means that customers open tickets claiming that telemetry is causing Kibana to fail, although telemetry was merely the first plugin to log an error in the server logs.

If we silence the telemetry/collection logs in production, or put them behind a config, we'd avoid these situations and help users identify the real cause of the issue.

I'm assuming users would try to disable telemetry as a first step to debug such cases even though the root cause is unrelated to telemetry, so we'd also miss out on receiving usage from those clusters.

afharo commented 2 years ago

For implementation details: the POC https://github.com/elastic/kibana/pull/95960 already had it implemented. I think the effort would be to:

  1. Copy that piece of logic from the POC to the telemetry and usageCollection plugins.
  2. Review the current .debug logs and set them to their appropriate level.

afharo commented 2 years ago

@pjhampton added an important point: we should also enable these logs by default on Cloud, so we can catch any potential bugs in those controlled environments.