elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Streamlining telemetry within Observability #36895

Closed makwarth closed 5 years ago

makwarth commented 5 years ago

Motivation for this issue

We currently track Observability telemetry in various ways across the solution UIs, which makes the data hard to compare across solutions. This issue proposes to streamline the Observability telemetry.

How telemetry is currently implemented

Logs UI

Implemented in 6.5. Every click on the Logs UI app will initiate a request to fetch data for the active time period. Each request will increment the telemetry event counter regardless of the request response. If there has ever been more than 5 telemetry events within 24 hours in a given month per unique cluster, the cluster will get included in the telemetry count in the above table.

What the data looks like today:

stack_stats.kibana.plugins.infraops.last_24_hours.hits.logs: <int>

Infra UI

Implemented in 6.5. Every click on the Infra UI app will initiate a request to fetch data for the active time period. Each request will increment the telemetry event counter regardless of the request response. If there has ever been more than 5 telemetry events within 24 hours in a given month per unique cluster, the cluster will get included in the telemetry count in the above table. There's a telemetry event for hosts, docker and kubernetes. As long as any of them has more than 5, the cluster is included in the count.

What the data looks like today:

stack_stats.kibana.plugins.infraops.last_24_hours.hits.hosts: <int>
stack_stats.kibana.plugins.infraops.last_24_hours.hits.docker: <int>
stack_stats.kibana.plugins.infraops.last_24_hours.hits.kubernetes: <int>
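
As a rough illustration of the inclusion rule described for the Logs UI and Infra UI, here is a minimal sketch. The `InfraopsHits` interface and the function names are hypothetical and only mirror the field names listed in this issue; the threshold of "more than 5" comes from the description above. This is not the actual aggregation code.

```typescript
// Hypothetical shape mirroring the infraops "hits" fields listed in this issue.
interface InfraopsHits {
  logs: number;
  hosts: number;
  docker: number;
  kubernetes: number;
}

// Threshold from the description: "more than 5" events within 24 hours.
const MIN_EVENTS = 5;

// Logs UI: a daily report qualifies when the logs counter exceeds the threshold.
function countsAsLogsCluster(hits: InfraopsHits): boolean {
  return hits.logs > MIN_EVENTS;
}

// Infra UI: a daily report qualifies when any one of hosts, docker,
// or kubernetes exceeds the threshold.
function countsAsInfraCluster(hits: InfraopsHits): boolean {
  return [hits.hosts, hits.docker, hits.kubernetes].some((n) => n > MIN_EVENTS);
}

// A cluster is included for the month if any single daily report qualifies.
function includedInMonth(
  dailyReports: InfraopsHits[],
  rule: (hits: InfraopsHits) => boolean
): boolean {
  return dailyReports.some(rule);
}
```

Note that a single busy day is enough to include a cluster for the whole month, which is part of why the raw counts say little about sustained usage.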

APM UI

Implemented in 6.6. Every visit to the Services list page is monitored by telemetry. If there are any services in the list within the past 24 hours, the telemetry event will be set to true. This means only clusters with installed agents will be included in the APM telemetry count.

stack_stats.kibana.plugins.apm.has_any_services: <boolean>
stack_stats.kibana.plugins.apm.services_per_agent.java: <int>
...
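
For comparison with the counter-based approach, a minimal sketch of how the two APM fields could be derived from a list of services. The `ServiceSummary` input type and `buildApmTelemetry` are hypothetical, and the sketch assumes `services_per_agent` is a count of services per agent name, which this issue does not spell out.

```typescript
// Hypothetical input: one entry per service seen in the past 24 hours.
interface ServiceSummary {
  serviceName: string;
  agentName: string; // e.g. "java"
}

// Mirrors the two APM field groups listed in this issue.
interface ApmTelemetry {
  has_any_services: boolean;
  services_per_agent: Record<string, number>;
}

function buildApmTelemetry(services: ServiceSummary[]): ApmTelemetry {
  // Assumption: services_per_agent counts services grouped by agent name.
  const servicesPerAgent: Record<string, number> = {};
  for (const service of services) {
    servicesPerAgent[service.agentName] =
      (servicesPerAgent[service.agentName] || 0) + 1;
  }
  return {
    has_any_services: services.length > 0,
    services_per_agent: servicesPerAgent,
  };
}
```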

Uptime UI

None yet. Scheduled for ~7.2~ 7.3. PR: https://github.com/elastic/kibana/pull/34437

Streamlined implementation

Here are some areas for improvement and streamlining:

  1. The telemetry data is sent up once per day per Kibana instance. That's probably why the telemetry data time range is 24 hours. However, for any data sent up on e.g. a Monday, we'd get a bunch of zeros from the weekend. Looking at a month, that's not a problem, but if we want to go more granular, it is. Therefore I propose we change the time range from 24 hours to 1 week.

  2. The telemetry (besides APM) doesn't take into account the response of the queries. I don't think it's very useful to see the count of queries performed in <plugin>, as it doesn't say much about actual adoption. For example, a team could be clicking the Logs plugin multiple times during a day without actually using the product. I propose we look at the query response before deciding to increment the counter. The telemetry counter should increment only if there's actual data (hosts, logs, etc.) in the response. This will give us telemetry data for users who definitely consumed real data in our products. We can use this same data to see if they continue to consume data in the product going forward (is the product valuable to them or not?). Later, we can add the new event tracking as well, so that we can tell if users are using core functionality of the products.

  3. The Infra and Logs telemetry is bundled as "infraops" (for legacy reasons). It'd be nice to separate it out as "logs" and "infra", as "infraops" is a bit confusing, especially to newcomers.

  4. It'd be nice to streamline the naming, e.g. stack_stats.kibana.plugins.logs.past_week.hits. "Hits" isn't very explicit, but since it's different per solution, we might just want to go with a common name, like "hits". We'd have to document what exactly "hits" means per product.
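
Points 1 and 2 above can be sketched together as a counter that only records requests whose response contained data and reports over a rolling 7-day window. `WeeklyHitCounter`, `record`, and `pastWeek` are hypothetical names for illustration, and the in-memory timestamp list is an assumption; real telemetry would need persistent storage.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;
const WEEK_MS = 7 * DAY_MS;

class WeeklyHitCounter {
  // Timestamps (ms since epoch) of requests whose response contained data.
  private hits: number[] = [];

  // Record a hit only when the response actually returned results
  // (hosts, log entries, etc.); empty responses are ignored.
  record(resultCount: number, now: number): void {
    if (resultCount > 0) {
      this.hits.push(now);
    }
  }

  // Number of recorded hits in the 7 days leading up to `now`.
  pastWeek(now: number): number {
    return this.hits.filter((t) => t <= now && now - t < WEEK_MS).length;
  }
}
```

With a weekly window, a report sent on a Monday still reflects activity from the previous Friday instead of showing only the weekend's zeros.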

Proposal

What the updated telemetry could look like:

stack_stats.kibana.plugins.logs.past_week.hits: <int>
stack_stats.kibana.plugins.infra.past_week.hits: <int>
stack_stats.kibana.plugins.infra.past_week.hosts.hits: <int>
stack_stats.kibana.plugins.infra.past_week.docker.hits: <int>
stack_stats.kibana.plugins.infra.past_week.kubernetes.hits: <int>
stack_stats.kibana.plugins.infra.past_week.metricsexplorer.hits: <int>
stack_stats.kibana.plugins.apm.past_week.hits: <int>
stack_stats.kibana.plugins.apm.past_week.services.hits: <int>
stack_stats.kibana.plugins.apm.past_week.services.java.hits: <int>
...
stack_stats.kibana.plugins.apm.past_week.uptime.hits: <int>
stack_stats.kibana.plugins.apm.past_week.uptime.monitors.hits: <int>
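
To show how a nested, per-plugin structure maps onto dotted field names like the ones above, here is a small sketch. The `flattenTelemetry` helper and the `Nested` type are hypothetical, and the numbers in the example are made up for illustration.

```typescript
// Hypothetical nested telemetry object: leaves are counters.
type Nested = { [key: string]: number | Nested };

// Flatten a nested object into dotted field names under a given prefix.
function flattenTelemetry(obj: Nested, prefix: string): Record<string, number> {
  const out: Record<string, number> = {};
  for (const key of Object.keys(obj)) {
    const value = obj[key];
    const path = prefix + "." + key;
    if (typeof value === "number") {
      out[path] = value;
    } else {
      const nested = flattenTelemetry(value, path);
      for (const nestedKey of Object.keys(nested)) {
        out[nestedKey] = nested[nestedKey];
      }
    }
  }
  return out;
}

// Illustrative values only; not real telemetry.
const flattened = flattenTelemetry(
  { infra: { past_week: { hits: 9, hosts: { hits: 4 }, docker: { hits: 5 } } } },
  "stack_stats.kibana.plugins"
);
```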

It'd be great to get this out in 7.3, as that's when Uptime adds telemetry.

elasticmachine commented 5 years ago

Pinging @elastic/apm-ui

elasticmachine commented 5 years ago

Pinging @elastic/infra-logs-ui

elasticmachine commented 5 years ago

Pinging @elastic/uptime

skh commented 5 years ago

The Infra and Logs telemetry is bundled as "infraops" (for legacy reasons). It'd be nice to separate it out as "logs" and "infra", as "infraops" is a bit confusing, especially to newcomers.

Agreed. This depends on us splitting the current plugin into two (see https://github.com/elastic/kibana/issues/36680 ), as with telemetry we can't report outside of our namespace, and that's currently infraops.

justinkambic commented 5 years ago

I think initially for Uptime we'd define two fields (which is what our current PR does):

stack_stats.kibana.plugins.apm.past_week.uptime.monitors.hits: <int>
stack_stats.kibana.plugins.apm.past_week.uptime.monitors.detail.hits: <int>

The only change will be the field name used and modifying the tracking logic to line up with the proposed improvements on this issue.

skh commented 5 years ago

As https://github.com/elastic/kibana/issues/36680 has been put on hold (we won't split into separate InfraUI and LogsUI plugins), we're stuck with the infraops namespace for both for the time being. Switching from a last_24_hours to a past_week interval is not affected by this.

jasonrhodes commented 5 years ago

This issue is now replaced by the implementation issue #39507