elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.68k stars 8.23k forks source link

[discuss] Provide an easy way to add telemetry for registered http routes #112291

Open lukeelmers opened 3 years ago

lukeelmers commented 3 years ago

We are currently working on a list of best practices for all public REST APIs exposed from Kibana. One recommended best practice is to collect telemetry on the usage of any public API.

We should think about how to make this process as easy as possible for plugin developers. Some options that come to mind (there are probably others):

  1. Register a custom route handler context for usage collection that will automatically increment a counter with the name drawn from the route path. This could then be called manually from inside a route handler.
  2. Leverage an approach similar to the existing core usage stats collection, and handle collecting these stats in core each time an API is called, and then exposing them for kibana_usage_collection plugin to pick up and report.

Other considerations:

cc @elastic/kibana-tech-leads

elasticmachine commented 3 years ago

Pinging @elastic/kibana-core (Team:Core)

pgayvallet commented 3 years ago

In my opinion, the more generic the solution, the better. Ideally, routes would not need any custom / specialized logic regarding usage collection: we could collect data for all opt-in routes the exact same way, and collecting the exact same things (or at least, all customization should be achievable via configuration options).

If this is acceptable, then we could just handle everything from inside core, by just allowing route owners to add the appropriate options to the route configuration.

We could imagine something like

  router.post(
    {
      path: '/api/hello/dolly',
      options: {
        authRequired: 'optional',
        telemetry: {
          enabled: true,
          someCustomOption: 'maybe',
        },
      },
      validate: { ... },
    },
    async (ctx, req, res) => {
      // no change required in the handler.
      return res.ok({});
    }
  );

Should we be differentiating between KibanaRequest and requests originating outside of Kibana?

You mean requests fired from inside the app versus from outside the app right (technically everything is a KibanaRequest on the server)? There isn't really a reliable way to dissociate them (the best we're doing atm is using the headers to guess if the request is coming from the app, but it can easily be forged). Ideally, we wouldn't dissociate them, but if we do, we need to determine exactly how we want to do it.

mshustov commented 3 years ago

(Especially in option 2: ideally we wouldn't need to update Core every time a new route is added)

I ran Kibana locally and can see 600 messages registering route handler for [... Do we really want to declare all of them manually? But the following question is

Should we be differentiating between KibanaRequest and requests originating outside of Kibana?

I'd say we are interested in the telemetry for public API routes only. Aren't we? Unfortunately, Kibana doesn't provide a standard mechanism to differentiate between public and internal routes (just a few routes have /internal prefix in the path). Until that, we can track stats for routes prefixed with /api (402 endpoints).

Should we be differentiating between KibanaRequests made in default vs non-default spaces?

Why might it be necessary? It's the same entity in the public contract.

How will we handle updating schemas?

Do you mean http --> https or validation schemas?

by just allowing route owners to add the appropriate options to the route configuration.

Since we haven't heard any requests to add this functionality, I'm erring on the side of minimizing the API surface until this functionality is requested.

lukeelmers commented 3 years ago

You mean requests fired from inside the app versus from outside the app right (technically everything is a KibanaRequest on the server)?

Yes, I mean understanding requests that are originating from Kibana itself, vs being called externally. I know our method for detecting these via headers isn't bulletproof, but IMHO this is an important piece of data when we are considering making changes to our services... even if it isn't 100% accurate. (We are already doing this for some counters in the CoreUsageStatsClient)

Should we be differentiating between KibanaRequests made in default vs non-default spaces?

Why might it be necessary? It's the same entity in the public contract.

I only suggested this as we are already doing it in CoreUsageStatsClient. But personally I don't think this is as useful/important and understanding if it is a KibanaRequest.

Do you mean http --> https or validation schemas?

Ah, sorry to be vague. I am talking about usage collection / telemetry schemas here.

afharo commented 2 years ago

I'd vote for a generic approach that catches all. Re the schema, I'd rather report it in an array-based way vs. a tree model (final fields to be decided):

// Array model
[
  { method: 'POST', path: '/api/hello/dolly', count: 20, authed: true, daily_timestamp, ... },
  { method: 'POST', path: '/api/login', count: 20, authed: true, daily_timestamp, ... },,
  ...
]

// Tree model
{
  "POST /api/hello/dolly": {
    authed: 20,
    unauthed: 0,
  },
  "POST /api/login": {
    authed: 20,
    unauthed: 10,
  },
}

Why using the array model over the tree one?

Q: how are we going to solve actual values that may incur PII issues (i.e.: PUT /api/user/my_username)? Can we identify that my_username is an actual var in the registered route path: '/api/user/{id}?


Alternative: use APM! It already provides the traces (so we can count how many times each route is hit) and the masking option to report /api/user/{*}

pgayvallet commented 2 years ago

Can we identify that my_username is an actual var in the registered route path: '/api/user/{id}?

In the handler, req.route.path seems to be the resolved path (e.g /api/user/my_username), but given that the data collection will be performed at the router level (or at least in core), providing the unresolved RouteConfig.path (e.g /api/user/{id}) to the collection hook should not be an issue imho.

use APM! It already provides the traces (so we can count how many times each route is hit) and the masking option to report /api/user/{*}

How does this masking option work exactly?

Also, that means it will be required to have the customer enable APM to be able to collect this data, not sure we're fine with this?

rudolf commented 2 years ago

Some API endpoints need to be able to scale to thousands of requests per minute (like apm agent configuration). If these trigger an additional PUT to a ui counter document on every request it adds a lot of extra load to Elasticsearch. If we provide this instrumentation we should change UI Counters to be lazy and e.g. only persist updated counts every 5 minutes.

afharo commented 2 years ago

How does this masking option work exactly?

Not entirely sure, to be fair... After looking at our APM traces, it looks like such masking does not happen when reporting, but it does apply them when analysing:

image

Also, that means it will be required to have the customer enable APM to be able to collect this data, not sure we're fine with this?

This is a constant blocker question we get to rule out the option to enable APM on customers' instances. However, we're discussing implementing Event Telemetry (and I have a gut feeling that this HTTP routes usage will end up using that feature). So, IMO, we are reinventing the wheel here.

I might be missing some points though, since I've been out for quite a while... cc @thesmallestduck

If we provide this instrumentation we should change UI Counters to be lazy and e.g. only persist updated counts every 5 minutes.

AFAIK, both, Usage Counters and UI Counters bulk upload on a timer-based approach (at least the last time I checked). Can you confirm @Bamieh?

Bamieh commented 2 years ago

@afharo yes that is indeed the case. UI/Usage counters are buffered in memory then flushed into ES. Batching is done every 5 seconds and can be configured under the kibana.yml config usageCounters.bufferDuration.

pgayvallet commented 2 years ago

However, we're discussing implementing Event Telemetry (and I have a gut feeling that this HTTP routes usage will end up using that feature). So, IMO, we are reinventing the wheel here.

Fair point. It may be worth waiting until we have a better understanding of what Event Telemetry will be before digging further on this feature request then?

afharo commented 2 years ago

FYI: I noticed APM is using Server Counters to report this. It sounds like a valid approach to me, and we could do something similar in the Core router: https://github.com/elastic/kibana/blob/7f5e6c3d9f26ae05d7f98ccd0b516def88e2d621/x-pack/plugins/apm/server/routes/apm_routes/register_apm_server_routes.ts#L149-L154

What do you think?

pgayvallet commented 2 years ago

It could be a very good quick win in my opinion.

We need to be careful of the implementation of server counters though. We'll pooling the updates right? else an ES request on every endpoint hit could be a performance issue.

afharo commented 2 years ago

We'll pooling the updates right? else an ES request on every endpoint hit could be a performance issue.

Yes, we batch the increments: https://github.com/elastic/kibana/blob/12b245c4e50555c227d569bc54bc64118bd15558/src/plugins/usage_collection/server/usage_counters/usage_counters_service.ts#L108-L128

The only problem with Usage Counters is that the receiving end is not ready yet to index them for easy consumption. Although we may be able to take a shortcut similar to what we've done in the past for other extended indices.

afharo commented 1 month ago

Do you all think that we can close this issue once we merge https://github.com/elastic/kibana/pull/196081? That PR focuses on deprecated APIs. Do we need to track the usage of non-deprecated ones?