Open lukeelmers opened 3 years ago
Pinging @elastic/kibana-core (Team:Core)
In my opinion, the more generic the solution, the better. Ideally, routes would not need any custom / specialized logic regarding usage collection: we could collect data for all opt-in routes the exact same way, and collecting the exact same things (or at least, all customization should be achievable via configuration options).
If this is acceptable, then we could just handle everything from inside core, by just allowing route owners to add the appropriate options to the route configuration.
We could imagine something like
router.post(
{
path: '/api/hello/dolly',
options: {
authRequired: 'optional',
telemetry: {
enabled: true,
someCustomOption: 'maybe',
},
},
validate: { ... },
},
async (ctx, req, res) => {
// no change required in the handler.
return res.ok({});
}
);
Should we be differentiating between KibanaRequest and requests originating outside of Kibana?
You mean requests fired from inside the app versus from outside the app right (technically everything is a KibanaRequest on the server)? There isn't really a reliable way to dissociate them (the best we're doing atm is using the headers to guess if the request is coming from the app, but it can easily be forged). Ideally, we wouldn't dissociate them, but if we do, we need to determine exactly how we want to do it.
(Especially in option 2: ideally we wouldn't need to update Core every time a new route is added)
I ran Kibana locally and can see 600 messages registering route handler for [...
Do we really want to declare all of them manually? But the following question is
Should we be differentiating between KibanaRequest and requests originating outside of Kibana?
I'd say we are interested in the telemetry for public API routes only. Aren't we? Unfortunately, Kibana doesn't provide a standard mechanism to differentiate between public and internal routes (just a few routes have /internal
prefix in the path). Until that, we can track stats for routes prefixed with /api
(402 endpoints).
Should we be differentiating between KibanaRequests made in default vs non-default spaces?
Why might it be necessary? It's the same entity in the public contract.
How will we handle updating schemas?
Do you mean http
--> https
or validation schemas?
by just allowing route owners to add the appropriate options to the route configuration.
Since we haven't heard any requests to add this functionality, I'm erring on the side of minimizing the API surface until this functionality is requested.
You mean requests fired from inside the app versus from outside the app right (technically everything is a KibanaRequest on the server)?
Yes, I mean understanding requests that are originating from Kibana itself, vs being called externally. I know our method for detecting these via headers isn't bulletproof, but IMHO this is an important piece of data when we are considering making changes to our services... even if it isn't 100% accurate. (We are already doing this for some counters in the CoreUsageStatsClient
)
Should we be differentiating between KibanaRequests made in default vs non-default spaces?
Why might it be necessary? It's the same entity in the public contract.
I only suggested this as we are already doing it in CoreUsageStatsClient
. But personally I don't think this is as useful/important and understanding if it is a KibanaRequest
.
Do you mean http --> https or validation schemas?
Ah, sorry to be vague. I am talking about usage collection / telemetry schemas here.
I'd vote for a generic approach that catches all. Re the schema, I'd rather report it in an array-based way vs. a tree model (final fields to be decided):
// Array model
[
{ method: 'POST', path: '/api/hello/dolly', count: 20, authed: true, daily_timestamp, ... },
{ method: 'POST', path: '/api/login', count: 20, authed: true, daily_timestamp, ... },,
...
]
// Tree model
{
"POST /api/hello/dolly": {
authed: 20,
unauthed: 0,
},
"POST /api/login": {
authed: 20,
unauthed: 10,
},
}
Why using the array model over the tree one?
Q: how are we going to solve actual values that may incur PII issues (i.e.: PUT /api/user/my_username
)? Can we identify that my_username
is an actual var in the registered route path: '/api/user/{id}
?
Alternative: use APM! It already provides the traces (so we can count how many times each route is hit) and the masking option to report /api/user/{*}
Can we identify that my_username is an actual var in the registered route path: '/api/user/{id}?
In the handler, req.route.path
seems to be the resolved path (e.g /api/user/my_username
), but given that the data collection will be performed at the router level (or at least in core), providing the unresolved RouteConfig.path
(e.g /api/user/{id}
) to the collection hook should not be an issue imho.
use APM! It already provides the traces (so we can count how many times each route is hit) and the masking option to report /api/user/{*}
How does this masking
option work exactly?
Also, that means it will be required to have the customer enable APM to be able to collect this data, not sure we're fine with this?
Some API endpoints need to be able to scale to thousands of requests per minute (like apm agent configuration). If these trigger an additional PUT to a ui counter document on every request it adds a lot of extra load to Elasticsearch. If we provide this instrumentation we should change UI Counters to be lazy and e.g. only persist updated counts every 5 minutes.
How does this masking option work exactly?
Not entirely sure, to be fair... After looking at our APM traces, it looks like such masking does not happen when reporting, but it does apply them when analysing:
Also, that means it will be required to have the customer enable APM to be able to collect this data, not sure we're fine with this?
This is a constant blocker question we get to rule out the option to enable APM on customers' instances. However, we're discussing implementing Event Telemetry (and I have a gut feeling that this HTTP routes usage will end up using that feature). So, IMO, we are reinventing the wheel here.
I might be missing some points though, since I've been out for quite a while... cc @thesmallestduck
If we provide this instrumentation we should change UI Counters to be lazy and e.g. only persist updated counts every 5 minutes.
AFAIK, both, Usage Counters and UI Counters bulk upload on a timer-based approach (at least the last time I checked). Can you confirm @Bamieh?
@afharo yes that is indeed the case. UI/Usage counters are buffered in memory then flushed into ES. Batching is done every 5 seconds and can be configured under the kibana.yml config usageCounters.bufferDuration
.
However, we're discussing implementing Event Telemetry (and I have a gut feeling that this HTTP routes usage will end up using that feature). So, IMO, we are reinventing the wheel here.
Fair point. It may be worth waiting until we have a better understanding of what Event Telemetry will be before digging further on this feature request then?
FYI: I noticed APM is using Server Counters to report this. It sounds like a valid approach to me, and we could do something similar in the Core router: https://github.com/elastic/kibana/blob/7f5e6c3d9f26ae05d7f98ccd0b516def88e2d621/x-pack/plugins/apm/server/routes/apm_routes/register_apm_server_routes.ts#L149-L154
What do you think?
It could be a very good quick win in my opinion.
We need to be careful of the implementation of server counters though. We'll pooling the updates right? else an ES request on every endpoint hit could be a performance issue.
We'll pooling the updates right? else an ES request on every endpoint hit could be a performance issue.
Yes, we batch the increments: https://github.com/elastic/kibana/blob/12b245c4e50555c227d569bc54bc64118bd15558/src/plugins/usage_collection/server/usage_counters/usage_counters_service.ts#L108-L128
The only problem with Usage Counters is that the receiving end is not ready yet to index them for easy consumption. Although we may be able to take a shortcut similar to what we've done in the past for other extended indices.
Do you all think that we can close this issue once we merge https://github.com/elastic/kibana/pull/196081? That PR focuses on deprecated APIs. Do we need to track the usage of non-deprecated ones?
We are currently working on a list of best practices for all public REST APIs exposed from Kibana. One recommended best practice is to collect telemetry on the usage of any public API.
We should think about how to make this process as easy as possible for plugin developers. Some options that come to mind (there are probably others):
kibana_usage_collection
plugin to pick up and report.Other considerations:
KibanaRequest
and requests originating outside of Kibana?KibanaRequest
s made in default vs non-default spaces?cc @elastic/kibana-tech-leads