elastic / kibana


Core usage collector for saved objects per type only collects top 10 types #135119

Closed: rudolf closed this issue 2 years ago

rudolf commented 2 years ago

The usage collector for saved objects per type uses a terms aggregation without specifying the size, which causes it to default to 10 buckets. That means we're only getting the top 10 types on each cluster. This is probably sufficient for Core to track which saved object types are most likely to cause scalability problems for users, but it's not enough for teams wanting to use this data for telemetry, so it probably also blocks us from removing the kibana collector: https://github.com/elastic/kibana/issues/99464

https://github.com/elastic/kibana/blob/main/src/plugins/kibana_usage_collection/server/collectors/saved_objects_counts/get_saved_object_counts.ts#L11-L31

Instead, I think we should aim to collect all saved object types by using a large size such as 200.
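
For illustration, a minimal sketch of what that could look like, assuming the v8 @elastic/elasticsearch client (the real helper in get_saved_object_counts.ts is shaped differently, and the constant name here is made up):

    import { Client } from '@elastic/elasticsearch';

    // Illustrative constant: large enough to cover all registered saved object types today.
    const MAX_SAVED_OBJECT_TYPES = 200;

    // Sketch only: the real collector receives the scoped Elasticsearch client from Core.
    async function getSavedObjectCountsPerType(esClient: Client, kibanaIndex: string) {
      const response = await esClient.search({
        index: kibanaIndex,
        size: 0,
        track_total_hits: true,
        aggs: {
          types: {
            terms: {
              field: 'type',
              // Without an explicit size, the terms aggregation returns only the top 10 buckets.
              size: MAX_SAVED_OBJECT_TYPES,
            },
          },
        },
      });
      return response.aggregations;
    }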

Related: https://github.com/elastic/kibana/pull/99344

elasticmachine commented 2 years ago

Pinging @elastic/kibana-core (Team:Core)

gsoldevila commented 2 years ago

Would it be possible to refer to the search.max_buckets constant instead? If tomorrow we have 201 saved object types, we might just inadvertently miss one of them.

Since the number of SO types should not grow like crazy, I imagine we should be fine specifying a large size. I don't know if there's a performance penalty, though.
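
For reference, search.max_buckets is a dynamic cluster setting rather than an exported constant, so referring to it would mean reading it at runtime. A rough sketch, assuming the v8 @elastic/elasticsearch client (option and field names may differ in other client versions):

    import { Client } from '@elastic/elasticsearch';

    // Reads the effective search.max_buckets value from the cluster settings, so the
    // aggregation size could be derived from it instead of a hard-coded literal.
    async function getMaxBuckets(esClient: Client): Promise<number | undefined> {
      const settings = await esClient.cluster.getSettings({
        include_defaults: true,
        flat_settings: true,
      });
      const value =
        settings.persistent['search.max_buckets'] ??
        settings.transient['search.max_buckets'] ??
        settings.defaults?.['search.max_buckets'];
      return value !== undefined ? Number(value) : undefined;
    }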

rudolf commented 2 years ago

Yeah, I agree that we would want some kind of assurance that we're not missing data.

We could use search.max_buckets, which defaults to 10k, but like you say there is a performance risk. We could either test the performance with 10k types (and/or speak to the Elasticsearch team) to mitigate the risk, or defer this to the future and artificially limit it to a smaller number.

Another variation on afharo's suggestion of a CI test would be a development-only assertion that crashes the server if there are > 0 results in the "other" bucket. In production, 3rd-party plugins could register more types, so we might see "other" showing up in our telemetry.

I'm not sure what the best approach is, but for telemetry we should be careful about anything that could have a big performance impact, so I'd lean towards a lower value with some way to detect that we're approaching the limit.
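
One cheap way to detect that: the terms aggregation response already reports sum_other_doc_count, which is greater than zero whenever documents fell outside the requested number of buckets. A rough sketch of the development-only assertion idea (names are illustrative):

    // Crash in development when buckets were dropped; only warn in production, because
    // third-party plugins can legitimately register more types than we list.
    interface TypesAggregation {
      buckets: Array<{ key: string; doc_count: number }>;
      sum_other_doc_count: number;
    }

    function assertAllTypesCollected(
      typesAgg: TypesAggregation,
      logger: { warn: (msg: string) => void }
    ) {
      if (typesAgg.sum_other_doc_count > 0) {
        const message =
          `Saved object telemetry dropped ${typesAgg.sum_other_doc_count} documents into ` +
          'the "other" bucket; consider increasing the terms aggregation size.';
        if (process.env.NODE_ENV !== 'production') {
          throw new Error(message);
        }
        logger.warn(message);
      }
    }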

afharo commented 2 years ago

After some internal discussion (please @rudolf keep me honest here), our ideal report would include:

The problem with the current terms aggregation is that it works in a Top-N fashion. This means that if a third-party plugin registers a new SO type and it's very popular, it will show up in the per_type list and push built-in types into the others bucket.

IMO, if we want the built-in types to always show in the detailed per_type view, and the third-party ones to always fall in the others bucket, we may need to change the aggregation.

I can think of 2 ways of doing this:

  1. Using the filters aggregation: we control each bucket by explicitly defining a bucket for every known built-in SO type (see the sketch after the query below).
  2. (My preferred one) Keep our terms aggregation but sort the values by _score: we list all the built-in SO types in the query so they score higher than non-built-in ones and show up at the top of the terms aggregation.
    GET .kibana/_search
    {
      "size": 0,
      "track_total_hits": true,
      "query": {
        "bool": {
          "must": [{ "match_all": {} }],
          // When there's a must clause, should only boosts the score (it's not an additional filter). We need to list all the built-in types here.
          "should": [{ "terms": { "type": [ "map", "VALUE2" ] } }]
        }
      },
      "aggs": {
        "types": {
          "terms": {
            "field": "type",
            "order": { "max_score": "desc" },
            "size": 200 // The size should match the length of the array in "should"
          },
          "aggs": {
            // The terms aggregation cannot access the _score directly, so this sub-aggregation brings it up.
            "max_score": { "max": { "script": "_score" } }
          }
        }
      }
    }
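
For comparison, a rough sketch of option 1 with the filters aggregation, using its built-in other_bucket support (the type list below is an illustrative subset, not the real one):

    // Option 1 sketch: every known built-in type gets an explicit bucket, and
    // other_bucket_key catches anything registered by third-party plugins.
    const BUILT_IN_TYPES = ['map', 'dashboard', 'visualization']; // illustrative subset

    const savedObjectTypeFilters = {
      filters: {
        other_bucket: true,
        other_bucket_key: 'other',
        filters: Object.fromEntries(
          BUILT_IN_TYPES.map((type) => [type, { term: { type } }] as const)
        ),
      },
    };

    // Usage: pass it as `aggs: { types: savedObjectTypeFilters }` in the search request.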

The problem with the score-sorting approach is that we need to keep a list of all the built-in saved object types and add tests to make sure folks keep the list updated the moment they register new types.

What do you think?

afharo commented 2 years ago

I created #135837 based on the 2nd approach (terms aggregation sorted by _score) + others bucket.

It adds a Jest integration test that compares the hard-coded list of SO types with the currently registered ones. If they don't match, it fails. Hopefully it doesn't add too much burden for folks adding/removing SOs.
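
A hypothetical sketch of what such a guard could look like (the imported names are placeholders; the actual test in the PR is likely organized differently):

    // Placeholder imports: however the test obtains the hard-coded list and the runtime
    // registry, e.g. typeRegistry.getAllTypes().map((t) => t.name) for the registered names.
    import { ALL_SAVED_OBJECT_TYPES } from './saved_object_types';
    import { getRegisteredSavedObjectTypeNames } from './test_helpers';

    describe('saved object type list used by the usage collector', () => {
      it('stays in sync with the types registered at runtime', async () => {
        const registered = await getRegisteredSavedObjectTypeNames();
        expect([...registered].sort()).toEqual([...ALL_SAVED_OBJECT_TYPES].sort());
      });
    });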