Apply context filters when retrieving list of fields in application

jimczi commented 3 years ago

In https://github.com/elastic/elasticsearch/issues/56195 we added the ability to filter the field capabilities output with an index filter. The idea is that _field_caps can dynamically retrieve a list of fields that are relevant for the context. If a user in Discover has a range filtering the last 3 days, we should restrict the list of available fields for suggestions to the ones that appear in the range. Our Observability solution also uses constant_keyword to differentiate data streams so applying the current filters to the _field_caps call could limit the fields to a specific metricset like cpu for instance.

This change is important for our Solutions in order to have a way to limit the number of fields in large index patterns when the context is narrowed (by a range filter or a filter on the dataset). Today the list of fields of an Index Pattern is retrieved once on each app through field_caps without taking the context into account.
We should make it even more dynamic and apply the context of the apps more eagerly when possible (changing the range filter should update the list of available fields).

elasticmachine commented 3 years ago

Pinging @elastic/kibana-app-services (Team:AppServices)

mattkime commented 3 years ago

I think the next step for this would be to design a specific UX that would benefit from this enhancement. Which solution?

@ruflin You spoke up on the ES issue, I'm curious if you have any ideas.

ruflin commented 3 years ago

++ on what @jimczi described. On my end, I initially had mainly Dashboards in mind: a dashboard could define its own prefilters for field caps so that, when looking at a mysql dashboard, only the related mysql fields are suggested.

On the solution side a few ideas:

APM: APM has a specific set of indices. In the future these will span logs, metrics, and traces, but they can be prefiltered to show only the relevant ones. @sqren
Endpoint UI: Same as APM; which indices are used is known in advance. @kevinlog
Metrics UI: I expect this is also a predefined subset to filter on. @jasonrhodes

sorenlouv commented 3 years ago

If a user in Discover has a range filtering the last 3 days, we should restrict the list of available fields for suggestions to the ones that appear in the range.

Yes! I'd love that. We could really use that in Observability for the search bar that currently suggests many irrelevant fields. I think that would also mostly solve https://github.com/elastic/kibana/issues/94879 cc @andrewvc

kevinlog commented 3 years ago

cc @peluja1012 @XavierM - this could be useful for future iterations of the global search bar in the Security app. We could restrict some field suggestions.

@sqren

Yes! I'd love that. We could really use that in Observability for the search bar that currently suggests many irrelevant fields. I think that would also mostly solve #94879

This would also be helpful for us in the Security UI, however I think we'd still like to "map" the suggestions to a more human readable format instead of always showing the fields as they appear in the document. What @andrewvc mentions here: https://github.com/elastic/kibana/issues/94879#issuecomment-804948223 is still relevant and should also be solved.

mattkime commented 3 years ago

Could a solutions team drive the planning for this effort? This needs a thorough UX story - How does the user set the filter? (either directly or indirectly via an existing control) Do these filters persist between Kibana apps? Perhaps there are other questions we should be asking but those are the first that come to mind.

From there I'd need to figure out what is needed from the index patterns api.

andrewvc commented 3 years ago

++ to @kevinlog , it doesn't hit the usability threshold we need.

While I think this proposal is useful it doesn't solve https://github.com/elastic/kibana/issues/94879 , though it might be a step in the right direction. What that issue calls for is a human curated list of fields that are useful. If Elasticsearch mappings had the ability to tag fields as 'friendly' or 'high priority', that would be more helpful, though I'm not convinced that's the right approach. That would look something like what is below:

{
  "monitor.name": {
    type: "text",
    friendly_name: "Monitor Name",
    description: "The monitor's human readable name",
    priority: true // definitely needs a better name, 'suggested?'
  }
}

The point is we just want to show the most frequently used fields. In any given solution there are lots of internal fields that are used for various purposes but rarely queried by users. We should not suggest these.

The real question is where do we do this work. Do we do it in ES? In Kibana? If Kibana, do we do it in an Index Pattern? Something else on top of one?

My $0.02 is that it's better to do it in Kibana and bundle a field list with each solution or package. I'd just make this a feature of the Kuery Bar UI component.

I'm on vacation starting next week lasting through mid-month, but I'd be glad to move this forward when I return, though I won't have time before this.

mattkime commented 3 years ago

@andrewvc

The point is we just want to show the most frequently used fields.

Index patterns already has this via the field 'count' property. Discover is the only app that makes use of it but it certainly could be expanded. If a particular solution creates index patterns, count values for the various fields could be preset. In the UI, this value is labeled 'popularity'
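
As a rough illustration of ranking by that property, here is a minimal sketch. The `CountedField` shape and the `sortByPopularity` helper are hypothetical names made up for this example; only the idea of a per-field `count` comes from the comment above.

```typescript
// Hypothetical field shape: only the `count` property is taken from the discussion.
interface CountedField {
  name: string;
  count?: number; // labeled "popularity" in the UI; treated as 0 when missing
}

// Return a new array ordered by descending count, breaking ties by name.
function sortByPopularity(fields: CountedField[]): CountedField[] {
  return [...fields].sort(
    (a, b) => (b.count ?? 0) - (a.count ?? 0) || a.name.localeCompare(b.name),
  );
}

const suggested = sortByPopularity([
  { name: "monitor.name", count: 5 },
  { name: "agent.id" },
  { name: "monitor.status", count: 12 },
]);
console.log(suggested.map((f) => f.name));
```

A solution that creates its own index patterns could preset the counts so that its preferred fields sort first.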

andrewvc commented 3 years ago

@mattkime I realize the word 'frequently' did not convey my intent accurately. I meant it to mean the most useful fields, not the ones that were literally the most accessed.

Let's think about users who don't know every field in the Uptime or APM schemas. They don't want a field named monitor.duration.us, and then to query by microseconds, they just want "Monitor Duration", in seconds or ms. They don't even want to see most of the fields we have, many of which are esoteric, or best for internal use.

This list needs to be manually curated, not algorithmically generated. This whole problem is analogous to the situation between the new Exploratory View and Lens. Lens is a power tool with full access to the schema, and assumes the user is comfortable with, or willing to learn the under the hood schema. The reason the exploratory view is so powerful is it presents users with commonly used fields exclusively, and doesn't use complex schema dot notation, rather favoring friendly names, and only showing a small number of the most important fields.

andrewvc commented 3 years ago

That said, this specific ES feature is probably still a nice win for the Kuery bar in the discover app, and may be part of a larger approach for solutions. (IMHO)

jasonrhodes commented 3 years ago

I'm not sure the manually-curated solution @andrewvc is describing would ever work for us in the Metrics UI (and I'm not sure about the Logs UI, either), largely because so much of what a user will be interacting with will be dynamic on some level. So even if we are able to tell in advance which are the most useful fields a user may want to query on, knowing if those fields are even present in the data (and in the time range being queried) would be incredibly useful to us. @simianhacker has taken several stabs at trying to do this for us in the Metrics UI to no avail.

It would be fantastic if a field's definition somehow included the human readable details alongside its mapping (this feels like we're veering back into all of the field customization from Kibana Index Patterns?) so that when we detect the existence of monitor.duration.us in the queried data, the field is presented to the user as Monitor Duration (just as an example using words already mentioned here so far).

ruflin commented 3 years ago

Elasticsearch supports metadata for each field mapping. It would be possible to add this description there. I remember when we introduced this @jpountz mentioned to not encourage "random" data there as it might explode the template size. But maybe something we should look into.

Let's assume for a moment it is in the template. Kibana could read it directly from there. It would also allow fields like monitor.duration.us, which have different descriptions in different contexts, to be described multiple times. So the metrics-uptime-* template has one description, and metrics-nginx.stub_status-* has a different description that better fits its context. Kibana would show the correct one in the correct context based on which indices are crawled.

jpountz commented 3 years ago

Cluster state storage isn't free so we are careful with how we use it. We have plans on the roadmap to deduplicate mappings on data streams, so if you plan on moving forward with things like that, please let us know ahead of time so that we can prioritize accordingly.

I'd need to think more about whether field descriptions would be a good fit for metadata. On the one hand it feels ok because a field's description is metadata about a field, but on the other hand our current features wouldn't allow setting field descriptions on dynamically mapped fields. I'm also unsure how we'd handle i18n, should we store the field's description in all supported languages in the metadata? This would make mappings very hard to read. Filtering the list of fields so that it only shows those that are relevant to the data feels very useful, let's move the discussion about field descriptions to a separate issue?

Regarding relevancy, one idea that has been floating around would be the ability to store telemetry about a cluster's usage within a cluster for our users' purposes. For instance storing information about field access could help provide users with better field suggestions, but we could also leverage this information to make recommendations about which fields it would make sense to move to runtime fields in order to save space, and I'm sure we'll find other use-cases as we think more about it.

ruflin commented 3 years ago

The description of a field could become quite extensive in some scenarios. So pushing all this to the cluster state does not seem like the ideal place. I'm wondering if there are other options where we could store "meta" information about a field which does not have to end up in the cluster state. Usually this information is not required during query time but is retrieved one by one when a user wants to get more information about the field. Specifying it in the template itself would still be convenient ...

jasonrhodes commented 3 years ago

We're really edging back towards storing this info in the Data Views née (Kibana) Index Patterns, eh? If we can solve the sync/caching issue, maybe it's not the worst idea?

ruflin commented 3 years ago

I think there is an issue here with storing it in index patterns. Take my example above with metrics-uptime-* and metrics-nginx.stub_status-*. This would mean, we need an index pattern for each of these. From a technical point of view, this would be ideal but a user would now see on the left side an index pattern drop down with a LOT of entries. This is why we created metrics-* index pattern but it causes issues for us as index patterns are not really aligned with our indexing strategy. Maybe that is the more fundamental issue to be solved.

ruflin commented 3 years ago

This issue derailed a bit into how to store additional information about fields and other discussions. What @jimczi initially described is much simpler and is about exposing all the benefits we get from the data stream naming scheme in Kibana for field caps. The benefits for the users are that they only see relevant fields (older related issue: https://github.com/elastic/kibana/issues/24709) and much better performance. Elasticsearch already has all the required features; Kibana should adopt them.

jasonrhodes commented 3 years ago

@ruflin yeah, that sounds good. Is there a simple example of how to implement this from the Kibana side? I'd love to test this out in Metrics/Logs...

jimczi commented 3 years ago

@jasonrhodes , you'd need to call the field capabilities API augmented with the active filters:

GET metrics-*/_field_caps?fields=*
{
  "index_filter": {
    "term": {
      "data_stream.dataset": "system.cpu"
    }
  }
}

Currently Kibana calls this API without any context (no index_filter) even if filters are defined.
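
A minimal sketch of how a caller might assemble that request, attaching `index_filter` only when a filter is active. The `buildFieldCapsRequest` helper and its types are hypothetical and not an existing Kibana function; only the `_field_caps` path, the `fields` parameter and the `index_filter` body key come from the request above.

```typescript
// Hypothetical types and helper for illustration; not part of Kibana's API.
type QueryDsl = Record<string, unknown>;

interface FieldCapsRequest {
  method: "GET";
  path: string;
  body?: { index_filter: QueryDsl };
}

// Build a _field_caps request; the body is omitted when there is no active filter,
// which corresponds to the context-free call described above.
function buildFieldCapsRequest(indexPattern: string, indexFilter?: QueryDsl): FieldCapsRequest {
  const request: FieldCapsRequest = {
    method: "GET",
    path: `/${indexPattern}/_field_caps?fields=*`,
  };
  if (indexFilter !== undefined) {
    request.body = { index_filter: indexFilter };
  }
  return request;
}

const cpuRequest = buildFieldCapsRequest("metrics-*", {
  term: { "data_stream.dataset": "system.cpu" },
});
console.log(JSON.stringify(cpuRequest, null, 2));
```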

mattkime commented 3 years ago

@jimczi I'm digging in to make sure I have a complete understanding of the basics for this effort.

For the most part, the idea of applying filter criteria from the search bar to field lists is pretty straightforward. However, you mention constant_keyword usage and I'm thinking about how this might be exposed to the user. Unless we plan to educate our users about how to use this field, we need to provide a user interface for selecting it. Do you have any thoughts on how we expose this field and its values?

jpountz commented 3 years ago

@mattkime I don't think that the goal is to expose constant keywords to users. I believe that constant keywords were mentioned because they work especially well with this index_filter parameter, but to me the goal is to make Kibana pass all the filters that it knows about to Elasticsearch and then Elasticsearch will figure out whether it can leverage these filters to narrow down field suggestions. For instance if a user already filtered based on http.request.method: GET and the active time range is on the past week, Kibana could send the following filter to Elasticsearch's _field_caps:

GET logs-*/_field_caps?fields=*
{
  "index_filter": {
    "bool": {
      "filter": [
        { "term": { "http.request.method": "GET" } },
        { "range": { "@timestamp": { "gte": "now-7d" } } }
      ]
    }
  }
}

In that specific case, Elasticsearch will automatically ignore old indices as well as indices that don't have a http.request.method field, which will likely yield a much smaller list of field suggestions than if no filter had been provided.
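
The "pass every filter Kibana knows about" idea could be sketched roughly as follows. `buildIndexFilter` and `TimeRange` are hypothetical names for this example; the resulting `bool`/`filter` structure mirrors the request above.

```typescript
// Hypothetical helper; the names are illustrative, not Kibana's API.
type QueryClause = Record<string, unknown>;

interface TimeRange {
  field: string; // e.g. "@timestamp"
  gte: string; // e.g. "now-7d"
  lte?: string;
}

// Combine all known filters and the active time range into one bool/filter clause.
function buildIndexFilter(filters: QueryClause[], timeRange?: TimeRange): QueryClause | undefined {
  const clauses: QueryClause[] = [...filters];
  if (timeRange !== undefined) {
    const bounds: Record<string, string> = { gte: timeRange.gte };
    if (timeRange.lte !== undefined) {
      bounds.lte = timeRange.lte;
    }
    clauses.push({ range: { [timeRange.field]: bounds } });
  }
  // With nothing to filter on, return undefined so index_filter can be omitted.
  return clauses.length > 0 ? { bool: { filter: clauses } } : undefined;
}

const lastWeekGets = buildIndexFilter(
  [{ term: { "http.request.method": "GET" } }],
  { field: "@timestamp", gte: "now-7d" },
);
console.log(JSON.stringify({ index_filter: lastWeekGets }, null, 2));
```

Kibana does not need to decide which filters are useful here; it forwards them all and leaves it to Elasticsearch to work out which indices can be skipped.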

ppisljar commented 3 years ago

reading all above this is what i propose:

we either extend the getFields method of index pattern to take in optional Filter[] or add a new getFilteredFields method. This method will include provided filters to its query to the field caps API. Index pattern (internal) field list will not be affected by this and will always contain all the fields.
applications that want to consume this are free to do so, for example visualize could set all the filters its sending to a visualization to this method and make sure the field lists it exposes only contain relevant fields.
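
To make the shape of that proposal concrete, here is a rough sketch. Everything in it is an assumption for illustration: `IndexPatternSketch`, `FieldSpecSketch` and `FieldCapsFetcher` are invented names, and the fetcher is a stub standing in for the real field caps call.

```typescript
// Illustrative sketch only: these types are assumptions, not Kibana's real ones.
interface FieldSpecSketch {
  name: string;
  type: string;
}
type FilterQuery = Record<string, unknown>;

// Stand-in for whatever performs the _field_caps call.
type FieldCapsFetcher = (pattern: string, indexFilter?: FilterQuery) => Promise<FieldSpecSketch[]>;

class IndexPatternSketch {
  private readonly title: string;
  private readonly fetchFieldCaps: FieldCapsFetcher;
  private fields: FieldSpecSketch[] = [];

  constructor(title: string, fetchFieldCaps: FieldCapsFetcher) {
    this.title = title;
    this.fetchFieldCaps = fetchFieldCaps;
  }

  // Unfiltered load: fills the internal list with every field.
  async loadFields(): Promise<FieldSpecSketch[]> {
    this.fields = await this.fetchFieldCaps(this.title);
    return this.fields;
  }

  // The internal list is never narrowed by filters.
  getFields(): FieldSpecSketch[] {
    return this.fields;
  }

  // Filtered lookup: returns a narrowed list and leaves the internal list untouched.
  async getFilteredFields(filters: FilterQuery[]): Promise<FieldSpecSketch[]> {
    const indexFilter = filters.length > 0 ? { bool: { filter: filters } } : undefined;
    return this.fetchFieldCaps(this.title, indexFilter);
  }
}

// Stub fetcher with canned data, so the sketch runs without a cluster.
const stubFetcher: FieldCapsFetcher = async (_pattern, indexFilter) =>
  indexFilter === undefined
    ? [
        { name: "jvm.memory.heap.used", type: "long" },
        { name: "system.cpu.total.pct", type: "float" },
      ]
    : [{ name: "system.cpu.total.pct", type: "float" }];

const metricsPattern = new IndexPatternSketch("metrics-*", stubFetcher);
metricsPattern
  .getFilteredFields([{ term: { "data_stream.dataset": "system.cpu" } }])
  .then((fields) => console.log(fields.map((f) => f.name)));
```

A separate method keeps existing `getFields` callers unaffected, which is one argument for `getFilteredFields` over an extra optional parameter.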

i have one question still: aren't we putting a lot of effort to work around the original problem we have? we are misusing index patterns in metrics UI and some other solutions to store too many fields. we should rather have multiple index patterns for each of the data type they are storing.

jasonrhodes commented 3 years ago

@ppisljar are you referring to the metricbeat-* index mapping? It's unfortunately rather set in stone, but data streams are the solution to that. We're quite some ways away from when all customers will be storing all data in more segmented data streams, though, so until then this will help a lot.

ppisljar commented 3 years ago

yes i am referring to the metricbeat-* index mapping. It seems you use multiple dataviews in elasticsearch with dense fields and well defined scope, but you fail to transfer that in kibana with a single mapping that just matches everything.

Also i agree with @mattkime , this needs a good user story and UX. I just can't imagine how exactly we would make use of this. For example, you are looking at a dashboard and you want to add a filter to it. You expect the field list to contain fewer fields, but which ones really? Your dashboard contains visualizations from a dozen different data views, could have fixed (alternative) time ranges defined for specific panels, etc. So i just can't imagine what exactly we should filter on in such cases.

sorenlouv commented 3 years ago

Also i agree with @mattkime , this needs a good user story and UX. @ppisljar

I can contribute with a use case from APM.

APM transactions are stored in indices like apm-8.0.0-transactions-* (for data streams this will be traces-apm* but won't change anything).

A user can have (micro)services in different programming languages. Transactions from these will be ingested to the same indices (aka not an index per service), so services will share the same mapping.

This causes field suggestions to "bleed" over to services where they are not relevant. For instance, jvm.* fields are only relevant for java services but are also displayed for the ruby service:

[Image: ruby-jvm]

In this case, selecting a jvm.* field in the ruby service will always return 0 results.

For value suggestions we don't have this problem because we filter the suggestions with a terms agg to only get relevant values. In the following (overly) simple example we only expect to see service.name suggestions matching the currently selected service (opbeans-ruby) which is exactly what we see:

[Image: ruby-service-name]

Similarly on the service overview we expect to see suggestions for all services:

[Image: service-overview]

It would be AWESOME If we could have the same filtering mechanism for field suggestions like we have for value suggestions, so only relevant fields show up for the selected service.

jimczi commented 3 years ago

A user can have (micro)services in different programming languages. Transactions from these will be ingested to the same indices (aka not an index per service), so services will share the same mapping.

Why is this set in stone ? Why not using one index per language ? If the mapping of these languages is distinct it would be beneficial to separate them in their own indices.

It would be AWESOME If we could have the same filtering mechanism for field suggestions like we have for value suggestions, so only relevant fields show up for the selected service.

Implementing value suggestions through an aggregation is a bad practice on time-based indices. It is imprecise and comes at a big cost in terms of performance and latency. Value suggestions, and by extension field suggestions, need to be fast, so we cannot expect the same flexibility as we have in the query DSL. The new terms_enum API was added to replace this bad habit of using a plain aggregation to get suggestions. We need a better mechanism and that needs to start with the design. If you have a single data stream with different services that don't share the same fields, the recommendation is to split into multiple indices. We don't want to rely on slow features that only shine when tested with 10 documents in a demo.
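
For comparison, a value-suggestion request against the terms enum API might be assembled roughly like this. The sketch assumes the `_terms_enum` endpoint takes `field`, `string`, `size` and an optional `index_filter` in its body; `buildTermsEnumRequest` and its types are hypothetical names for this example.

```typescript
// Illustrative only: the helper and type names are made up for this sketch.
interface TermsEnumRequest {
  path: string;
  body: {
    field: string;
    string: string; // the prefix the user has typed so far
    size: number;
    index_filter?: Record<string, unknown>;
  };
}

// Build a terms enum request for prefix-based value suggestions on one field.
function buildTermsEnumRequest(
  index: string,
  field: string,
  prefix: string,
  indexFilter?: Record<string, unknown>,
  size = 10,
): TermsEnumRequest {
  const body: TermsEnumRequest["body"] = { field, string: prefix, size };
  if (indexFilter !== undefined) {
    body.index_filter = indexFilter;
  }
  return { path: `/${index}/_terms_enum`, body };
}

const serviceNameSuggestions = buildTermsEnumRequest("traces-apm*", "service.name", "opb", {
  range: { "@timestamp": { gte: "now-7d" } },
});
console.log(JSON.stringify(serviceNameSuggestions, null, 2));
```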

sorenlouv commented 3 years ago

Why is this set in stone ? Why not using one index per language ?

We did initially consider having a data stream per service, but since each service already has 4 data streams (logs, metrics, errors, traces) this would result in 100s, maybe even 1000s, of data streams for customers with many services. Therefore the ES team suggested that we not split data streams per service, and instead stick with 4 data streams in total.

jpountz commented 3 years ago

I was involved in these discussions and my recollection is that there was a bit more nuance. We did indeed advise against using different data streams per service to avoid index/shard explosion, however we still think that we should split data streams that would have different mappings. So ideally services should be grouped together depending on whether they would have very similar mappings or not. (There is still some nuance there, e.g. if 1,000 fields are common and only 2 fields differ, maybe it's still a better trade-off to put data into the same data stream to keep the number of data streams under control. The Elasticsearch team is happy to be consulted when cases like that arise.)

In an ideal world, APM would be able to have different granularities for each type of data that it records, e.g. maybe there could be a single data stream for internal metrics since all services have the same mappings for their internal metrics while there would be multiple data streams for traces as we would group services that have the same mappings for traces together. (I can certainly appreciate how it makes the architecture more complex.)

sorenlouv commented 3 years ago

however we still think that we should split data streams that would have different mappings

Okay, this sounds interesting. Something that we could consider is to split data streams by APM agent (so separate data streams for python, node, dotnet etc). I'll take this back to the team and see if this is something that is still possible - not sure whether it is considered a breaking change at this point.

axw commented 3 years ago

In an ideal world, APM would be able to have different granularities for each type of data that it records, e.g. maybe there could be a single data stream for internal metrics since all services have the same mappings for their internal metrics while there would be multiple data streams for traces as we would group services that have the same mappings for traces together. (I can certainly appreciate how it makes the architecture more complex.)

That's what we've ended up doing. We produce multiple data streams, most of which are not service-specific. There is one service-specific data stream which contains custom application metrics and runtime/language-specific metrics. We should be able to add a _field_caps filter for that data stream. @sqren I'll point you at more details offline

jpountz commented 3 years ago

Wonderful!

ppisljar commented 3 years ago

We should be able to add a _field_caps filter for that data stream. @sqren I'll point you at more details offline

Why is the field_caps filter still needed? If you split up your data streams so they don't have 1000s of fields, is this still an issue?

axw commented 3 years ago

@ppisljar taking the specific example that @sqren shared above:

For one of the data streams, there are still service-specific fields like jvm.*. We shouldn't show those as field suggestions when in the context of the "opbeans-ruby" service, since it is not running in a JVM and will never produce those metric fields.

ppisljar commented 2 years ago

resolved by https://github.com/elastic/kibana/pull/121367

elastic / kibana

Apply context filters when retrieving list of fields in application #95558