Support for `wildcard` fields

jpountz commented 4 years ago

Elasticsearch has a new wildcard field that mostly behaves as a keyword field but runs wildcard queries more efficiently.

Relates to elastic/elasticsearch#53175 and #35481.

elasticmachine commented 4 years ago

Pinging @elastic/kibana-app (Team:KibanaApp)

elasticmachine commented 4 years ago

Pinging @elastic/kibana-app-arch (Team:AppArch)

timroes commented 4 years ago

Thanks for creating this. In general, when you state something like "mostly behaves the same", it would be helpful if you could list the differences, since they might have a high impact on whether and how we can solve the issue. Answers to these questions would be especially useful:

Does it support all queries exactly the same as a keyword field?
Does it support all aggregations exactly the same as a keyword field?
Are there any specifics around that field in _source or docvalues?

But in general every API/behavioral difference to the keyword field would be very helpful :-)

markharwood commented 4 years ago

The wildcard field compares to the `keyword` field as follows. I think the differences come down to:

| Feature | keyword | wildcard |
| --- | --- | --- |
| Sort by speeds | Fast | Not quite as fast (*caveat 1) |
| Aggregate speeds | Fast | Not quite as fast (*caveat 1) |
| Prefix query speeds (`foo*`) | Fast | Not quite as fast (*caveat 2) |
| Leading wildcard query speeds on high-cardinality fields (`*foo`) | Terrible | Much faster |
| Term query, full value match (`foo`) | Fast | Not quite as fast (*caveat 2) |
| Fuzzy query | Y (if allow expensive queries enabled) | N |
| Regex query | Y (if allow expensive queries enabled) | N |
| Range query | Y (if allow expensive queries enabled) | N |
| Disk costs for mostly unique values | high | lower |
| Disk costs for mostly identical values | low | medium |
| Max character size for a field value | 256 for default JSON string mappings, 32,766 Lucene max | unlimited |

While @jimczi and @jpountz have thought of this as predominantly a keyword field with wildcard optimisations, I think the last feature in this table is important for large machine-generated content such as:

1) Our own CI build output
2) Elasticsearch log files with big stack traces

With values >32k we physically can't use keyword fields due to a Lucene limit, but equally we might not want to treat the content as a text field because:

1) We don't want to complicate indexing by having to consider which characters (`.`, `/`, `\` etc.) are word separators.
2) We don't want to complicate grep-like searches using wildcards by breaking character sequences along indexed word boundaries and assembling them using bool or interval queries.

In these cases, the answer to the usual "keyword or text?" question is "neither", and wildcard might be a suitable alternative. In this context of handling big machine-generated values, it is probably not a good idea to use the field for aggregations or sorting. (What protection should we have for that, Jim/Adrien?)
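To make the "neither keyword nor text" choice concrete, here is a minimal sketch of the two mapping bodies being compared. The field name `message` and both helper functions are hypothetical, and the 32,766 figure is simply the Lucene limit quoted in the table above, treated here as a byte count.

```python
# Lucene limit quoted in the comparison above (assumed to be in bytes).
LUCENE_MAX_TERM_BYTES = 32766


def build_log_mappings(field="message", use_wildcard=True):
    """Return an index-creation body mapping `field` as wildcard or keyword."""
    if use_wildcard:
        # wildcard fields carry no ignore_above-style size limit.
        field_mapping = {"type": "wildcard"}
    else:
        # 256 mirrors the default used for dynamically mapped JSON strings.
        field_mapping = {"type": "keyword", "ignore_above": 256}
    return {"mappings": {"properties": {field: field_mapping}}}


def fits_in_keyword(value):
    """True if the UTF-8 encoded value stays within the Lucene term limit."""
    return len(value.encode("utf-8")) <= LUCENE_MAX_TERM_BYTES


if __name__ == "__main__":
    print(build_log_mappings())
    print(fits_in_keyword("x" * 40000))
```

A large stack trace fails `fits_in_keyword`, which is the case where the wildcard mapping is the only one of the two that can hold the full value.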

timroes commented 4 years ago

Thanks for the detailed comparison. This is really helpful. While it looks nearly the same, there is one thing that will make a difference for Kibana:

The lack of range queries for wildcard fields can break KQL. We don't expose the range query on keyword fields in the filter UI, but you can write KQL queries using > and < on keyword fields. Since they don't work for wildcard fields, there would need to be some special handling for those in KQL.

I'll remove the KibanaApp label from this, since given the list above there is nothing outside App Arch area that would require additions (assuming that we would still mark this as string type in Kibana, and make the difference in KQL based on the esType stored in the index pattern).
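As a rough illustration of that special handling, the sketch below rejects range operators when a field's stored ES types include `wildcard`. It is not Kibana code: `check_kql_range`, the `es_types` list, and the operator set are all hypothetical names, and it assumes the true ES type is available from the index pattern.

```python
# Operators that KQL would translate into a range query (illustrative set).
RANGE_OPERATORS = {">", ">=", "<", "<="}


def check_kql_range(field_name, es_types, operator):
    """Raise ValueError when a range operator targets a wildcard field.

    `es_types` stands in for the esTypes stored with the index pattern.
    """
    if operator in RANGE_OPERATORS and "wildcard" in es_types:
        raise ValueError(
            f"Range operator {operator!r} is not supported on "
            f"wildcard field {field_name!r}"
        )
    return True


if __name__ == "__main__":
    print(check_kql_range("host.name", ["keyword"], ">"))
    try:
        check_kql_range("message", ["wildcard"], ">")
    except ValueError as err:
        print(err)
```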

markharwood commented 4 years ago

This is a long comment, so the TL;DR is: I think it's worth Kibana giving wildcard fields some special treatment in log message analytics.

Wildcards in log message analytics

Whenever I'm helping support diagnose Elasticsearch cluster failures, we have to sift through large log files, and I use Elasticsearch and Kibana to do it. The log messages can be big; here's the range of logged message sizes from a recent typical case:

These fall beyond what would be useful or possible to map as keyword fields so I index as text (and am still finessing what is a good Analyzer setup for this content). In an ailing cluster there's a lot of message repetition (albeit with near-duplicates not exact duplicates). Effective investigation relies on identifying the different types of message and either removing them from the clutter or plotting on a timeline to see the sequencing and volume of events e.g.

Identifying the message type involves copying and pasting parts of the log as a query clause which is where the problems come in. Let's take this example of using a mouse to select the part of a message about a particular failing node - NodeNotConnectedException: [54b_data_2]

However, this selection will not work as a query and is something I struggle with constantly. With a text field the user has to know about the details of the tokenisation policy of where words end and begin to formulate a query. While the selection can be placed in quotes to ensure multi-words are run as a phrase query, particular attention has to be paid to word beginnings and endings. The NodeNotConnectedException part of the selection cuts a token in half because with my Analyzer dots are retained. So the first word needs to be backed up to org.elasticsearch.transport.NodeNotConnectedException. If a similar token-clipping occurs at the selection end we must add a * to the end of the search string. This is painful.

With the wildcard field these sorts of selections could be handled simply - the user selection is wrapped with asterisks and it matches in a predictable way without the searcher or the elastic db admin having to consider tokenisation policies. It does make me wonder how KQL or filter bars may organise these selections (KQL may be clunky if the copy/pasted values contain special chars and filter pills aren't easily ORed).
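A minimal sketch of the "wrap the selection in asterisks" idea follows. The helper names are hypothetical; the sketch assumes the pasted text should match literally, so it escapes `*`, `?` and `\` with a backslash before building a standard `wildcard` query body.

```python
def escape_wildcard_literal(text):
    """Escape wildcard metacharacters so pasted text matches literally."""
    # Backslash goes first so the escapes added below are not escaped again.
    for ch in ("\\", "*", "?"):
        text = text.replace(ch, "\\" + ch)
    return text


def build_contains_query(field, selection):
    """Wrap a copy/pasted selection in asterisks as a wildcard query body."""
    pattern = "*" + escape_wildcard_literal(selection) + "*"
    return {"query": {"wildcard": {field: {"value": pattern}}}}


if __name__ == "__main__":
    selection = "NodeNotConnectedException: [54b_data_2]"
    print(build_contains_query("message", selection))
```

The selection is used exactly as copied, with no need to back up to the start of `org.elasticsearch.transport.NodeNotConnectedException` or to know where the analyzer would have split tokens.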

I see little or no use for sorting or aggregations on a log message field like this so I wonder if we should have the option to disable that particular wildcard field behaviour either at the elasticsearch level or the kibana level.

Maybe we need to think of the "wildcard-on-big-log-messages" and "wildcard-on-shorter-keyword-like fields" as two distinct use cases in Kibana/elasticsearch?

markharwood commented 4 years ago

Related - a regex debugger would be very useful: https://github.com/elastic/kibana/issues/66735

jpountz commented 4 years ago

@markharwood can this be closed now that wildcard fields pretend to be keyword fields in the _field_caps API? I'm expecting Kibana support for wildcard to come for free?

markharwood commented 4 years ago

> can this be closed now that wildcard fields pretend to be keyword fields

I still have a suspicion large wildcard fields shouldn't be included in Kibana's drop-down lists for sorting or aggs along with the "proper" keyword fields. Admins and users alike will be frustrated by the circuit-breaker exceptions these would cause.

We know wildcard will be useful on large fields and we removed any "ignore_above" limits for them. I just can't see large fields making sense for sorting or aggs. Not sure how Kibana adds protection for that.

webmat commented 4 years ago

I was just now discussing how I expect we'll want to use wildcard for fields such as error.stack_trace... So I agree some problems could be lurking if users try to do aggregations on those

jpountz commented 4 years ago

@markharwood I'm seeing this as an orthogonal issue that shouldn't be Kibana's concern, but Elasticsearch's: if a field shouldn't be aggregated via Kibana, then it shouldn't be reported as aggregatable in _field_caps. So I'd suggest closing this issue and raising the question of how Elasticsearch should report large wildcard/keyword values such as stack traces.
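For illustration, this is roughly how a client could derive its list of aggregation candidates from a `_field_caps` response. The sample response is hand-written (not real output), and `aggregatable_fields` is a hypothetical helper that only trusts a field if every reported type marks it as aggregatable.

```python
def aggregatable_fields(field_caps_response):
    """Return sorted field names reported as aggregatable for all types."""
    names = []
    for name, by_type in field_caps_response.get("fields", {}).items():
        # A field can appear under several types across indices.
        if by_type and all(
            caps.get("aggregatable", False) for caps in by_type.values()
        ):
            names.append(name)
    return sorted(names)


# Hand-written sample shaped like a _field_caps response (illustrative only).
SAMPLE_RESPONSE = {
    "indices": ["logs-000001"],
    "fields": {
        "host.name": {
            "keyword": {"type": "keyword", "searchable": True, "aggregatable": True}
        },
        "error.stack_trace": {
            "keyword": {"type": "keyword", "searchable": True, "aggregatable": False}
        },
    },
}

if __name__ == "__main__":
    print(aggregatable_fields(SAMPLE_RESPONSE))
```

Under this approach, a stack-trace field reported with `aggregatable: false` would never reach an aggregation drop-down, with no wildcard-specific logic on the client.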

markharwood commented 4 years ago

> If a field shouldn't be aggregated via Kibana, then it shouldn't be reported as aggregatable in _field_caps

Good point. I'll open an elasticsearch issue.

I'm not convinced there's nothing left to be thought about in Kibana-land. For example - if they support a *foo* style query in the KQL bar and assume, like normal whole-term based queries, that can be run across multiple fields then it may result in slow results or timeouts. Wildcard fields will be fast but hitting other fields which are keyword will involve an expensive linear scan. They might want to think about how to manage those inequalities with these expensive queries.
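One way to picture that inequality is to split the target fields by expected cost before running a leading-wildcard pattern. This sketch assumes the true field types are known from some source (which `_field_caps` does not currently provide for wildcard fields), and both function names are hypothetical.

```python
def is_leading_wildcard(pattern):
    """True if the pattern starts with a wildcard metacharacter."""
    return pattern[:1] in ("*", "?")


def split_fields_by_cost(pattern, field_types):
    """Split fields into (cheap, expensive) lists for the given pattern.

    `field_types` maps field name to its true ES type (an assumption here).
    """
    if not is_leading_wildcard(pattern):
        # Prefix-style patterns are treated as cheap on every field.
        return list(field_types), []
    cheap = [f for f, t in field_types.items() if t == "wildcard"]
    expensive = [f for f, t in field_types.items() if t != "wildcard"]
    return cheap, expensive


if __name__ == "__main__":
    fields = {"message": "wildcard", "host.name": "keyword"}
    print(split_fields_by_cost("*foo*", fields))
```

A caller could then warn about, or skip, the expensive list rather than letting one slow keyword scan dominate the whole multi-field query.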

jpountz commented 4 years ago

As wildcard fields can't be distinguished from keyword fields from Kibana, I think that this one should be a question for Elasticsearch too?

markharwood commented 4 years ago

> As wildcard fields can't be distinguished from keyword fields from Kibana, I think that this one should be a question for Elasticsearch too?

That sounds like adding a different field-expansion list for wildcard/regex queries than the existing general-purpose one? Might be some BWC things to consider with any change there.

As for the aggregatable Y/N question, there are 2 options:

1) static: @colings86 and I discussed adding a possible wildcard_text type to signal the supported use cases
2) dynamic: the es admin can disable aggs using a field caps change.

With 2) there are also questions about how Kibana might pick up a change in elasticsearch field_caps if we make that dynamic. Maybe that's just a manual index-pattern refresh in Kibana. Do we already have an issue for making field_caps dynamic?

jpountz commented 4 years ago

No I don't. For the record, it might also be ok to not do anything and rely on circuit breakers to abort aggs on stack traces.

markharwood commented 4 years ago

> it might also be ok to not do anything and rely on circuit breakers to abort aggs on stack traces.

I think that was Jim's working assumption - the question is whether users and admins are going to be happy with that.

rayafratkina commented 2 years ago

@mattkime @petrklapka is this closed by mistake or actually confirmed to be working?

mattkime commented 2 years ago

@rayafratkina Thanks for bringing this to my attention as I should leave some notes -

wildcard fields have been supported as keyword fields since the field caps api started reporting them as such - https://github.com/elastic/elasticsearch/issues/53175

For more refined handling of these fields we'll need a method of identifying them as their true type - https://github.com/elastic/kibana/issues/120284


Support for `wildcard` fields #60933
