elastic / elasticsearch

Support case insensitive search on new wildcard field and keyword #53603

Closed markharwood closed 4 years ago

markharwood commented 4 years ago

Currently the wildcard field only supports case sensitive search but it is vital that we find a way to offer case insensitive search too. A recent blog post highlighted general string-matching problems and how users have resorted to ugly regex expressions like this one to overcome issues with case sensitivity:

/[Cc]:\\[Ww][Ii][Nn][Dd][Oo][Ww][Ss]\\[Ss][Yy][Ss][Tt][Ee][Mm]32\\.*/

The example above searches for a string from a case-insensitive operating system, where attackers may have deliberately used mixed-case commands to try to avoid simpler rule detection.

Solution 1: Index-time case choices

We could make the wildcard field accept an optional normalizer to lower-case the content at index time (much like the keyword field). However, in a centralised logging system we may be storing content from both Windows and Unix machines, whose file systems are case insensitive and case sensitive respectively. The importance of case may vary from one document to the next. This would typically mean we would be forced to index with multi-fields (one case sensitive, the other not), which would double the storage costs.
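A minimal sketch of what that multi-field workaround might look like, assuming the wildcard field accepted the proposed normalizer (the index, field, and normalizer names here are mine, and `lowercase_normalizer` would still have to be defined in the index's analysis settings):

```json
PUT logs
{
  "mappings": {
    "properties": {
      "file_path": {
        "type": "wildcard",
        "fields": {
          "lower": {
            "type": "wildcard",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}
```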

Solution 2: Query-time choices

The wildcard field already has 2 representations of the original content - an ngram index for approximate matching and a binary doc value of the original bytes for verification of approximate matches. If the ngram index is changed to always use lower-case then the decision to have case-sensitive matching or not becomes a query-time option when verifying candidate matches. There would be a (likely small) increase in the number of false-positives from the approximate matching but the big advantage is no increase in today's storage costs (actually a decrease if we normalise ngrams).

In either solution the searcher has to make a conscious decision - either to search a case-insensitive field or to declare the query clause as case-insensitive.

Solution 2 looks preferable to me from the back end but is a break with existing approaches, where case sensitivity is an index-time mapping decision, not a property of a query clause. It means the wildcard query clause would have a case-sensitivity parameter that is relevant when targeting a wildcard field but not a text or keyword field (although we could amend the keyword field logic to support this too).
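For illustration, Solution 2 might surface at query time like this - a sketch only, with the parameter name and the index/field names being mine rather than a settled API:

```json
GET logs/_search
{
  "query": {
    "wildcard": {
      "file_path": {
        "value": "*windows*system32*",
        "case_insensitive": true
      }
    }
  }
}
```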

Thoughts @jimczi @jpountz ?

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Mapping)

markharwood commented 4 years ago

Maybe there's a third solution. Solution 1 indexes content twice. Solution 2 changes the query syntax. Maybe a wildcard field can automatically have a pseudo multi-field which is defined as being case insensitive. As an example - a field called foo can be assumed by users to have a foo._nocase field which isn't anything physical - it's really just used as a signal that the verification query should make the pattern-matching it does case insensitive. There's no extra storage cost or extra flags required on query clauses with this approach.
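To illustrate, a search against the hypothetical virtual sub-field might look like this (nothing is physically indexed under `_nocase`; the suffix merely signals that verification should be case insensitive):

```json
GET logs/_search
{
  "query": {
    "wildcard": {
      "foo._nocase": {
        "value": "*Windows*"
      }
    }
  }
}
```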

markharwood commented 4 years ago

I opened a PR for this third option. https://github.com/elastic/elasticsearch/pull/53814

webmat commented 4 years ago

From ECS' POV it's important to preserve the ability to query case sensitively, as well as offer the ability to query case insensitively. Both are important, but my understanding is that case insensitivity is the more important of the two, especially for a fuzzy kind of search like wildcard.

Currently in ECS, most fields are keyword only. Some fields that require flexibility in how users search, whether because their content is messy (e.g. user agent) or because they're places where users do threat hunting (e.g. file paths & names, command lines) now also have a text multi-field. So the canonical string fields (e.g. myfield) are always keyword, and in some places we have myfield.text for full text search.

I think in most cases, the same fields that have text multi-fields will be candidates for wildcard as well.

If we add wildcard fields to the mix, they would likely be as another multi-field (e.g. myfield.wildcard). Here I'm making the assumption that wildcard could not replace keyword as the canonical type, since keyword and wildcard are not similar enough. Is this assumption correct?

If the above is correct, then when adding wildcard as a multi-field at myfield.wildcard, my understanding of your example would be that we also have myfield.wildcard._nocase as a way to query case insensitively?

If this is the case, I would like to suggest we flip the behaviours around instead. By default, I think most people would want case insensitive search on a wildcard field. Wildcards are already a fuzzy search. Then only when case sensitivity is needed (and keyword doesn't solve the need), they would resort to a virtual subfield to get wildcard + case sensitivity.

In other words, can we flip it around: make the wildcard field case insensitive by default, and offer a virtual sub-field for the rarer case-sensitive search?

@rw-access was telling me Endgame only supports case insensitive search, and then additional filtering or analysis is done, if case is important.

And actually, based on what Ross was telling me, perhaps we could even consider having wildcard only do case insensitive, and not even have a virtual sub-field?

@neu5ron ping

markharwood commented 4 years ago

Here I'm making the assumption that wildcard could not replace keyword as the canonical type, since keyword and wildcard are not similar enough. Is this assumption correct?

I think the differences come down to:

| Feature | keyword | wildcard |
|---|---|---|
| Sort speed | Fast | Not quite as fast (caveat 1) |
| Aggregation speed | Fast | Not quite as fast (caveat 1) |
| Prefix query speed (`foo*`) | Fast | Not quite as fast (caveat 2) |
| Leading wildcard query speed on high-cardinality fields (`*foo`) | Terrible | Much faster |
| Term query, full value match (`foo`) | Fast | Not quite as fast (caveat 2) |
| Fuzzy query | Y (if allow expensive queries enabled) | N |
| Regex query | Y (if allow expensive queries enabled) | N |
| Range query | Y (if allow expensive queries enabled) | N |
| Disk cost for mostly unique values | High | Lower |
| Disk cost for mostly identical values | Low | Medium |
| Max character size for a field value | 256 for default JSON string mappings, 32,766 Lucene max | Unlimited |

Caveat 1: somewhat slower because doc values are retrieved from compressed blocks of 32.
Caveat 2: somewhat slower because approximate matches with ngrams need verification.

perhaps we could even consider having wildcard only do case insensitive, and not even have a virtual sub-field?

That would be faster to search. There's an overhead in my option 3: converting stored mixed-case values to lower-case at query time. While option 3 minimises the disk storage needed to support both case-sensitive and case-insensitive search, the better trade-off might be to just make the field fast for the primary use case (case-insensitive).

jimczi commented 4 years ago

And actually, based on what Ross was telling me, perhaps we could even consider having wildcard only do case insensitive, and not even have a virtual sub-field?

I like the fact that the wildcard field is just a keyword field optimized for wildcard queries. For this reason I think it would be easier to simply allow a normalizer, as we have for the keyword field. This would force users to choose upfront whether they want to normalize, but I don't think we should eagerly normalize or create a virtual sub-field.

markharwood commented 4 years ago

I opened https://github.com/elastic/elasticsearch/pull/53851 to add the normalizer support.

There's no normalisation by default, but I would like to make the simple case of users wanting case insensitivity easier. Having to declare an analysis section just to define and register a lower-case token filter is a pain. Can we ship elasticsearch with a named normalizer, e.g. lowercase? That would make adding case insensitivity to a field a one-liner rather than complex JSON defining the analysis settings for lowercasing. I opened https://github.com/elastic/elasticsearch/issues/53872 to discuss adding a pre-declared named normalizer.
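For reference, this is roughly the analysis boilerplate that keyword normalizers require today, which a pre-declared named normalizer would remove (the index, normalizer, and field names here are mine):

```json
PUT logs
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_lowercase": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "keyword",
        "normalizer": "my_lowercase"
      }
    }
  }
}
```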

neu5ron commented 4 years ago

Hey all, co-author of the ugly regex blog here 🙃 Great discussion!

I like the proposed solution of making keyword and wildcard case insensitive. I want to stress, however, that if it's made a choice then the community will run into similar issues to those occurring elsewhere - one entity having something different from another. This is not something that only certain people need - this is something that all security/logging use cases need.

Regarding storage - I believe solving a visibility gap is critical and the side effect of increased storage is an acceptable downside. Also:

There should still be a good analyzed field similar to the text analyzer - maybe even an improved one for security use cases. If we don't have an analyzed field then we miss out on a lot of the additional powers of Lucene like fuzzy query and the newer terms query. However, I can live without an analyzed field if we have case-insensitive ability, because not many people were using those advanced Lucene queries - and those who were can add it if need be.

webmat commented 4 years ago

Here I'm making the assumption that wildcard could not replace keyword as the canonical type, since keyword and wildcard are not similar enough. Is this assumption correct?

Let me phrase that as a more direct question :-)

If index-1 has field myfield as keyword and index-2 has the same field as wildcard, would a query across myfield-* raise an error such as "aggregation_execution_exception"?

Also, can we do aggregations on wildcard fields?

Understanding whether we can replace keyword fields transparently with the wildcard data type where appropriate will inform our strategy here.

neu5ron commented 4 years ago

I think the gist is we need to have keyword and have the wildcard field lowercased. If the standard is set, implemented in Beats or whatnot, and communicated, then we should not have to worry about such overlap of fields - right @webmat?

markharwood commented 4 years ago

Understanding whether we can replace keyword fields transparently with the wildcard data type where appropriate will inform our strategy here.

All field types are expected to respond to requests for the same set of query types prefix/term/wildcard/range/fuzzy etc. Some field types outright reject some query types (eg wildcard currently doesn't do fuzzy) while others will attempt to perform a query type but not nearly as fast as other field types because their data structures aren't optimised for that case. For this reason it's not always a yes/no supported-features matrix for field types - there are variable performance characteristics. Perhaps the main reason for creating the wildcard field is not that keyword fields couldn't run wildcard queries but because they did so very slowly.

See my feature comparison table here

I'm also working on a blog for the 7.7 release to help expand on choosing between field types now that we have wildcard in the mix.

jpountz commented 4 years ago

@markharwood Actually I like your solution 2. It is simple, does not increase storage requirements, and I agree with you that false positives shouldn't increase dramatically. I wouldn't mind adding a case_sensitive: true/false option to wildcard and regex queries, we could rewrite the automaton for text/keyword fields and lowercase values dynamically for wildcard fields.

I agree with you that case-insensitive search is important. Among all the ways that content may be normalized, I believe that case folding is a bit special, for instance grep has an option for case-insensitive search, which I seem to use more often than not given my shell history, while it doesn't have any option for accent removal or other forms of normalization: users are expected to tweak their regex in such cases. I think it makes sense for us to follow a similar model, by introducing an ignore_case or case_sensitive option to the wildcard and regex queries.

markharwood commented 4 years ago

I believe that case folding is a bit special

Yes, my assumption is that it should be optimising for string equivalence as determined by machines rather than any sloppier equivalence that might be acceptable to humans. In other words the stricter set of normalization rules that are permitted by the OS when referring to files (so I doubt accent removal is required).

defensivedepth commented 4 years ago

Greetings All -

I have been following this thread closely, as it directly impacts the project I work on (@security-onion-solutions).

At this point, have any final decisions been made about how this should be handled?

markharwood commented 4 years ago

At this point, have any final decisions been made about how this should be handled?

We're still deciding. The options and their pros/cons are summarised here:

| Option | Implementation | Detail | Pros | Cons |
|---|---|---|---|---|
| 1 | Index-time treatment | Wildcard field has the option of a normalizer (like the keyword field) | Fast search; works with multiple query types | Requires more disk space if both case-sensitive and case-insensitive search are required |
| 2 | Query-time (via flags) | Add case-sensitive flags to regex and wildcard query types | Lower disk costs | Inconsistent query logic (flags won't work on some fields, e.g. text, and are only accepted on selected query types); slower search speeds |
| 3 | Query-time (via virtual field) | A virtual sub-field is created for denoting case-insensitive searches | Lower disk costs; works with multiple query types | Slower search speeds |

It's also possible that option 1 could co-exist with options 2 or 3, but it would become confusing if any index-time choices contradict the query-time choices, e.g. a case-sensitive mixed-case query targeting a field which has opted to use a lower-case normalizer.

defensivedepth commented 4 years ago

Thanks @markharwood

I think that increased storage is an acceptable trade-off in this particular situation - my vote would be for option 1.

There are two parts to solving this issue: 1) Development of the solution 2) Getting people to use the solution. If the solution is optional, it will become yet another esoteric setting that users will need to figure out.

TL;DR: It needs to be non-optional, or at least, default to the proposed solution, with a way to disable it if need be.

markharwood commented 4 years ago

If the solution is optional, it will become yet another esoteric setting that users will need to figure out.

That's the dilemma. We could up-front automatically optimise search for every conceivable query type (wildcard, exact-value-match, word-based matches, case sensitive, case-insensitive) but that would require multiple data structures which means more disk space. If users only want to pay for what they need they must opt in to these specialised field configurations or live with the limitations of an unoptimised field for some queries (eg wildcard searches on a keyword field).

All of the above is a statement on elasticsearch's general policy for handling string fields. When it comes to a more targeted domain like ECS, the query types we want to support for particular fields are knowable in advance. Picking the right elasticsearch configuration for each field defined in ECS is where the wildcard field will see adoption, and those choices should be debated in ECS github issues rather than here. In core elasticsearch we just need to make it possible for ECS to configure appropriate solutions, rather than automatically prescribing wildcard support for all elasticsearch users.

rw-access commented 4 years ago

When it comes to a more targeted domain like ECS the query types we want to support for particular fields are know-able in advance.

I think this has been understated in this thread. There's been some back and forth largely about what the defaults are, but in my opinion this will largely come down to the mappings provided with ECS. Personally, I would lean towards conservative defaults within Elasticsearch and communicating well what those are and how to change them. It isn't necessarily fair to all users to be affected by a bias towards common SIEM use cases.

ECS is where it seems most appropriate to define both wildcard and case sensitivity on a per-field basis.

I wouldn't mind adding a case_sensitive: true/false option to wildcard and regex queries, we could rewrite the automaton for text/keyword fields and lowercase values dynamically for wildcard fields.

++ for this query-time transformation. It also has the nice property of backwards compatibility. We've been talking about case-sensitivity with EQL as well and I was considering something like this when we want a case-insensitive search on a field indexed with its original case. Being able to do this on the fly and automagically, without the hoops @neu5ron mentions in his post is a big win.

markharwood commented 4 years ago

We've been talking about case-sensitivity with EQL as well and I was considering something like this when we want a case-insensitive search on a field indexed with its original case.

Worth noting that while this is a win for avoiding reindexing, it's a loss for the user attempting a case-sensitive search on a field indexed as lower-case. It's a break with the long-held principle of case sensitivity being determined by choice of field names, not query flags. That's why I proposed option 3.

rw-access commented 4 years ago

Yes support for 1 and 3 is perfect IMO.

With 2 and 3, it looks like it's still the same transformation, but with a difference in whether it's applied to one field or all fields in the query.

Would the proposed .lower/.nocase virtual subfield be supported with both keyword and wildcard? Both have independent value and many rules will not need wildcards, but still need case-insensitivity.

markharwood commented 4 years ago

Would the proposed .lower/.nocase virtual subfield be supported with both keyword and wildcard?

That's possible, but I'm concerned about how other query types (term, terms, prefix) would be expected to behave on a keyword virtual field. With a wildcard field, all supported query types (wildcard/term/prefix etc.) use a quick approximate ngram match which must be verified by retrieving the doc value - which is where we get the opportunity to lowercase on the fly if required. Some slowness is a built-in expectation for all query types, so the cost of lowercasing on the fly in a wildcard's virtual subfield is comparatively small.

However, with a keyword field the non-wildcard queries like term query are expected to be fast because they look up a single term directly in the index. Having a virtual .nocase field on a keyword would not be a cheap operation for term queries because they'd have to scan all the mixed-case terms in the index (much like a leading wildcard query does today). There's no ngram approximation index to accelerate them. That feels potentially trappy.

markharwood commented 4 years ago

After further discussion we concluded it would be useful to offer query-time case insensitive search options with the assumption they could be used on both wildcard and keyword fields.

There are still a number of open questions at this stage:

1. Which query types will offer a parameter for case-sensitive/insensitive matching? We assume the RegExp query would, but we're not sure which other types would (wildcard? term? prefix?).
2. What do we do if a user supplies a case-sensitivity preference at query time and the targeted keyword field used some form of normalizer at index time?

Unlike the wildcard field, we can't always guarantee a keyword field will have the original un-normalised strings easily accessible from doc values. If the content was normalised to lowercase and the query is a mixed case string with a case sensitive parameter we should probably error loudly rather than fail to match silently. Errors can be worked around but silent failures go unnoticed and can mislead users.

We agreed that a query with a case-sensitivity parameter set should fail when used on a text field because these nearly always perform some kind of normalisation at index time. Users should continue to think of text matching as something that has to have normalisation logic applied at index time.

markharwood commented 4 years ago

In terms of implementing case-insensitive regex queries on keyword fields - which of these two approaches would we use?

A) Use a new query implementation which does a linear scan of all doc values, lowercasing the regex query and each of the doc values read from disk on the fly.
B) Use the existing RegExpQuery but pre-process the regular expression, expanding it to all case permutations (e.g. a search for "foo" is expanded to "[Ff][Oo][Oo]").

A is simple to implement but slow. B is more complex to implement and perhaps not necessarily faster. Did you already have an idea on how this would work @jimczi ?

jimczi commented 4 years ago

I prefer option B: the automaton is intersected with the terms dictionary, so the expansion is limited to the terms that actually match. In the worst case all permutations exist in the dictionary, but the multi-term query handles that smoothly with the CONSTANT_SCORE_REWRITE method.

What query types will offer a parameter for case sensitive/insensitive matching? We assume the RegExp query would but not sure what other types would (wildcard? term? prefix?)

I have a slight preference for a new exact_match query as described in this issue. We need a query with a clear intent (matches the entire input) but term-based queries adapt their behavior based on the type of the field.

markharwood commented 4 years ago

Update - a PR for case insensitive Regex searches is happening in Lucene

markharwood commented 4 years ago

Following some discussion we concluded that term, prefix and wildcard queries should also have a case-insensitivity option.
Sadly this can't be a simple boolean flag. For these queries we assume that if you do nothing, the matching is case sensitive. Unfortunately that is not true for all fields - normalized keyword fields monkey with the search terms in term, wildcard and prefix queries, whereas other fields do not. Given the BWC issues this brings up, we probably need a tri-state parameter rather than a boolean. Something like:

"match_mode" :  "case_sensitive" / "case_insensitive" / "legacy"

The "legacy" mode would be the default, preserving the current (inconsistent) matching behaviour. The two other modes would do exactly what you would expect when matching the search input against the indexed tokens (which can differ from the JSON source).
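A sketch of how that proposal might look on a wildcard query (match_mode is the parameter under discussion here, not a shipped API; the index and field names are mine):

```json
GET logs/_search
{
  "query": {
    "wildcard": {
      "file_path": {
        "value": "*PowerShell*",
        "match_mode": "case_insensitive"
      }
    }
  }
}
```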

webmat commented 4 years ago

A tricky situation, but I like this proposal 👍

mayya-sharipova commented 4 years ago

@markharwood Is the plan first to correct term queries to remove normalization from them? Otherwise, if term queries still do normalization, will the case_sensitive option work at all?

markharwood commented 4 years ago

Is the plan first to correct term queries to remove normalization from them?

@mayya-sharipova that's not clear to me - it will need thrashing out on https://github.com/elastic/elasticsearch/issues/25487

For the moment I'm assuming we're not relying on getting that fix, because it's a breaking change that will need to wait for 8.0 and we want case-insensitive search out in 7.x.

will the case_sensitive option work at all?

No - we could choose to keep the provided query string's case if case-sensitive is explicitly set in the query's params, but the problem is that text fields and normalized keyword fields are likely to have erased case differences in the index, making the setting pointless in those cases. I think our policy on policing pointless queries was to just allow them, given @jimczi said this:

The scope of the case insensitive option is the query terms, not the indexed terms so I don’t think it should be considered that matching a text field with case insensitive have silent failures

I'm not sure if the last "case insensitive" in the above statement should have read "case sensitive". Either way, I took it to mean we are not going to try to warn or error if a user picks an inappropriate combination of query and index settings, e.g. a case-sensitive search on a text field that indexes as lower case. The "query plausibility" test I proposed was an attempt to detect useless query/field combinations and throw errors, but it wasn't a foolproof test for all query types, so I think we are relying on users knowing what they are doing in relation to their choice of indexed terms.

mayya-sharipova commented 4 years ago

@markharwood Thanks for the clarification, makes sense.

we could choose to keep the provided query string's case if case-sensitive is explicitly set in the query's params

This makes sense to me: when case_sensitive search is requested, we will never run the query through a normalizer.

markharwood commented 4 years ago

Query_string will be a challenge

Another consideration is that the most popular way of writing wildcard queries is likely not the dedicated JSON wildcard clause but query_string, which has its own parameters for determining how to treat wildcarded strings.

The analyze_wildcard setting in query_string does not control a keyword field's normalization logic - the field always normalizes the wildcard query input, regardless of this setting.

The analyze_wildcard setting also has no effect on case sensitivity in text field matching - it always lower-cases the wildcard query input even with analyze_wildcard : false.

I assume these existing design choices are guided by the idea that normalisation is a base level of functionality that sits below "analysis" choices like stemming. As such, it is always on, regardless of the analyze_wildcard setting.
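For reference, analyze_wildcard is an existing query_string parameter; as described above, it does not switch off a keyword field's normalization (the index and field names here are mine):

```json
GET logs/_search
{
  "query": {
    "query_string": {
      "query": "file_path:*System32*",
      "analyze_wildcard": true
    }
  }
}
```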

So any new flag we add to query_string would, I assume, provide a form of query-time normalization for fields that have had no index-time normalisation (wildcard, and keyword-with-no-normalizer). This will be tricky to name and document - I expect most users will struggle to grasp the subtleties of the new flag vs the existing analyze_wildcard setting.

KQL might be simpler

Practically speaking, many users will be entering queries using Kibana and therefore KQL, so I have opened a Kibana issue to add regex syntax, where case sensitivity would be controllable using the /Foo/i syntax common to other regex implementations.

ebeahan commented 4 years ago

Great discussion here all!

@markharwood - Wishing to clarify an earlier discussion point:

All field types are expected to respond to requests for the same set of query types prefix/term/wildcard/range/fuzzy etc. Some field types outright reject some query types (eg wildcard currently doesn't do fuzzy) while others will attempt to perform a query type but not nearly as fast as other field types because their data structures aren't optimised for that case. For this reason it's not always a yes/no supported-features matrix for field types - there are variable performance characteristics.

As ECS looks to adopt wildcard, we continue to evaluate where wildcard could be a transparent replacement for keyword. Per your earlier feature comparison and your reply to @webmat's ask, regex wasn't supported by wildcard at that time. Is this still the case? Would wildcard outright reject a regex query as was described for fuzzy?

jimczi commented 4 years ago

Would wildcard outright reject a regex query as was described for fuzzy?

The wildcard field should behave exactly like a keyword field. That was the requirement for releasing this field in 7.x, and it was achieved in time for 7.9. That means you can use a keyword or a wildcard field seamlessly in all mappings; they should return the same documents on every request (@markharwood added support for fuzzy and regexp in the meantime, so the earlier comments on this issue are outdated).
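To illustrate that parity, the same regexp query can target a field mapped as either keyword or wildcard without any change to the request (the index and field names here are mine):

```json
GET logs/_search
{
  "query": {
    "regexp": {
      "file_path": {
        "value": ".*[Ss]ystem32.*"
      }
    }
  }
}
```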

markharwood commented 4 years ago

Closing in favour of https://github.com/elastic/elasticsearch/issues/61162

qbit-git commented 1 year ago

So how do we handle case-insensitive search on the wildcard datatype in query_string now? @markharwood To be clear, I mean a wildcard query on a wildcard datatype, NOT a regex query on a wildcard datatype.