elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

New API to debug and test Regex query expressions. #57033

Closed markharwood closed 2 years ago

markharwood commented 4 years ago

Regular expression queries are likely to become much more popular with the forthcoming introduction of the wildcard field. In Lucene we opened up access to some of the internals to get detailed parser information, and in Kibana we have proposed a UI to help users write and test regular expressions. This issue is intended to provide the glue between these two pieces of work by exposing an API that returns a structured representation of a regex string. The input is the regular expression as a plain string; the output is a structured JSON object representing the parse tree of RegExp objects, including the type of expression each node represents.

It's not clear to me that this falls neatly into either the existing validate or explain APIs for queries. My assumption is that the break-down of a regex should not be tied to a particular elasticsearch index or field - it's only about explaining the internal complexity of a given regex string. It's possible that we should also provide an API to test a given regex against a given string that clients want to match. Clients like Kibana could then supply the test string from any source.

Why can't users just use existing online regex testing tools?

Online testing tools like https://regex101.com/ are useful (and to be emulated), but Lucene's regex support is not complete, so it is useful to have a tool that accurately represents the features that are actually available.

API outline

A new _regex_test endpoint would take a regex and an optional value to test against it:

GET /_regex_test
{
  "regex" : "(foot|net)ball",
  "value" : "football"
}

The result could look like this:

{
  "valid": true,
  "matches": true,
  "parse_tree": {
    "concatenation": [
      {
        "union": [
          {
            "string": "foot"
          },
          {
            "string": "net"
          }
        ]
      },
      {
        "string": "ball"
      }
    ]
  }
}   

The parse tree represents a parsed form of the regular expression logic.
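As a rough illustration of how a client might consume such a response, the sketch below walks a parse tree made of the three node types shown in the example (concatenation, union, string) and rebuilds an equivalent pattern. Python's re module is only a stand-in here, since Lucene's regex syntax differs, and the helper name tree_to_pattern is hypothetical rather than part of any proposed API.

```python
import re

def tree_to_pattern(node):
    """Rebuild a regex string from a parse-tree node (hypothetical helper)."""
    # Assumption: each node is a dict with exactly one key naming its type,
    # as in the example response.
    (kind, body), = node.items()
    if kind == "string":
        # Literal text: escape it so it is matched verbatim.
        return re.escape(body)
    if kind == "concatenation":
        return "".join(tree_to_pattern(child) for child in body)
    if kind == "union":
        return "(?:" + "|".join(tree_to_pattern(child) for child in body) + ")"
    raise ValueError(f"unsupported node type: {kind}")

# The parse tree from the example response.
parse_tree = {
    "concatenation": [
        {"union": [{"string": "foot"}, {"string": "net"}]},
        {"string": "ball"},
    ]
}

pattern = tree_to_pattern(parse_tree)
# Lucene regexes match the whole term, so fullmatch is the closest analogue.
matches = re.fullmatch(pattern, "football") is not None
```

A UI could use a walk like this to show the user the pattern that the parser actually understood, next to the string they typed.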

New query type - equivalent of interval queries for regex

Parsed regex output like the example above could potentially be accepted as a new query type, which would create AutomatonQuery objects when executed. End users wouldn't be expected to type this JSON, but it's not hard to imagine a Kibana UI that takes a regex string as a fast form of data entry and then renders the JSON response graphically for review and further editing.
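To make the idea of executing the tree directly a little more concrete, here is a toy evaluator that matches a string against the structured form without ever going back to a regex string. This is not how Lucene builds an AutomatonQuery; it only tracks the set of input positions reachable after each node, which is the same idea an NFA simulation uses. The node names follow the example response, and end_positions and tree_matches are hypothetical names.

```python
def end_positions(node, text, starts):
    """Return the positions in text where node can finish matching,
    given the set of positions where it may start."""
    # Assumption: each node is a dict with exactly one key naming its type.
    (kind, body), = node.items()
    if kind == "string":
        return {s + len(body) for s in starts if text.startswith(body, s)}
    if kind == "concatenation":
        positions = starts
        for child in body:
            # Each child starts wherever the previous one could end.
            positions = end_positions(child, text, positions)
        return positions
    if kind == "union":
        result = set()
        for child in body:
            # Any alternative may match from any of the start positions.
            result |= end_positions(child, text, starts)
        return result
    raise ValueError(f"unsupported node type: {kind}")

def tree_matches(tree, text):
    """True if the tree matches the whole of text."""
    return len(text) in end_positions(tree, text, {0})

tree = {
    "concatenation": [
        {"union": [{"string": "foot"}, {"string": "net"}]},
        {"string": "ball"},
    ]
}
```

Because the evaluator works on nodes rather than syntax, options that have no regex spelling could in principle be attached to individual nodes, which is the appeal of a structured query form.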

The advantages of working with a graphical representation are:
1) The user can review the parser's interpretation of the regex logic (for years Lucene silently failed on the \w syntax).
2) We can expose existing backend features like ignore_case with a checkbox, something we have failed to add to the simple string syntaxes like KQL and Lucene query_string. (Simple query string expressions can't scale in complexity because they require too many "special" control characters.)
3) We can expose Automaton features that go beyond regex syntax, e.g. fuzzy matching.

Conclusion

Complex character sequence matching is common practice in the security field and warrants a better UX than the cryptic, black-box approach of writing regex strings. These Elasticsearch changes would lay a foundation for big improvements in the user experience.

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Search)

javanna commented 2 years ago

Given our current roadmap and near-term focus, we are unlikely to get to this request anytime soon. This issue has been open for 2 years with no action taken, and from our triage, it doesn't look like it blocks any critical feature or implementation. We'll go ahead and close it but please re-open or add a comment if you are awaiting its resolution.