elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.6k stars 24.63k forks source link

EQL: Grammar improvements #54597

Closed costin closed 4 years ago

costin commented 4 years ago

In its current state the EQL grammar resembles the python version to support as much parsing as possible.

However it's clear that some cases cannot be supported so the question is whether we should:

  1. allow them in the grammar and throw an error when they occur This is the most friendly approach however the costly in terms of maintenance and it's worth evaluation whether said features are actually used or not.

  2. reduce the grammar and the code The most convenient in terms of support yet fairly unfriendly for the users.

Based on the existing queries and support, we should try and pick as many 2 as possible vs 1.

Below a list of things that stand out in the grammar:

SEQUENCE (by=joinKeys sequenceParams? | sequenceParams by=joinKeys?)?

A sequence can be declared with both join keys and sequence params but the order does not matter. However in all our examples and test suite only the first form is present. It also seems the most natural:

sequence by key with param vs sequence with param by key

The latter variant is redundant and a bit forced hence why I propose removing it.

The EQL tests allow fractional values and different time units to be specified:

sequence with 
maxspan=200
maxspan=2s
maxspan=2sec
maxspan=2seconds
maxspan=2.5m
maxspan=1.0075d

I haven't tried it but since it accepts a decimal, the scientific notation is legal:

with maxspan = 3.3733E+15ms

The vast range of options here is trappy - from leniency (no time unit) to multiple units in different forms and fractional numbers. I propose to: a. remove leniency - no time unit specified is a bug

b. clarify the supported time units - seconds, minutes, hours, days (is that too big) and their supported names.

c. move away from fractional numbers and keep only integers As we've experienced in ES, fractional time units are tricky since millis & co have 1000 units, seconds, minutes, hours = 60 units while days - 24 units. A large enough fraction will cause a unit to be broken down to a large amount of very small time unit making it hard to compute.

Currently all expression types can be compounded together even though that do not make sense. For example NOT 5 + 1 , TRUE + 5 or NOT length("mystring", -4)

While the engine does handle this at runtime, it's essentially allowing unsupported / undefined queries to be inspected only to be, in the best case scenario, ignored. I'd like to revisit the grammar and take the SQL approach by breaking down the expression into types ((e.g. boolean expressions) and thus do small enforcement into the grammar itself.

elasticmachine commented 4 years ago

Pinging @elastic/es-ql (:Query Languages/EQL)

rw-access commented 4 years ago

A sequence can be declared with both join keys and sequence params but the order does not matter. However in all our examples and test suite only the first form is present.

I think if anything, this shows how insufficient our test suite is. We have examples of both in the EQL analytics library, and internally. A quick search shows both permutations: https://eqllib.readthedocs.io/en/latest/search.html?q=maxspan&check_keywords=yes&area=default#

Personally, I think this adds very little extra complexity to the grammar, and I think it's better to be more forgiving to the user than to seek the slightest bit of extra purity.

maxspan number and time units

We'd have to create additional units if we wanted to require integer-only maximum spans. I think it's more desirable and clear to say 0.5sec than 500ms, so I'd prefer to keep it.

separate bool expressions from non-bool

We can try that, but I'm not totally sure if it'll completely work or how far it will get us. Once we get to functions, we won't know what type it is until we resolve the function and check its return type.

I think the main question with that is: is it better to have an uninformed syntax error when you type in not 5, or is it better to specifically handle it and raise a more informative error for a type mismatch? Here's how this looks presently in EQL if you have strict boolean typing on, and not implicit coercion to bools:

>>> import eql
>>> with eql.ParserConfig(implied_booleans=False):
...    eql.parse_expression("not 5")
...
eql.errors.EqlTypeMismatchError: Error at line:1,column:5
Expected boolean, not number
not 5
    ^

The SQL grammar has the same problem, since we allow predicated through as a boolean expression. Those errors aren't caught until the verifier. https://github.com/elastic/elasticsearch/blob/5d590b5be6ffaa28d5fd8c07fe8cda0cb94c8272/x-pack/plugin/sql/src/main/antlr/SqlBase.g4#L183

GET /_sql
{
  "query": """
  select * from "endgame-*" where not 5
  """
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "verification_exception",
        "reason" : "Found 1 problem\nline 2:35: argument of [not 5] must be [boolean], found value [5] type [integer]"
      }
    ],
    "type" : "verification_exception",
    "reason" : "Found 1 problem\nline 2:35: argument of [not 5] must be [boolean], found value [5] type [integer]"
  },
  "status" : 400
}
GET /_sql
{
  "query": """
  select * from "endgame-*" where 'a' + 5
  """
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "verification_exception",
        "reason" : "Found 1 problem\nline 2:35: first argument of ['a' + 5] must be [numeric], found value ['a'] type [keyword]"
      }
    ],
    "type" : "verification_exception",
    "reason" : "Found 1 problem\nline 2:35: first argument of ['a' + 5] must be [numeric], found value ['a'] type [keyword]"
  },
  "status" : 400
}
costin commented 4 years ago

Personally, I think this adds very little extra complexity to the grammar, and I think it's better to be more forgiving to the user than to seek the slightest bit of extra purity. https://eqllib.readthedocs.io/en/latest/search.html?q=maxspan&check_keywords=yes&area=default#

I found only one use of a rule that uses with before by or 1/9 or 11%. I'm not sure the complexity in the whole ecosystem is trivial as any tool that looks for the pattern would have to take it into account.

We'd have to create additional units if we wanted to require integer-only maximum spans. I think it's more desirable and clear to say 0.5sec than 500ms, so I'd prefer to keep it.

How many units do you think are needed? In terms of readability 10ms is better than 0.01s/sec, 15ms vs 0.015s or 3s vs 0.05m. I'm +1 on adding ms as a unit for precise control than supporting floating points and all the baggage that they bring including lack of precision, huge range and double notation.

The SQL grammar has the same problem, since we allow predicated through as a boolean expression.

SQL is a bit more loose and while we do a good job in EQL to break down things into booleanExpressions I wonder whether we can improve that even further. SQL has subqueries and projections, EQL does not, it always returns a bool inside queries - is that something we can take into account? It's not predicated that I see the issue, which in EQL is used for IN, but rather booleanExpression going to valueExpression going to primaryExpression which allows a constant. NOT TRUE is good but NOT 5 is not.

colings86 commented 4 years ago

We'd have to create additional units if we wanted to require integer-only maximum spans. I think it's more desirable and clear to say 0.5sec than 500ms, so I'd prefer to keep it.

We removed fractional time values from other areas of Elasticsearch because it brings serialisation round trip issues and rounding issues. I'd prefer to also not allow fractional values in EQL for similar reasons. I also think we may find it very hard to support in EQL exactly because the search API does not allow fractional values.

colings86 commented 4 years ago

I think it's better to be more forgiving to the user than to seek the slightest bit of extra purity.

I think allowing flexibility for the user is a good thing but it needs to be weighed with the cost of providing ambiguous or surprising behaviours. It's also mot a great experience if there are two similar but subtly different ways of doing something and it's not very clear to the user which to use in which instance. I'm not saying this is the case here but rather than flexibility comes at a cost even to the user and we should weight the pros and cons to the user experience

rw-access commented 4 years ago

My main point bringing in the SQL example was to show that it has the exact same problem with NOT 5, so it wasn't a unique problem to EQL.

I think we can reorganize the grammar for boolean expressions if that makes it easier to maintain, follows better practice, etc. But I'm wary of making destructive changes. One plus-side for catching not 5 in the verifier is that we have more informative error messages. Seeing an error that says "you used a non-boolean value here but should've used a bool" is more informative to a user that made a mistake than having an invalid syntax error.

I found only one use of a rule that uses with before by or 1/9 or 11%.

For external rules, yes. For internal rules, I'm not 100% sure but will have to check.

supporting floating points and all the baggage that they bring including lack of precision, huge range and double notation.

Agreed that floating points are a nightmare, so those are really good points. Are there limited forms of syntax (such as decimal-only, no exponent, etc.) that would restrict the usage to avoid any of those?

I'll do a quick look to see what the currently used values of maxspan are in production. I have a feeling that the only fractional usage will be in sub-second sequences, since there was no smaller interval. Adding ms sounds like a good add, both for Endpoint EQL and for Elasticsearch EQL.

Update: Only integer timespans are currently used in public and private EQL queries. I can't speak to what customers have in their own environments, but I think we're safe to remove float support for queries and to add the ms time unit.

Are there currently any problems with sub-second intervals in ES? What is the time unit typically used for @timestamp? I think we'll still have to support sub-second intervals. I'm okay if we have to do that by having integer multiples of more granular time units.

I think allowing flexibility for the user is a good thing but it needs to be weighed with the cost of providing ambiguous or surprising behaviours. It's also mot a great experience if there are two similar but subtly different ways of doing something and it's not very clear to the user which to use in which instance.

Agree 100%. In this particular case (with vs by ordering for sequences), the behavior is identical, so I think we're okay.

I have a feeling that we'll need to provide some type of migration script to update EQL rules from the Endpoint platform to work in Elasticsearch. Particularly for schema changes. I think we can also handle additional things like converting time-units to integer values, etc.

costin commented 4 years ago

As this ticket has been discussed and concluded, it can be closed.