ES|QL: support name qualifiers

astefan commented 2 months ago

Description

[ ] revisit https://github.com/elastic/elasticsearch/pull/110581 which removed the original implementation for qualifiers
[ ] implement qualifiers for LOOKUP https://github.com/elastic/elasticsearch/issues/112014
[ ] investigate and expand the name qualifiers to ENRICH https://github.com/elastic/elasticsearch/issues/112015

LOOKUP command is a powerful tool that helps users combine static tables of values with "live" data from Elasticsearch. It is consistent with the rest of the language features when it comes to same-name columns in that "the last column" wins.

For example (notice salary column that is present in employees index and salary column that is present in the static table benefits)

{
  "query": "FROM employees | WHERE languages IS NOT NULL | SORT salary DESC | KEEP first_name, last_name, languages, salary | LOOKUP benefits ON languages | LIMIT 5",
  "tables": {
    "benefits": {
      "languages":    {"integer": [1, 2, 3, 4, 5]},
      "salary":       {"integer": [100,200,300,400,500]},
      "bonus_percent":{"double":  [5,7,9,11,15]}
    }
  }
}

will result in

  first_name   |   last_name   |   languages   |    salary     | bonus_percent 
---------------+---------------+---------------+---------------+---------------
Moss           |Shanbhogue     |3              |300            |9.0            
Tzvetan        |Zielinski      |4              |400            |11.0           
Lillian        |Haddadi        |1              |100            |5.0            
Valter         |Sullins        |2              |200            |7.0            
Remzi          |Waschkowski    |3              |300            |9.0

salary column from employees is being replaced by the salary column from the benefits table. One can argue, though, that both columns are useful and must be kept. There is an workaround here and requires few changes to the query ("manually" copying the old values in a new column):

{
  "query": "FROM employees | WHERE languages IS NOT NULL | EVAL old_salary = salary | SORT old_salary DESC | KEEP first_name, last_name, languages, salary, old_salary | LOOKUP benefits ON languages | LIMIT 5",
  "tables": {
    "benefits": {
      "languages":    {"integer": [1, 2, 3, 4, 5]},
      "salary":       {"integer": [100,200,300,400,500]},
      "bonus_percent":{"double":  [5,7,9,11,15]}
    }
  }
}

which results in

  first_name   |   last_name   |   languages   |  old_salary   |    salary     | bonus_percent 
---------------+---------------+---------------+---------------+---------------+---------------
Moss           |Shanbhogue     |3              |74970          |300            |9.0            
Tzvetan        |Zielinski      |4              |74572          |400            |11.0           
Lillian        |Haddadi        |1              |73717          |100            |5.0            
Valter         |Sullins        |2              |73578          |200            |7.0            
Remzi          |Waschkowski    |3              |71165          |300            |9.0

But we can do better and have these additional steps be added automatically in the form of name qualifiers:

{
  "query": "FROM employees | WHERE languages IS NOT NULL | SORT salary DESC | KEEP first_name, last_name, languages, salary | LOOKUP benefits ON languages | LIMIT 5",
  "tables": {
    "benefits": {
      "languages":    {"integer": [1, 2, 3, 4, 5]},
      "salary":       {"integer": [100,200,300,400,500]},
      "bonus_percent":{"double":  [5,7,9,11,15]}
    }
  }
}

resulting in

  first_name   |   last_name   |   languages   |    salary     |employees.salary|benefits.salary| bonus_percent 
---------------+---------------+---------------+---------------+----------------+---------------+---------------
Moss           |Shanbhogue     |3              |74970          |74970           |300            |9.0            
Tzvetan        |Zielinski      |4              |74572          |74572           |400            |11.0           
Lillian        |Haddadi        |1              |73717          |73717           |100            |5.0            
Valter         |Sullins        |2              |73578          |73578           |200            |7.0            
Remzi          |Waschkowski    |3              |71165          |71165           |300            |9.0

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

tylerperk commented 2 months ago

Users will eventually need a way to provide their own name (alias) for the qualifier, as the index names can be quite long (.ds-metrics-prometheus.collector-mki-ksm-crd-scraper-2024.08.21-000645 for example). You wouldn't want to have to type that when referring to a column. It's a another nice layer on top of what's proposed above but I think we should think about it now to avoid making it harder later.

dnhatn commented 2 months ago

What happens if the main index resolves to more than one index? For example, if logs-* resolves to both logs-1 and logs-2, what would the columns be in this case?

luigidellaquila commented 2 months ago

We have at least two more options here:

explicitly choose (and in case rename) the lookup columns:

LOOKUP benefits ON languages WITH benefits.salary = salary, bonus_percent`

It's the same syntax we have in ENRICH, so in terms of consistency it seems desirable

declare a general prefix for all the fields extracted from the lookup:
```
LOOKUP benefits ON languages PREFIX `benefits.`
```

I think it's important to point out that any implicit behavior (eg. an implicit prefix as the name of the lookup table) will be a breaking change and won't avoid collisions (eg. employees could have a field called benefits.salary)

alex-spies commented 2 months ago

LOOKUP benefits ON languages WITH benefits.salary = salary, bonus_percent` 
It's the same syntax we have in ENRICH, so in terms of consistency it seems desirable

That'd be consistent, but the problem with the WITH syntax is that it requires explicit aliasing of each and every field. That applies even more so to ENRICH: if the enrich index has many fields, you may accidentally overwrite your fields, or you'll have to list each and every field you want. Qualifiers are one way to solve this, by avoiding name conflicts altogether.

elastic / elasticsearch

ES|QL: support name qualifiers #112016

Description