elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.49k stars 24.88k forks source link

ESQL: LOOKUP JOIN's output inconsistent with JOIN's #117584

Closed alex-spies closed 2 days ago

alex-spies commented 3 days ago

The output computation for LOOKUP JOIN is still under construction, so this is not unexpected; c.f. meta issue https://github.com/elastic/elasticsearch/issues/116208.

However, currently we compute the expected output of a LOOKUP JOIN in a specific method in the Analyzer - which conflicts with the computation inside the Join class. This leads to inconsistencies. I believe there really should be only one place where the computation occurs, ideally centralized in the Join class so it's not spilled into the analyzer.

Reproducer:

$ curl -u elastic:password -H "Content-Type: application/json" "127.0.0.1:9200/languages_lookup" -XPUT -d '{
  "mappings": {
        "properties": {"language_code": {"type": "keyword"}, "language_name":{"type": "keyword"}}}, "settings": {"index": {"mode": "lookup"}}}'

$ curl -u elastic:password -HContent-Type:application/json 'localhost:9200/test/_doc?refresh' -d'{"x": 1}'

$ curl -u elastic:password -H "Content-Type: application/json" "127.0.0.1:9200/_query?format=txt" -d '{
    "query": "FROM test | eval language_code = x::keyword | limit 10 | lookup join languages_lookup on language_code"
}'

{"error":{"root_cause":[{"type":"illegal_state_exception","reason":"Found 1 problem\nline 1:58: Plan [Project[[x{f}#20, language_code{r}#18, language_name{f}#22]]] optimized incorrectly due to missing references [language_code{r}#18]"}],"type":"illegal_state_exception","reason":"Found 1 problem\nline 1:58: Plan [Project[[x{f}#20, language_code{r}#18, language_name{f}#22]]] optimized incorrectly due to missing references [language_code{r}#18]"},"status":500}
elasticsearchmachine commented 3 days ago

Pinging @elastic/es-analytical-engine (Team:Analytics)

alex-spies commented 3 days ago

This reproduces very nicely when the query does not discard the field that was join on; this works as reproducer in the csv test:

FROM employees
| SORT emp_no
| LIMIT 3
| EVAL language_code = languages
| LOOKUP JOIN languages_lookup ON language_code
| KEEP emp_no, language_code, language_name
;