elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ST_DISTANCE pushdown when distance is defined in EVAL has unnecessary field extraction and performance problems #114310

Open craigtaverner opened 2 days ago

craigtaverner commented 2 days ago

The work in https://github.com/elastic/elasticsearch/pull/112938 enables pushdown of ST_DISTANCE to Lucene for both filtering and sorting, including when the distance result is defined in a separate EVAL command:

FROM index
| EVAL distance=ST_DISTANCE(location, TO_GEOPOINT("POINT(0 0)"))
| WHERE distance < 20000
| SORT distance ASC
| KEEP name

In this query, the final KEEP drops the distance attribute. The same would happen if we did a STATS:

FROM index
| EVAL distance=ST_DISTANCE(location, TO_GEOPOINT("POINT(0 0)"))
| WHERE distance < 20000
| STATS count=COUNT(*) BY country
| SORT count DESC, country ASC

In both cases, the distance value does not need to be calculated, because the ST_DISTANCE function is pushed down to Lucene entirely. However, with the work done in https://github.com/elastic/elasticsearch/pull/112938, the distance column remains in the table of results all the way up to the KEEP or STATS command and is only dropped there. As a consequence, we are still much slower than we need to be, because we perform an unnecessary FieldExtract(location) and an unnecessary ST_DISTANCE(location) evaluation, only to drop those values. Early benchmarks show that when we push down without the EVAL command, we are at least 7x faster than this.
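
For comparison, the STATS query above could presumably be written without the EVAL command by putting the ST_DISTANCE call directly in the WHERE clause. This is only an illustrative sketch of that form, reusing the same index and field names; it is not necessarily the exact query used in the benchmarks:

```
FROM index
| WHERE ST_DISTANCE(location, TO_GEOPOINT("POINT(0 0)")) < 20000
| STATS count=COUNT(*) BY country
| SORT count DESC, country ASC
```

Here no distance column is ever defined, so there is nothing to carry up to the STATS command and drop afterwards.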

elasticsearchmachine commented 2 days ago

Pinging @elastic/es-analytical-engine (Team:Analytics)