What kind an issue is this?

[X] Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
The easier it is to track down the bug, the faster it is solved.
[ ] Feature Request. Start by telling us what problem you’re trying to solve.
Often a solution already exists! Don’t send pull requests to implement new features without
first getting our support. Sometimes we leave features out on purpose to keep the project small.
Issue description
According to the documentation, the preferred method for subsetting fields in a query through Spark SQL is the 'es.read.field.include' option (see Reading DataFrames - Controlling the DataFrame schema), and the docs state that filtering options should be pushed down into the Elasticsearch query. However, when using this option alone, the actual queries sent to Elasticsearch DO NOT include a source-filtering option: all fields are fetched, and the requested subset is applied only after the data is returned.
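For reference, the read described above was set up roughly like this (a sketch only: the endpoint, index, and field names are hypothetical placeholders, not taken from the original report):

```python
# Sketch: PySpark read with field inclusion; endpoint/index/field names are hypothetical.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "https://example-es-cluster:9200")   # assumed endpoint
      .option("es.read.field.include", "name,timestamp")       # fields to subset
      .load("my-index"))
df.count()  # all fields are still fetched; the subset is applied client-side
```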
Built-in DataFrame methods like DataFrame.select(<field>) do not modify the underlying query sent to Elasticsearch either.

Adding a "_source" parameter to the query passed via DataFrameReader.option('es.query', <query>) is not forwarded to the underlying query either, although mappings under the "query" key are.
Finally, using the 'es.read.source.filter' option does modify the query sent to Elasticsearch (by adding a "_source" parameter), but using it results in an error when the DataFrame is operated on:
User specified source filters were found [name,timestamp], but the connector is executing in a state where it has provided its own source filtering [name,timestamp,location.address]. Please clear the user specified source fields under the [es.read.source.filter] property to continue. Bailing out...
which is addressed in the docs here.
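The source-filter variant was along these lines (sketch; same placeholder names):

```python
# Sketch: es.read.source.filter does push a "_source" filter down, but any
# action on the DataFrame then fails with the error quoted above.
df3 = (spark.read
       .format("org.elasticsearch.spark.sql")
       .option("es.read.source.filter", "name,timestamp")
       .load("my-index"))
df3.count()  # "User specified source filters were found [...] Bailing out..."
```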
Steps to reproduce
Code: see the three read variants sketched above under Issue description.
Stack trace: No errors were raised. However, after setting logging on org.elasticsearch.hadoop.rest to TRACE, I saw the following in the logs for each method. Given these logs, and the time required to return the data, it seems that no field/_source filter is pushed down.
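The TRACE logging above was enabled roughly as follows (a sketch for a Databricks/PySpark notebook; Spark 3.2 still ships the log4j 1.x API assumed here):

```python
# Sketch: raise the es-hadoop REST layer to TRACE via the JVM's log4j 1.x API.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(log4j.Level.TRACE)
```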
Version Info
OS:           Databricks Runtime 10.4 LTS ML (runs on Ubuntu)
JVM:          1.8.0_382
Hadoop/Spark: 3.2.1
ES-Hadoop:    8.10.0
ES:           8.12