arangodb / arangodb-spark-datasource

ArangoDB Connector for Apache Spark, using the Spark DataSource API
Apache License 2.0

Why is the view created without the WHERE condition when loading from an AQL query? #36

Closed EbiousVi closed 1 year ago

EbiousVi commented 1 year ago

Why is the view created without the WHERE condition when loading from an AQL query?

We use:

    // Read from a custom AQL query (the "query" option), not from a collection.
    Map<String, String> opts = new HashMap<>();
    opts.put("query", "for user in users return user");

    spark
            .read()
            .format("com.arangodb.spark")
            .options(opts)
            .load()
            .where("`active` == false") // filter expected to be applied by Spark
            .createOrReplaceTempView("v");

    Dataset<Row> sql = spark.sql("select `active` from v");
    sql.show(); // returns rows as if the `active` == false filter were absent

Thanks!

rashtao commented 1 year ago

Thanks for reporting. I verified that this is a bug in the connector.

Note that the expected behavior in this case would be applying the filter on the Spark side and not pushing it down to ArangoDB. Therefore all documents returned by the query will be transferred. As stated in the documentation:

Predicate and projection pushdowns are only performed while reading an ArangoDB collection (set by the table configuration parameter). In case of a batch read from a custom query (set by the query configuration parameter), no pushdown optimizations are performed.

For this specific case, reading from the users collection would be a better choice (i.e. setting the option table: users). In this case the filter would be pushed down to ArangoDB.
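A minimal sketch of the two read modes, for illustration only. The helper class `ArangoReadOptions` and its method names are hypothetical and not part of the connector; only the `table` and `query` option keys and the `com.arangodb.spark` format name come from this thread. Connection options (endpoints, credentials) are omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of the connector): it only builds reader options.
final class ArangoReadOptions {

    // Options for reading a whole collection. Per the documentation quoted
    // above, this is the mode in which predicates can be pushed down.
    static Map<String, String> forCollection(String collection) {
        Map<String, String> opts = new HashMap<>();
        opts.put("table", collection);
        return opts;
    }

    // Options for reading from a custom AQL query. No pushdown is performed,
    // so any filter has to be applied by Spark after loading.
    static Map<String, String> forQuery(String aql) {
        Map<String, String> opts = new HashMap<>();
        opts.put("query", aql);
        return opts;
    }

    private ArangoReadOptions() {}
}

// Usage with Spark (assumes an existing SparkSession named `spark`):
//   spark.read()
//        .format("com.arangodb.spark")
//        .options(ArangoReadOptions.forCollection("users"))
//        .load()
//        .where("`active` == false")
//        .createOrReplaceTempView("v");
```

With the collection-based options, the `where` clause is a candidate for pushdown; with the query-based options, the same clause would have to be evaluated on the Spark side.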

EbiousVi commented 1 year ago

Sorry, maybe I phrased my question wrong.

Most likely, my problem is not related to pushdown. Rather, the filter is not applied at the Spark level at all: Spark SQL does not filter the data in the view generated from the AQL query. My AQL query is only an illustration; this behavior occurs with any query.

Non-working example: [screenshot: not_working_example]

Working example (with .persist() it works as expected): [screenshot: working_example]