Reading data set with date filter does not work

stijndehaes commented 3 years ago

The following minimal example results in an error:

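from datetime import date
import random

from pyspark.sql.functions import col

# `spark` is the SparkSession provided by the notebook/shell.
# Build 100 rows that all share today's date as the partition value.
source_data = []
for i in range(100):
    source_data.append((random.randint(18, 65), date.today()))

# Write a Parquet table partitioned by the date column.
spark.createDataFrame(source_data, ["age", "date"]).write.mode("overwrite").partitionBy("date").format("parquet").saveAsTable("testtable")

# Filtering on the date partition column triggers the error.
spark.table("testtable").filter(col("date") == date.today()).show()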

An error was encountered:
org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2021 - 06 - 07' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 6c69a255-ec72-4e0e-9908-30f9a3bcff7c; Proxy: null)
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2021 - 06 - 07' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 6c69a255-ec72-4e0e-9908-30f9a3bcff7c; Proxy: null)

This fails both with this library on Spark 3.1.x and on EMR 6.3.

The failure is caused by the following change in Spark 3.1.x: https://issues.apache.org/jira/browse/SPARK-33477

parisni commented 3 years ago

As a workaround, turn off metastore partition pruning:

spark.sql.hive.metastorePartitionPruning false
spark.sql.hive.convertMetastoreParquet false
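
For anyone who sets configuration from code rather than from a defaults file, here is a minimal sketch of applying the same two settings. The names `WORKAROUND_CONF`, `apply_workaround` and `as_submit_args` are made up for illustration; only the two configuration keys come from the comment above. It assumes the builder passed in is `SparkSession.builder` (anything with a compatible `.config(key, value)` method works).

```python
# The two settings from the workaround above, as key/value pairs.
WORKAROUND_CONF = {
    "spark.sql.hive.metastorePartitionPruning": "false",
    "spark.sql.hive.convertMetastoreParquet": "false",
}


def apply_workaround(builder):
    """Apply the workaround settings to a SparkSession builder.

    `builder` is assumed to be `SparkSession.builder`, whose
    `.config(key, value)` method returns the builder itself.
    """
    for key, value in WORKAROUND_CONF.items():
        builder = builder.config(key, value)
    return builder


def as_submit_args(conf=WORKAROUND_CONF):
    """Render the same settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

In a PySpark script this would be used as `apply_workaround(SparkSession.builder).enableHiveSupport().getOrCreate()`. Note that with pruning turned off, Spark has to fetch all partitions from the catalog and filter them itself, which may be slow on tables with many partitions.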

map9000 commented 2 years ago

@stijndehaes SPARK-33477 added Hive partition pruning support for the date type back in 2020, in PR #30408 ([SPARK-33477][SQL] Hive Metastore support filter by date type): https://github.com/apache/spark/pull/30408. Why would this prevent Glue 3 from filtering by date type? (It works if you convert the date to a string.) Is the real cause that Glue's support for Hive needs to be updated?
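
One possible reading of "convert date to string" is to compare the partition column as a string instead of as a date. The sketch below builds such a predicate as a SQL expression string that can be passed to `DataFrame.filter`. The helper name `date_as_string_predicate` is hypothetical, and the thread does not confirm that this exact form is what avoids the Glue error.

```python
from datetime import date


def date_as_string_predicate(column, day):
    """Build a SQL predicate that compares a date column as a string.

    Hypothetical helper: it casts the column to STRING and compares it
    with the ISO-formatted date, so no date literal appears in the filter.
    """
    return f"CAST(`{column}` AS STRING) = '{day.isoformat()}'"


# Example predicate for the date shown in the error message above.
predicate = date_as_string_predicate("date", date(2021, 6, 7))
```

With the table from the original example, this would be used as `spark.table("testtable").filter(predicate).show()`.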

stijndehaes commented 1 year ago

I created a PR against Spark to fix this; it has just been accepted: https://github.com/apache/spark/pull/41035

parisni commented 1 year ago

@stijndehaes great. Any chance the fix could be backported to 3.* instead of only 3.5, as mentioned in the JIRA? https://issues.apache.org/jira/browse/SPARK-43357

stijndehaes commented 1 year ago

> @stijndehaes great. Any chance the fix could be backported to 3.* instead of only 3.5, as mentioned in the JIRA? https://issues.apache.org/jira/browse/SPARK-43357

Good idea, I just asked on the JIRA ticket if that is possible.

Repository: awslabs / aws-glue-data-catalog-client-for-apache-hive-metastore (issue #45)