awslabs / aws-glue-data-catalog-client-for-apache-hive-metastore

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore-compatible metadata repository. Customers can use the Data Catalog as a central repository to store structural and operational metadata for their data. AWS Glue provides out-of-the-box integration with Amazon EMR that enables customers to use the AWS Glue Data Catalog as an external Hive Metastore. This is an open-source implementation of the Apache Hive Metastore client on Amazon EMR clusters that uses the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog. It may be ported to other Hive Metastore-compatible platforms, such as other Hadoop and Apache Spark distributions.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Apache License 2.0

Reading data set with date filter does not work #45

Open stijndehaes opened 3 years ago

stijndehaes commented 3 years ago

The following minimal example results in an error:

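from datetime import date
import random

from pyspark.sql.functions import col

# `spark` is assumed to be the SparkSession that the shell or notebook already provides.
source_data = [(random.randint(18, 65), date.today()) for _ in range(100)]

# Write a Parquet table partitioned by a date-typed column.
(
    spark.createDataFrame(source_data, ["age", "date"])
    .write.mode("overwrite")
    .partitionBy("date")
    .format("parquet")
    .saveAsTable("testtable")
)
# Filtering on the date partition column triggers the error below.
spark.table("testtable").filter(col("date") == date.today()).show()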

An error was encountered:
org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2021 - 06 - 07' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 6c69a255-ec72-4e0e-9908-30f9a3bcff7c; Proxy: null)
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unsupported expression '2021 - 06 - 07' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 6c69a255-ec72-4e0e-9908-30f9a3bcff7c; Proxy: null)

This fails both with this library on Spark 3.1.x and on EMR 6.3.

The failure is caused by the following change in Spark 3.1.x: https://issues.apache.org/jira/browse/SPARK-33477

parisni commented 3 years ago

As a workaround, turn off partition pruning:

spark.sql.hive.metastorePartitionPruning false
spark.sql.hive.convertMetastoreParquet false
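
For anyone applying the two settings above from PySpark rather than from a properties file, a minimal sketch follows. The helper names `apply_workaround` and `as_spark_submit_args` are hypothetical and only illustrate the idea; `builder.config(key, value)` and `--conf key=value` are the standard Spark ways to pass configuration.

```python
# The two workaround settings quoted above, as key/value pairs.
WORKAROUND_CONF = {
    "spark.sql.hive.metastorePartitionPruning": "false",
    "spark.sql.hive.convertMetastoreParquet": "false",
}


def apply_workaround(builder, conf=WORKAROUND_CONF):
    """Apply each setting to a SparkSession.builder-like object and return it.

    Hypothetical helper: it only relies on the builder exposing
    `.config(key, value)`, which SparkSession.builder does.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def as_spark_submit_args(conf=WORKAROUND_CONF):
    """Render the same settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Usage, assuming pyspark is installed and Hive support is available:
#   from pyspark.sql import SparkSession
#   spark = apply_workaround(SparkSession.builder.enableHiveSupport()).getOrCreate()
```

Note that with pruning turned off, Spark may list every partition of the table before filtering, so this can be slow on tables with many partitions.
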

map9000 commented 2 years ago

@stijndehaes SPARK-33477 added Hive partition pruning support for the date type back in 2020, in "[SPARK-33477][SQL] Hive Metastore support filter by date type" #30408 (https://github.com/apache/spark/pull/30408). Why would this prevent Glue 3 from filtering by date type? (It works if you convert the date to a string.) Is the real cause of this issue that Glue's support for Hive needs to be updated?
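
One possible reading of "convert date to string" is to store the partition column as an ISO-formatted string instead of a date, so the filter sent to the metastore is a plain string comparison. The sketch below is an assumption about what the commenter meant, not a confirmed fix, and `stringify_dates` is a hypothetical helper name.

```python
from datetime import date


def stringify_dates(rows):
    """Return a copy of `rows` with every datetime.date value replaced by its ISO string.

    Hypothetical helper: the partition column then gets a string type
    rather than a date type when the rows are written as a table.
    """
    return [
        tuple(v.isoformat() if isinstance(v, date) else v for v in row)
        for row in rows
    ]


# Usage with the example from the issue, assuming an existing `spark` session:
#   source_data = stringify_dates(source_data)
#   spark.createDataFrame(source_data, ["age", "date"]) \
#       .write.mode("overwrite").partitionBy("date") \
#       .format("parquet").saveAsTable("testtable")
#   spark.table("testtable").filter(col("date") == date.today().isoformat()).show()
```

The trade-off is that the column loses its date type, so date functions would need an explicit cast when reading the table back.
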

stijndehaes commented 1 year ago

I created a PR to Spark to fix this, and it has just been accepted: https://github.com/apache/spark/pull/41035

parisni commented 1 year ago

@stijndehaes Great. Any chance the fix could be backported to 3.* instead of only 3.5, as mentioned in the JIRA? https://issues.apache.org/jira/browse/SPARK-43357

stijndehaes commented 1 year ago

Good idea. I just asked on the JIRA ticket whether that is possible.