awslabs / aws-glue-data-catalog-client-for-apache-hive-metastore

The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible metadata repository. Customers can use the Data Catalog as a central repository to store structural and operational metadata for their data. AWS Glue provides out-of-the-box integration with Amazon EMR that enables customers to use the AWS Glue Data Catalog as an external Hive Metastore. This project is an open-source implementation of the Apache Hive Metastore client, used on Amazon EMR clusters, that uses the AWS Glue Data Catalog as an external Hive Metastore. It serves as a reference implementation for building a Hive Metastore-compatible client that connects to the AWS Glue Data Catalog, and it may be ported to other Hive Metastore-compatible platforms such as other Hadoop and Apache Spark distributions.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Apache License 2.0
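
For reference, the EMR integration mentioned above works by pointing the Hive Metastore client factory at the Glue implementation shipped in this project. Below is a minimal PySpark sketch of that idea, assuming the Glue client jars are already on the classpath (as they are on an EMR cluster); depending on Spark version and how the session is created, the config key may need the spark.hadoop. prefix.

from pyspark.sql import SparkSession

# Point the Hive metastore client at the Glue Data Catalog implementation.
spark = (
    SparkSession.builder
    .appName("glue-catalog-example")
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()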

Slow SQL performance #42

Open talalryz opened 3 years ago

talalryz commented 3 years ago

Running,

spark.read.table('database.table').limit(10).show()

is a lot faster than running,

spark.sql('SELECT * from database.table limit 10')

Intuitively, we would expect both of these operations to have similar run times. Looking a bit deeper, it seems that spark.sql is forcing a full file scan, whereas spark.read.table.limit does not. The problem extends to filtering on partition columns as well: spark.sql('SELECT * from database.table where partition_col=<value>') also forces a full table scan, while spark.read.table.filter does not.
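
One way to see where the difference comes from is to compare the physical plans of the two approaches and check whether the partition predicate shows up under PartitionFilters (pruned at planning time) or only as a post-scan Filter. A small sketch, assuming an existing spark session and the table/column names used above:

# Compare how Spark plans each variant of the same partition-filtered read.
df_api = spark.read.table('database.table').filter("partition_col = 'value'")
df_sql = spark.sql("SELECT * from database.table where partition_col = 'value'")

df_api.explain(True)   # look for the predicate under PartitionFilters
df_sql.explain(True)   # if it appears only as a Filter, the scan is not pruned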

Is there something I could be missing, e.g. a Spark configuration that could be causing this, or is this a known issue?

pancodia commented 1 year ago

I see the same slow performance when using PySpark on EMR with the Glue metastore, but in my case there is no difference between the two query methods.

I also found that if I filter by partition columns, the amount of data scanned is reduced and the query speeds up.
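
A quick way to confirm that the partition filter really limits what gets read is to look at which files back the rows that come back. A hedged sketch, reusing the hypothetical names from the original post:

from pyspark.sql import functions as F

filtered = spark.sql("SELECT * from database.table where partition_col = 'value'")
# With effective partition pruning, only files under the matching partition
# directory should show up here.
filtered.select(F.input_file_name()).distinct().show(truncate=False)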

heetu commented 1 year ago

Hi @talalryz and @pancodia, did you reach any conclusion?