possible to use glue catalog ?

awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

possible to use glue catalog ? #62

Closed moshir closed 2 years ago

moshir commented 4 years ago

Hi,

Is it possible to use aws-glue-libs with the Glue catalog? I was able to read a DataFrame from the Glue catalog. However, I could not find a way to query the catalog from spark.sql, for example:

spark.sql("show databases") # returns default
spark.sql("use mydatabasefromawsglue") # pyspark.sql.utils.AnalysisException: u"Database 'mydatabasefromawsglue' not found;"

Any clue ?

PaulBurridge commented 4 years ago

You can create a DynamicFrame directly from the catalog and then use toDF() to convert it to a DataFrame.

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(database="xxx", table_name="xxx", transformation_ctx="datasource0")

data_frame = dynamic_frame.toDF()
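As a possible follow-up to the snippet above, the resulting DataFrame can be registered as a temporary view so that spark.sql can at least query that one table. This is a minimal sketch, not something from aws-glue-libs itself: `register_catalog_table` is a hypothetical helper name, and it assumes `glue_context` is an already-initialised GlueContext. It does not make `show databases` list Glue databases; it only exposes one table at a time.

```python
def register_catalog_table(glue_context, database, table_name, view_name=None):
    """Load one Glue catalog table and expose it to spark.sql as a temp view.

    Hypothetical helper for illustration; `glue_context` is assumed to be an
    awsglue.context.GlueContext created elsewhere.
    """
    # Default the view name to the table name if none is given.
    view_name = view_name or table_name

    # Same call as in the comment above: read the table as a DynamicFrame.
    dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )

    # Convert to a regular Spark DataFrame and register it for SQL access.
    data_frame = dynamic_frame.toDF()
    data_frame.createOrReplaceTempView(view_name)
    return view_name


# Usage (assumes `glueContext` and `spark` already exist):
#   register_catalog_table(glueContext, "xxx", "xxx", view_name="my_table")
#   spark.sql("select * from my_table limit 10").show()
```

The view lives only in the current Spark session, so each table you want to query through spark.sql has to be registered this way first.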

moshir commented 4 years ago

Thank you Paul, however why is spark.sql("show databases") not working?

PaulBurridge commented 4 years ago

Just a guess, but AWS has its own PySpark extensions, which include code to support things like the Glue Catalog and to assist with authentication etc. Outside of these objects, the normal PySpark objects are not aware of the Glue Catalog.

As the Glue Catalog is compatible with Apache Hive Metastore you may be able to connect via a method like this. Not tested myself, but good luck.

svajiraya commented 4 years ago

@moshir I think what you're trying to achieve can be done by patching the Spark build you're using. Follow the guide at https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore

I don't have a script handy to automate this, but I have done this in the past and I can confirm that it works.

When you run spark.sql, it doesn't access the Glue libs directly but relies on the Hive metastore config in Spark. By default, Spark uses an embedded Apache Derby metastore when no metastore config is specified. For a MySQL or other DB-based metastore, you can just inject the config. However, for the Glue metastore you will have to patch the jars in the Spark build to get this to work.
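To illustrate the "inject the config" part, here is a rough sketch of how a session might be pointed at the Glue catalog once the jars are patched. The factory class name is the one documented in the aws-glue-data-catalog-client-for-apache-hive-metastore README; passing it through the `spark.hadoop.` prefix, and the helper names `glue_catalog_conf` and `build_session`, are assumptions for illustration. Without the patched jars on the classpath this config alone will not work.

```python
# Client factory class documented by the Glue Data Catalog client project.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_conf():
    """Return the Spark settings assumed to route the metastore to Glue."""
    return {
        # Use the Hive catalog implementation instead of the in-memory one.
        "spark.sql.catalogImplementation": "hive",
        # Assumption: forwarded to the Hive client via the spark.hadoop. prefix.
        "spark.hadoop.hive.metastore.client.factory.class": GLUE_FACTORY,
    }


def build_session(app_name="glue-catalog-demo"):
    """Build a Hive-enabled SparkSession with the settings above applied."""
    # Imported lazily so the config helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in glue_catalog_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage (requires a patched Spark build and valid AWS credentials):
#   spark = build_session()
#   spark.sql("show databases").show()
```

Keeping the settings in a plain dictionary makes it easy to reuse the same values in spark-defaults.conf or on the spark-submit command line instead.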

moomindani commented 2 years ago

The latest Docker images support Glue Data Catalog integration natively. https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/

When you use these containers, you do not need to install the above catalog client manually.