Closed: moshir closed this issue 2 years ago
You can create a DynamicFrame directly from the catalog and then use toDF()
to convert it to a DataFrame.
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(database="xxx", table_name="xxx", transformation_ctx="datasource0")
data_frame = dynamic_frame.toDF()
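Filled out with the surrounding boilerplate, that snippet typically looks like the sketch below. This assumes an aws-glue-libs (or Glue job) environment; the database and table names are placeholders.

```python
# Sketch only: requires an aws-glue-libs / AWS Glue environment to run.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a catalog table into a DynamicFrame, then convert to a Spark DataFrame.
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",          # placeholder
    table_name="my_table",           # placeholder
    transformation_ctx="datasource0",
)
data_frame = dynamic_frame.toDF()
data_frame.printSchema()
```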
Thank you Paul. However, why is spark.sql("show databases")
not working?
Just a guess, but AWS has its own PySpark extensions with code that supports things like the Glue Catalog and assists with authentication. Outside of these objects, the normal PySpark objects are not aware of the Glue Catalog.
As the Glue Catalog is compatible with the Apache Hive Metastore, you may be able to connect via a method like this. I haven't tested it myself, but good luck.
@moshir I think what you're trying to achieve can be done by patching the Spark build you're using. Follow the guide at https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore
I don't have a script handy to automate this, but I have done this in the past and I can confirm that it works.
When you run spark.sql, Spark doesn't access the Glue libs directly; it relies on the Hive metastore configuration in Spark. By default, Spark uses the embedded Apache Derby metastore when no metastore config is specified. For MySQL or another DB-backed metastore, you can simply inject the config. For the Glue metastore, however, you have to patch the jars in the Spark build to get this to work.
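To make that concrete, here is a sketch of the kind of spark-defaults.conf settings involved. All hostnames, database names, and credentials below are illustrative placeholders; the Glue client factory class is the one shipped by the catalog client repository linked above.

```
# spark-defaults.conf (sketch; hostnames and credentials are placeholders)

# A DB-backed metastore (e.g. MySQL) can be wired in purely via config:
spark.sql.catalogImplementation                      hive
spark.hadoop.javax.jdo.option.ConnectionURL          jdbc:mysql://metastore-host:3306/hive_metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName   com.mysql.cj.jdbc.Driver
spark.hadoop.javax.jdo.option.ConnectionUserName     hive
spark.hadoop.javax.jdo.option.ConnectionPassword     hive_password

# The Glue Data Catalog instead needs the patched client jars on the
# classpath, plus a client factory setting along these lines:
# spark.hadoop.hive.metastore.client.factory.class   com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```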
The latest Docker images support Glue Data Catalog integration natively: https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/
When you use these containers, you do not need to install the catalog client above manually.
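For reference, a typical invocation along the lines of that blog post looks like the sketch below. The image tag is the Glue 3.0 one; the AWS profile name and published ports are placeholders to adapt, and this requires Docker plus valid AWS credentials.

```shell
# Sketch: start a local PySpark REPL from the Glue 3.0 image, with local AWS
# credentials mounted so the Glue Data Catalog is reachable from spark.sql.
docker run -it \
  -v ~/.aws:/home/glue_user/.aws \
  -e AWS_PROFILE=default \
  -p 4040:4040 \
  --name glue_pyspark \
  amazon/aws-glue-libs:glue_libs_3.0.0_image_01 \
  pyspark
```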
Hi,
is it possible to use aws-glue-libs with the Glue catalog? I was able to read a DataFrame from the Glue catalog. However, I could not find a way to query the catalog from spark.sql, for example spark.sql("show databases").
Any clue?