Open psolomin opened 1 year ago
@psolomin Do you have specific use cases in mind? Reading tables from the Glue metastore sounds like a useful integration! I'm not too sure about writing though; it feels a bit like that would conflict with the purpose of Glue preparing that data itself ... not sure though, I haven't used Glue much.
On the practical side, I'd expect more and more S3-backed tables catalogued in Glue to migrate to Iceberg / Hudi rather than using the old Hive format. Beam not having support for these newer table formats might limit the value of such an IO. I've been thinking about working on an IcebergIO for Beam, but unfortunately won't have time for it in the near future.
@mosche
Do you have specific use cases in mind?
Yeah, let me name some:
Writing is trickier because in that case Beam would need to edit Glue catalog objects
Glue preparing that data itself
"Glue" is actually multiple things: catalog (similar to Hive Metastore), Glue jobs (AWS proprietary version of serverless Spark), Glue crawlers, etc. For now this feature request is about Glue catalog only.
Iceberg
That one will be very useful, yes. Trino, Flink and Spark already have Iceberg support. Adding support for Iceberg would bring more value than my feature request, I would say. And, as I remember, Iceberg can work without a catalog or Hive Metastore and use table locations where both data and metadata are stored, which means Iceberg support does not strictly require catalog support. Still, in Spark I've mostly seen Iceberg used with either HMS or the Glue catalog (`select .. from fact.orders ..`), with much less usage of plain table locations (`spark.read.iceberg("s3://abc/fact.db/orders")`), so catalog support will become necessary.
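To make the distinction concrete, here is a minimal Java sketch using Iceberg's API (the S3 location and table name are taken from the examples above and are purely illustrative): the first call needs no catalog at all, while the second resolves a logical table name through whichever catalog (HMS, Glue, ...) is configured.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;

public class IcebergAccessModes {
  static void example(Catalog glueOrHiveCatalog) {
    // 1. Catalog-less access: the table location holds both data and metadata,
    //    so no HMS / Glue lookup is involved (location is made up).
    Table byLocation = new HadoopTables(new Configuration())
        .load("s3://abc/fact.db/orders");

    // 2. Catalog-based access: a catalog (HMS, Glue, ...) resolves the logical
    //    name "fact.orders" to the table's metadata location.
    Table byName = glueOrHiveCatalog.loadTable(TableIdentifier.of("fact", "orders"));
  }
}
```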
@psolomin fyi https://github.com/apache/beam/issues/20327
My Java / Maven is a bit rusty, but as far as I can tell, version 1.6.0 of `org.apache.iceberg:iceberg-core` has support for a `catalog_type` of Glue. I'm probably being a bit too naive here: would a simple dependency bump give support?
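If so, a minimal sketch of how Iceberg's own API wires up a Glue-backed catalog might look like the following. This is Iceberg's `CatalogUtil`, not a Beam API; it assumes the `iceberg-aws` module and the AWS SDK Glue classes are on the classpath, and the warehouse path is made up.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

public class GlueBackedIcebergCatalog {
  public static void main(String[] args) {
    // "type" = "glue" makes CatalogUtil instantiate
    // org.apache.iceberg.aws.glue.GlueCatalog from the iceberg-aws module.
    Map<String, String> props = new HashMap<>();
    props.put("type", "glue");
    props.put("warehouse", "s3://abc/warehouse"); // made-up bucket, for illustration

    Catalog glueCatalog =
        CatalogUtil.buildIcebergCatalog("glue_catalog", props, new Configuration());
  }
}
```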
What would you like to happen?
Overview
Glue catalog is a serverless metadata store (databases, tables, schemas, partitions, connections, etc). More: https://docs.aws.amazon.com/glue/latest/dg/glue-connections.html
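For context, a rough sketch of the kind of metadata such an IO could read from the catalog, using the AWS SDK v2 Glue client (the database and table names below are made up):

```java
import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTableRequest;
import software.amazon.awssdk.services.glue.model.Table;

public class GlueCatalogLookup {
  public static void main(String[] args) {
    try (GlueClient glue = GlueClient.create()) {
      // Fetch the catalog entry for a (hypothetical) table "fact.orders".
      Table table = glue.getTable(GetTableRequest.builder()
              .databaseName("fact")
              .name("orders")
              .build())
          .table();

      // The catalog entry holds the schema, data location, format, etc.
      System.out.println("Location: " + table.storageDescriptor().location());
      table.storageDescriptor().columns()
          .forEach(c -> System.out.println(c.name() + ": " + c.type()));
    }
  }
}
```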
It would be cool to have Beam support something like the sketch below:
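A purely hypothetical sketch (`GlueIO`, `withDatabase` and `withTable` are invented names for this feature request, not an existing Beam API):

```java
// Hypothetical API sketch: GlueIO does not exist in Beam today, and the
// GlueIO.read() / withDatabase / withTable names below are invented.
Pipeline p = Pipeline.create(options);

PCollection<Row> orders = p.apply("ReadOrders",
    GlueIO.read()
        .withDatabase("fact")    // Glue database name (illustrative)
        .withTable("orders"));   // Glue table name (illustrative)
// Schema, data location and format would come from the Glue catalog entry
// rather than being configured by hand in the pipeline.

p.run().waitUntilFinish();
```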
Other existing implementations
Trino supports the Glue catalog too - https://trino.io/docs/current/connector/hive.html - in a similar fashion to Spark: it uses the catalog as a replacement for the metadata of tables that are stored in some filesystem.
AWS Glue jobs (AWS's fork of Spark) support other types of storage: Mongo, RDS, etc
Flink seems to have it as work-in-progress: https://github.com/apache/flink-connector-aws/pull/47
Notes on possible implementation
Beam has an HCatalogIO implementation - https://beam.apache.org/documentation/io/built-in/hcatalog/ - but it does not seem to be a good place for GlueIO:
- GlueIO can support more storage types besides file systems - adding such support into a dedicated GlueIO can be easier
- HCatalogIO currently doesn't have any machinery for AWS auth, coders, etc

Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components