
[Feature Request]: Support read from / write to AWS Glue catalog tables backed by AWS S3 data #26429

Open psolomin opened 1 year ago

psolomin commented 1 year ago

What would you like to happen?

Overview

The Glue catalog is a serverless metadata store (databases, tables, schemas, partitions, connectors, etc.). More: https://docs.aws.amazon.com/glue/latest/dg/glue-connections.html

It would be cool to have Beam support something like:

GlueIO.write()
      .withClientConfiguration(awsClientConfiguration)
      .withDatabaseName("my_glue_db")
      .withTableName("my_glue_table")
      .withIOType(org.apache.beam.sdk.io.FileIO.class)
      .withIOConfig(... configs for FileIO / JdbcIO / ... )
      .withSchemaUpdateStrategy(ADD_NEW_COLUMNS | DISABLED)
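
Reading could mirror that (all names here are, of course, hypothetical - this is the API being proposed, nothing in Beam exists yet):

GlueIO.read()
      .withClientConfiguration(awsClientConfiguration)
      .withDatabaseName("my_glue_db")
      .withTableName("my_glue_table")
      .withIOType(org.apache.beam.sdk.io.FileIO.class)
      .withIOConfig(... configs for FileIO / JdbcIO / ... )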

Other existing implementations

  1. For AWS S3-backed tables, Spark on AWS EMR supports writing to a Glue table the same way it writes to Hive Metastore tables (see the read-side sketch after this list):
df.write.saveAsTable("glue_db.glue_table")

  2. Trino supports the Glue catalog too - https://trino.io/docs/current/connector/hive.html - in a similar fashion to Spark: as a replacement for the metadata of tables stored in some filesystem.

  3. AWS Glue jobs (AWS's fork of Spark) support other types of storage: Mongo, RDS, etc.

  4. Flink seems to have it as work-in-progress: https://github.com/apache/flink-connector-aws/pull/47
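
For reference, the read side on EMR is just as transparent once the cluster is configured to use the Glue Data Catalog as its Hive metastore (a minimal Java sketch; the database and table names are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// On EMR with the Glue Data Catalog as the metastore, Glue tables resolve
// like ordinary Hive tables.
SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.sql("SELECT * FROM glue_db.glue_table");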

Notes on possible implementation

Beam has an HCatalogIO implementation - https://beam.apache.org/documentation/io/built-in/hcatalog/ - but it does not seem to be a good fit for GlueIO.
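
A dedicated GlueIO could instead resolve the table's location and schema from the catalog and hand the rest off to the configured IO. A minimal sketch of that lookup with the AWS SDK v2 (which Beam's amazon-web-services2 IOs already use); error handling and schema translation are omitted, and the database / table names are illustrative:

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTableRequest;
import software.amazon.awssdk.services.glue.model.Table;

try (GlueClient glue = GlueClient.create()) {
  Table table = glue
      .getTable(GetTableRequest.builder()
          .databaseName("my_glue_db")
          .name("my_glue_table")
          .build())
      .table();
  // The storage descriptor carries what FileIO would need: S3 location,
  // format, and column names / types.
  String location = table.storageDescriptor().location(); // e.g. s3://bucket/prefix/
  table.storageDescriptor().columns()
      .forEach(c -> System.out.println(c.name() + ": " + c.type()));
}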

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

mosche commented 1 year ago

@psolomin Do you have specific use cases in mind? Reading tables from the Glue meta store sounds like a useful integration! I'm not too sure about writing, though; it feels a bit like that would conflict with the purpose of Glue preparing that data itself ... not sure though, I haven't used Glue much.

On the practical side, I'd expect more and more S3-backed tables catalogued in Glue to migrate to Iceberg / Hudi rather than keep using the old Hive format. Beam not having support for these newer table formats might limit the value of such an IO. I've been thinking about working on an IcebergIO for Beam, but unfortunately won't have time for it in the near future.

psolomin commented 1 year ago

@mosche

Do you have specific use cases in mind?

Yeah, let me name some:

Writing is trickier because in that case Beam will need to edit Glue catalog objects.
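
To illustrate: registering a newly written partition in the catalog would look roughly like this with the AWS SDK v2 (the names and partition layout are illustrative):

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.BatchCreatePartitionRequest;
import software.amazon.awssdk.services.glue.model.PartitionInput;
import software.amazon.awssdk.services.glue.model.StorageDescriptor;

// After writing files under s3://bucket/table/dt=2023-04-01/, the partition
// must be registered in the catalog, or query engines will not see the data.
try (GlueClient glue = GlueClient.create()) {
  glue.batchCreatePartition(BatchCreatePartitionRequest.builder()
      .databaseName("my_glue_db")
      .tableName("my_glue_table")
      .partitionInputList(PartitionInput.builder()
          .values("2023-04-01") // one value per partition key, here just dt
          .storageDescriptor(StorageDescriptor.builder()
              .location("s3://bucket/table/dt=2023-04-01/")
              .build())
          .build())
      .build());
}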

Glue preparing that data itself

"Glue" is actually multiple things: catalog (similar to Hive Metastore), Glue jobs (AWS proprietary version of serverless Spark), Glue crawlers, etc. For now this feature request is about Glue catalog only.

Iceberg

That one will be very useful, yes. Trino, Flink and Spark already have Iceberg support. Adding support for Iceberg would be more valuable than my feature request, I would say. And, as I remember, Iceberg can work without a catalog or Hive metastore, using table locations where both data & metadata are stored. Which means Iceberg support does not strictly require catalog support. Still, in Spark I've seen Iceberg used with either HMS or the Glue catalog (select .. from fact.orders ..) far more often than with raw table locations (spark.read.iceberg("s3://abc/fact.db/orders")), so catalog support will become necessary.
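
For reference, the catalog flavour looks roughly like this with Iceberg's own API (a sketch assuming the GlueCatalog from the iceberg-aws module; the warehouse path and table name are illustrative):

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.aws.glue.GlueCatalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Table metadata is resolved through the Glue Data Catalog rather than
// through a hard-coded S3 table location.
GlueCatalog catalog = new GlueCatalog();
catalog.initialize("glue", Map.of("warehouse", "s3://abc/"));
Table orders = catalog.loadTable(TableIdentifier.of("fact", "orders"));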

mosche commented 1 year ago

@psolomin fyi https://github.com/apache/beam/issues/20327

raidancampbell commented 1 month ago

My Java / Maven is a bit rusty, but as far as I can tell, version 1.6.0 of org.apache.iceberg:iceberg-core has support for a catalog_type of Glue. I'm probably being a bit too naive here: would a simple dependency bump give support?