
[Feature Request]: Support read from / write to AWS Glue catalog tables backed by AWS S3 data #26429

Open psolomin opened 1 year ago

psolomin commented 1 year ago

What would you like to happen?

Overview

The Glue catalog is a serverless metadata store (databases, tables, schemas, partitions, connectors, etc.). More: https://docs.aws.amazon.com/glue/latest/dg/glue-connections.html

It would be cool to have Beam support something like:

GlueIO.write()
      .withClientConfiguration(awsClientConfiguration)
      .withDatabaseName("my_glue_db")
      .withTableName("my_glue_table")
      .withIOType(org.apache.beam.sdk.io.FileIO.class)
      .withIOConfig(... configs for FileIO / JdbcIO / ... )
      .withSchemaUpdateStrategy(ADD_NEW_COLUMNS | DISABLED)
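
Reading could mirror that (all names here are, of course, hypothetical - this is the API being proposed, nothing in Beam exists yet):

GlueIO.read()
      .withClientConfiguration(awsClientConfiguration)
      .withDatabaseName("my_glue_db")
      .withTableName("my_glue_table")
      .withIOType(org.apache.beam.sdk.io.FileIO.class)
      .withIOConfig(... configs for FileIO / JdbcIO / ... )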

Other existing implementations

  1. For AWS S3-backed tables, Spark on AWS EMR supports writing to a Glue table the same way it writes to Hive Metastore tables (see the read-side sketch after this list):
df.write.saveAsTable("glue_db.glue_table")

  2. Trino supports the Glue catalog too - https://trino.io/docs/current/connector/hive.html - in a similar fashion to Spark: as a replacement for the metadata of tables stored in some filesystem.

  3. AWS Glue jobs (AWS's fork of Spark) support other types of storage: Mongo, RDS, etc.

  4. Flink seems to have it as work-in-progress: https://github.com/apache/flink-connector-aws/pull/47
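
For reference, the read side on EMR is just as transparent once the cluster is configured to use the Glue Data Catalog as its Hive metastore (a minimal Java sketch; the database and table names are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// On EMR with the Glue Data Catalog as the metastore, Glue tables resolve
// like ordinary Hive tables.
SparkSession spark = SparkSession.builder().enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.sql("SELECT * FROM glue_db.glue_table");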

Notes on possible implementation

Beam has an HCatalogIO implementation - https://beam.apache.org/documentation/io/built-in/hcatalog/ - but it does not seem to be a good fit for GlueIO.
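
A dedicated GlueIO could instead resolve the table's location and schema from the catalog and hand the rest off to the configured IO. A minimal sketch of that lookup with the AWS SDK v2 (which Beam's amazon-web-services2 IOs already use); error handling and schema translation are omitted, and the database / table names are illustrative:

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.GetTableRequest;
import software.amazon.awssdk.services.glue.model.Table;

try (GlueClient glue = GlueClient.create()) {
  Table table = glue
      .getTable(GetTableRequest.builder()
          .databaseName("my_glue_db")
          .name("my_glue_table")
          .build())
      .table();
  // The storage descriptor carries what FileIO would need: S3 location,
  // format, and column names / types.
  String location = table.storageDescriptor().location(); // e.g. s3://bucket/prefix/
  table.storageDescriptor().columns()
      .forEach(c -> System.out.println(c.name() + ": " + c.type()));
}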

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

mosche commented 1 year ago

@psolomin Do you have specific use cases in mind? Reading tables from the Glue meta store sounds like a useful integration! I'm not too sure about writing, though; it feels a bit like that would conflict with the purpose of Glue preparing that data itself ... not sure though, I haven't used Glue much.

On the practical side, I'd expect more and more S3-backed tables catalogued in Glue to migrate to Iceberg / Hudi rather than keep using the old Hive format. Beam not having support for these newer table formats might limit the value of such an IO. I've been thinking about working on an IcebergIO for Beam, but unfortunately won't have time for it in the near future.

psolomin commented 1 year ago

@mosche

Do you have specific use cases in mind?

Yeah, let me name some:

Writing is trickier because in that case Beam will need to edit Glue catalog objects.
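
To illustrate: registering a newly written partition in the catalog would look roughly like this with the AWS SDK v2 (the names and partition layout are illustrative):

import software.amazon.awssdk.services.glue.GlueClient;
import software.amazon.awssdk.services.glue.model.BatchCreatePartitionRequest;
import software.amazon.awssdk.services.glue.model.PartitionInput;
import software.amazon.awssdk.services.glue.model.StorageDescriptor;

// After writing files under s3://bucket/table/dt=2023-04-01/, the partition
// must be registered in the catalog, or query engines will not see the data.
try (GlueClient glue = GlueClient.create()) {
  glue.batchCreatePartition(BatchCreatePartitionRequest.builder()
      .databaseName("my_glue_db")
      .tableName("my_glue_table")
      .partitionInputList(PartitionInput.builder()
          .values("2023-04-01") // one value per partition key, here just dt
          .storageDescriptor(StorageDescriptor.builder()
              .location("s3://bucket/table/dt=2023-04-01/")
              .build())
          .build())
      .build());
}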

Glue preparing that data itself

"Glue" is actually multiple things: catalog (similar to Hive Metastore), Glue jobs (AWS proprietary version of serverless Spark), Glue crawlers, etc. For now this feature request is about Glue catalog only.

Iceberg

That one will be very useful, yes. Trino, Flink and Spark already have Iceberg support. Adding support for Iceberg would be more valuable than my feature request, I would say. And, as I remember, Iceberg can work without a catalog or Hive metastore, using table locations where both data & metadata are stored. Which means Iceberg support does not strictly require catalog support. Still, in Spark I've seen Iceberg used with either HMS or the Glue catalog (select .. from fact.orders ..) far more often than with raw table locations (spark.read.iceberg("s3://abc/fact.db/orders")), so catalog support will become necessary.
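
For reference, the catalog flavour looks roughly like this with Iceberg's own API (a sketch assuming the GlueCatalog from the iceberg-aws module; the warehouse path and table name are illustrative):

import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.aws.glue.GlueCatalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Table metadata is resolved through the Glue Data Catalog rather than
// through a hard-coded S3 table location.
GlueCatalog catalog = new GlueCatalog();
catalog.initialize("glue", Map.of("warehouse", "s3://abc/"));
Table orders = catalog.loadTable(TableIdentifier.of("fact", "orders"));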

mosche commented 1 year ago

@psolomin fyi https://github.com/apache/beam/issues/20327

raidancampbell commented 1 month ago

My Java / Maven is a bit rusty, but as far as I can tell, version 1.6.0 of org.apache.iceberg:iceberg-core has support for a catalog_type of Glue. I'm probably being a bit too naive here: would a simple dependency bump give support?