Netflix / iceberg

Iceberg is a table format for large, slow-moving tabular data
Apache License 2.0

Support for AWS Glue as an alternative Hive metastore implementation #112

Open ryanrupp opened 5 years ago

ryanrupp commented 5 years ago

Similar to the functionality in Presto, I was wondering if Glue could be substituted as an alternative implementation of the Hive metastore. The current HiveTableOperations relies on:

get table
create table
alter table
an exclusive lock
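To make it clearer why all four operations matter, here is a rough sketch of how a commit could combine them: take the exclusive lock, read the table, check that nobody else has moved the metadata pointer, then create or alter the table. This is only an illustration under assumptions. `SketchMetastore` and `SketchTableOperations` are hypothetical in-memory stand-ins, not the Hive or Iceberg classes, and the real HiveTableOperations may differ in its details.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-memory stand-in for a metastore; not the Hive or Iceberg API.
class SketchMetastore {
    private final Map<String, String> metadataLocations = new HashMap<>();
    private final ReentrantLock exclusiveLock = new ReentrantLock();

    // "an exclusive lock"
    void lock() { exclusiveLock.lock(); }
    void unlock() { exclusiveLock.unlock(); }

    // "get table": returns the current metadata location, or null if the table is absent.
    String getTable(String name) {
        return metadataLocations.get(name);
    }

    // "create table": fails if the table already exists.
    void createTable(String name, String location) {
        if (metadataLocations.putIfAbsent(name, location) != null) {
            throw new IllegalStateException("Table already exists: " + name);
        }
    }

    // "alter table": overwrites the metadata location of an existing table.
    void alterTable(String name, String location) {
        if (!metadataLocations.containsKey(name)) {
            throw new IllegalStateException("No such table: " + name);
        }
        metadataLocations.put(name, location);
    }
}

// Hypothetical commit logic: a compare-and-swap of the metadata pointer under the lock.
class SketchTableOperations {
    private final SketchMetastore metastore;
    private final String tableName;

    SketchTableOperations(SketchMetastore metastore, String tableName) {
        this.metastore = metastore;
        this.tableName = tableName;
    }

    // Returns true if the commit was applied, false if another writer got there first.
    boolean commit(String expectedLocation, String newLocation) {
        metastore.lock();
        try {
            String current = metastore.getTable(tableName);
            if (current == null) {
                if (expectedLocation != null) {
                    return false; // caller expected an existing table
                }
                metastore.createTable(tableName, newLocation);
                return true;
            }
            if (!current.equals(expectedLocation)) {
                return false; // stale base metadata, caller should refresh and retry
            }
            metastore.alterTable(tableName, newLocation);
            return true;
        } finally {
            metastore.unlock();
        }
    }
}
```

Without the lock, the check and the alter are two separate calls, so two writers could both pass the check and one update would be silently lost.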

The locking mechanism would be the problematic part, as I don't believe an equivalent API is available in Glue. Possibly there's another approach, or another service could be used for the locking functionality, e.g. DynamoDB.
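As a rough idea of what a lock built on another service might look like, the sketch below uses a conditional "write only if no entry exists" operation, which is the kind of primitive a key-value store such as DynamoDB could provide. Everything here is an assumption for illustration: `ConditionalWriteLock` is a hypothetical name, the store is an in-memory map, and no real DynamoDB calls are made.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for an external key-value store; no real DynamoDB calls are made.
class ConditionalWriteLock {
    // Maps a table name to the id of the writer currently holding its lock.
    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>();

    // Succeeds only if no entry exists yet for the table (a "put if absent" condition).
    boolean tryAcquire(String tableName, String ownerId) {
        return owners.putIfAbsent(tableName, ownerId) == null;
    }

    // Removes the entry only if this owner still holds it; returns whether it did.
    boolean release(String tableName, String ownerId) {
        return owners.remove(tableName, ownerId);
    }
}
```

A real implementation would also need some form of lease or expiry so that a writer that crashes while holding the lock does not block the table forever. That part is left out here.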

rdblue commented 5 years ago

I thought Glue exposed the same Thrift API that Hive uses. If that's the case, then we should be able to use the same lock API and code.

ryanrupp commented 5 years ago

I believe the API is only partially implemented and unfortunately doesn't include the locking mechanisms. Looking into it a bit: when running Spark on EMR, for instance, the HiveMetaStoreClientFactory can be overridden to specify AWSGlueDataCatalogHiveClientFactory (see https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html). The implementation used there supports the basic Hive metastore operations, e.g. create/alter/get table (calling back to the Glue public API), but throws UnsupportedOperationException for the lock method.

So, I was thinking the lock piece could be abstracted out, so that the generic Hive implementation uses the lock method via the Hive metastore while a Glue override uses some other mechanism. I guess at this point it's mainly a limitation of the Glue implementation, but I wanted to toss this out there as a nice-to-have for people not running their own Hive metastore.
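The abstraction described above could be sketched roughly as follows: a small lock interface, one implementation that delegates to the metastore client's own lock call, and one override that uses some other mechanism. All names here (`CommitLock`, `LockingClient`, `MetastoreCommitLock`, `ExternalCommitLock`, `LockUnsupportedClient`) are hypothetical and exist only for this sketch. None of them are Iceberg, Hive, or Glue classes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction over "take an exclusive lock on a table".
interface CommitLock {
    void acquire(String table);
    void release(String table);
}

// Hypothetical minimal view of the lock calls on a metastore client.
interface LockingClient {
    void lock(String table);
    void unlock(String table);
}

// Generic case: use the lock method exposed by the metastore client itself.
class MetastoreCommitLock implements CommitLock {
    private final LockingClient client;

    MetastoreCommitLock(LockingClient client) {
        this.client = client;
    }

    @Override
    public void acquire(String table) { client.lock(table); }

    @Override
    public void release(String table) { client.unlock(table); }
}

// Override case: some other mechanism, represented here by an in-memory set of held locks.
class ExternalCommitLock implements CommitLock {
    private final Set<String> held = ConcurrentHashMap.newKeySet();

    @Override
    public void acquire(String table) {
        if (!held.add(table)) {
            throw new IllegalStateException("Table is already locked: " + table);
        }
    }

    @Override
    public void release(String table) { held.remove(table); }
}

// Mimics the behaviour described in this thread: the lock call is not supported.
class LockUnsupportedClient implements LockingClient {
    @Override
    public void lock(String table) {
        throw new UnsupportedOperationException("lock is not supported");
    }

    @Override
    public void unlock(String table) {
        throw new UnsupportedOperationException("unlock is not supported");
    }
}
```

With this shape, the table operations code would depend only on `CommitLock`, and a Glue-backed setup could plug in `ExternalCommitLock` (or something backed by a real service) instead of `MetastoreCommitLock`.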

ryanrupp commented 5 years ago

The Glue client source is now available for reference (see the announcement). AWSCatalogMetastoreClient implements Hive's IMetaStoreClient and delegates to GlueMetastoreClientDelegate, but only a subset of the functionality is implemented; lock, for instance, just throws an unsupported operation exception.

rdblue commented 5 years ago

I think that Glue should implement locking as required by the interface it exposes. I'd be fine adding a solution specific to Glue in Iceberg as well, but I'm not sure what that would look like. Good to know that Glue won't work though.

teabot commented 5 years ago

"Looking into it a bit when running on Spark EMR for instance"

I believe there is ongoing work to have the HiveMetaStoreClientFactory abstraction contributed to vanilla Apache Hive:

https://issues.apache.org/jira/browse/HIVE-12679
