apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.46k stars 2.23k forks source link

REST Catalog to support custom-catalog name like HMS/Glue #10205

Open osscm opened 6 months ago

osscm commented 6 months ago

Query engine

Trino

Question

We are thinking it would be good feature to support custom catalog name in REST Catalog. Before creating an issue, thought will discuss here, as we might be missing something.

Hive metastore and Glue supports catalog name, which can be used as namespace hierarchy like. <catalog>.<database/schema>.table

Glue catalog's property name is hive.metastore.glue.catalogid and for Hive metastore we have started working on adding that support. issue: https://github.com/trinodb/trino/issues/10287 changes: https://github.com/trinodb/trino/pull/21502

We are internally using custom-catalog with Metastore from time, and working with OSS community to share with the upstream.

Iceberg catalog supports custom namespace, but there is no concept of the catalog-name. I think if we support that as well, it will help to engines to use the existing model and less changes when adopting REST Catalog.

I might be missing something, so please help me to understand it.

jbonofre commented 6 months ago

It's included in the REST Catalog Spec proposal, via a handshake endpoint to provide catalog id.

osscm commented 6 months ago

Thanks @jbonofre Can you please point me in the Spec where it is specified and does REST catalog implementation like Tabular supports it?

jbonofre commented 6 months ago

That's the current proposal/discussion: https://docs.google.com/document/d/1JUtFpdEoa6IAKt1EzJi_re0PUbh56XnfUtRe5WAfl0s/edit?usp=sharing

REST Catalog is not yet a full "spec".

osscm commented 6 months ago

thanks @jbonofre !

osscm commented 6 months ago

cc @flyrain @RussellSpitzer @vgankidi

flyrain commented 6 months ago

The multipart namespace in the REST spec can support a use case like catalog.db.table.

jbonofre commented 6 months ago

@flyrain my understanding of the question is not on the namespace, but more at catalog level. Maybe I'm wrong :)

osscm commented 6 months ago

I also think catalog-name can be a separate entity.

flyrain commented 6 months ago

IIUC, the catalog-name is a concept introduced in Hive 3.0 to support one additional layer on top of database. I will consider the multipart namespace in REST spec is a superset of HMS table identifier format. For example, the rest spec supports a namespace like a.b.c.d.table, which is compatible with the HMS format catalog.db.table. A new concept like catalog-name isn't necessary in that sense.

osscm commented 6 months ago

thanks @flyrain

using namespace to include catalog-name can also work. though then what will be the use of catalog-id

Though IMO, as Spark and Trino using hive's catalog-name and understand it. It would be great, if RESTCatalog can also support it, and otherwise anyways namespace is there to support.

via a handshake endpoint to provide catalog id.

and we are also discussing the similar topic in Trino community as well. and @danielcweeks has talked about iceberg.rest-catalog.warehouse, so can we also use warehouse for the catalog-name

The name of the warehouse for the REST catalog will either be defined by the uri endpoint itself or can be provided by the iceberg.rest-catalog.warehouse property in the [REST catalog configuration](https://trino.io/docs/current/object-storage/metastores.html#rest-catalog). This is up to the backing REST catalog implementation.

https://github.com/trinodb/trino/issues/21813#issuecomment-2093743439

flyrain commented 6 months ago

iceberg.rest-catalog.warehouse normally points to a location like s3://my_bucket/warehouse_location. Not sure Trino community is OK to use it as the catalog name. Even if that works, do we need any change for REST Spec? I think multipart namespace still work well. Trino can concatenate the catalog_name to the rest table identifier in the client side, e.g. catalog_name.db1.table1.

danielcweeks commented 6 months ago

@flyrain This is already spelled out in the REST Spec. The warehouse property is just a value that is passed to a catalog implementation and has no specific requirement that it is a path. This was typically a path with something like HiveCatalog implementation or HadoopCatalog implementation, since that was then forwarded on to the underlying implementation.

flyrain commented 6 months ago

I think we are talking about the same thing here. I don't think there is a need to change the REST spec.

osscm commented 6 months ago

iceberg.rest-catalog.warehouse normally points to a location like s3://my_bucket/warehouse_location. Not sure Trino community is OK to use it as the catalog name. Even if that works, do we need any change for REST Spec? I think multipart namespace still work well. Trino can concatenate the catalog_name to the rest table identifier in the client side, e.g. catalog_name.db1.table1.

Trino right now passes Trino's catalog name (not the HMS catalog name) like this:

https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/rest/TrinoRestCatalog.java#L493

which is added to the session context properties.

SessionCatalog.SessionContext(sessionId, session.getUser(), credentials, properties, session.getIdentity());

@flyrain do you know?

is it to pass the trinoCatalog as header to the RESTCatalog service?

extracting header from the session context. https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/RESTSessionCatalog.java#L303

danielcweeks commented 6 months ago

@osscm The information passed to the the REST server as part of the properties is contextual information and is only sent for user sessions. I wouldn't use this for resolving tables in any way.

Trino supports three part identifiers, the first of which is the catalog name from Trino's perspective. I feel like the best way to approach this is to use the warehouse property to identify what HMS catalog name is being addressed and just name the trino catalog and the hms catalog consistently.

The way this was handled prior to HMS supporting catalog names was to just use different URIs with different HMS servers. Warehouse allows you to differentiate the catalog being addressed and decoupling it from Trino configuration.

I don't feel like there's anything that should change either in Trino or the REST spec since there is a path to support this.

flyrain commented 6 months ago

I agreed with @danielcweeks. The Trino client can either pass the catalog name as a part of url(e.g. , $ENDPOINT/v1/prefix/namespaces/hms_catalog%1Fdb_name/tables/tablename), or pass the catalog name as an extra header, which we can sync-up internally.

ajantha-bhat commented 6 months ago

I feel like the best way to approach this is to use the warehouse property to identify what HMS catalog name is being addressed

It is so confusing (feels like a hacky solution) to use warehouse keyword to identify the catalog name. Warehouse has a different meaning for catalogs, using it as an identifier makes things confusing. Why can't we have a new property catalog-name?

jbonofre commented 6 months ago

For the context, the handshake endpoint of the "new" REST proposal can provide the catalog name, that's the kind of "configuration exchange". I agree that warehouse is confusing and not obvious to be used for catalog name 😄

ajantha-bhat commented 6 months ago

@jbonofre: I don't see much progress in terms of "new proposal" (no response from AWS guys), what are the next plans on that?

jbonofre commented 6 months ago

@ajantha-bhat i was waiting new review from AWS guys. I will move forward by creating a branch with jaxrs/open api annotated interfaces to illustrate the new proposal.

danielcweeks commented 6 months ago

Warehouse has a different meaning for catalogs . . .

There is no agreed upon standard for the naming of parts of an identifier, so while some people will find confusing, others will not. The term warehouse was taken from hive prior to their addition of a catalog concept. We can explore supporting alternatives, but I don't feel like making a change is necessary to address the issue proposed.

jbonofre commented 6 months ago

I agree with @danielcweeks here with the current REST API. I think it makes sense to include in the new REST Spec, but no need to be addressed with the current one.

ajantha-bhat commented 6 months ago

We can explore supporting alternatives, but I don't feel like making a change is necessary to address the issue proposed.

Agree.

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.