apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[FEATURE] Support TableSchema Catalog to manage table schema (schema registry) #5230

Open coolderli opened 3 days ago

coolderli commented 3 days ago

Describe the feature

For Kafka topics and Filesets, we may need a table schema to deserialize the data. Gravitino could manage these schemas, including those held in an external schema registry.

Motivation

Describe the solution

We can introduce a TableSchemaCatalog to manage the TableSchema.

We can bind a table schema, identified as catalog.schema.table-schema, to a topic or fileset when needed, so that the table schema can be fetched from the external schema registry. We can also add a schema registry managed by Gravitino, so that the table schema can be saved directly to the Gravitino metastore.
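To make the binding idea concrete, here is a minimal in-memory sketch in Python. It is not Gravitino code: the class names `TableSchema` and `TableSchemaCatalog`, the methods `register`, `bind`, and `schema_for`, and the example identifiers are all hypothetical, chosen only to illustrate how a topic or fileset could point at a three-part table schema name.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TableSchema:
    """Hypothetical table schema entry (not an existing Gravitino type)."""
    identifier: str          # three-part name, e.g. "catalog.schema.table-schema"
    columns: Dict[str, str]  # column name -> type name


@dataclass
class TableSchemaCatalog:
    """Hypothetical in-memory stand-in for the proposed TableSchemaCatalog."""
    schemas: Dict[str, TableSchema] = field(default_factory=dict)
    bindings: Dict[str, str] = field(default_factory=dict)  # resource -> schema identifier

    def register(self, schema: TableSchema) -> None:
        # Store the schema under its three-part identifier.
        self.schemas[schema.identifier] = schema

    def bind(self, resource: str, schema_identifier: str) -> None:
        # Bind a topic or fileset to an already registered table schema.
        if schema_identifier not in self.schemas:
            raise KeyError(f"unknown table schema: {schema_identifier}")
        self.bindings[resource] = schema_identifier

    def schema_for(self, resource: str) -> TableSchema:
        # Look up the schema bound to the given topic or fileset.
        return self.schemas[self.bindings[resource]]


# Assumed example names; nothing here comes from a real deployment.
catalog = TableSchemaCatalog()
catalog.register(
    TableSchema(
        identifier="catalog.schema.orders-schema",
        columns={"order_id": "long", "amount": "double"},
    )
)
catalog.bind("kafka_catalog.default.orders", "catalog.schema.orders-schema")
orders_schema = catalog.schema_for("kafka_catalog.default.orders")
```

A consumer of the `orders` topic would then use `orders_schema.columns` to pick a deserializer. Whether the binding lives on the topic, on the schema, or in a separate mapping is exactly the kind of design question this issue leaves open.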


Additional context

No response

coolderli commented 3 days ago

@jerryshao @shaofengshi @caican00 @xloya @lw-yang What do you think? Any other thoughts about this?

xloya commented 2 days ago

I think it is a good idea to manage Schema as a resource at the same level as Table/Fileset/Messaging. In this way, we can distinguish between a Managed Schema (data types are defined by Gravitino) and an External Schema (data types come from an existing external Schema Registry or another system). Then, for resources that require a specific Schema (such as some Filesets), we can bind a Schema to the resource. When obtaining Fileset metadata, we would also obtain the corresponding Schema and use it in clients.
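A rough sketch of the managed/external distinction, assuming a hypothetical model in Python: `ManagedSchema` stores its columns itself, while `ExternalSchema` delegates to a lookup function that stands in for an external schema registry client. None of these names (`ManagedSchema`, `ExternalSchema`, `Fileset`, `load_fileset_metadata`) exist in Gravitino; they only illustrate returning the bound schema together with the Fileset metadata.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Union

Columns = Dict[str, str]  # column name -> type name


@dataclass
class ManagedSchema:
    """Hypothetical schema whose columns are stored by the catalog itself."""
    columns: Columns

    def resolve(self) -> Columns:
        return dict(self.columns)


@dataclass
class ExternalSchema:
    """Hypothetical schema resolved through an external registry lookup."""
    subject: str
    fetch: Callable[[str], Columns]  # stand-in for a registry client call

    def resolve(self) -> Columns:
        return self.fetch(self.subject)


@dataclass
class Fileset:
    name: str
    location: str
    schema: Optional[Union[ManagedSchema, ExternalSchema]] = None


def load_fileset_metadata(fileset: Fileset) -> dict:
    # Return the fileset metadata together with its bound schema, if any.
    return {
        "name": fileset.name,
        "location": fileset.location,
        "schema": fileset.schema.resolve() if fileset.schema else None,
    }


# A plain dict plays the role of the external registry in this sketch.
fake_registry = {"events-value": {"event_id": "string", "ts": "timestamp"}}

managed = Fileset("logs", "/data/logs", ManagedSchema({"line": "string"}))
external = Fileset(
    "events", "/data/events", ExternalSchema("events-value", fake_registry.__getitem__)
)
unbound = Fileset("raw", "/data/raw")

managed_meta = load_fileset_metadata(managed)
external_meta = load_fileset_metadata(external)
unbound_meta = load_fileset_metadata(unbound)
```

In this sketch the client sees the same `schema` field regardless of where the schema is stored, which is the main benefit of treating both kinds as one resource type.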

jerryshao commented 2 days ago

It's a bit strange for "schema" to be an entity. Theoretically, an entity maps to a data object, whereas a "schema" is something bound to an entity. We should think more about how to support this scenario.