datastrato / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://datastrato.ai/docs/
Apache License 2.0
609 stars 193 forks source link

[FEATURE] Support Relational Entity Store #1811

Closed xloya closed 1 month ago

xloya commented 4 months ago

Describe the feature

For some users, there may not be a suitable kv store for storing metadata. Should we consider supporting the relational entity store implementation, so that the user can use mysql or other relation database systems as the backend? FYI @jerryshao @coolderli @yuqi1129

jerryshao commented 4 months ago

Yes, it can be supported. Can you please draft a design doc about this, so that we can discuss on this?

xloya commented 4 months ago

Yes, it can be supported. Can you please draft a design doc about this, so that we can discuss on this?

Okay, I'll try to take the doc out this week

coolderli commented 4 months ago

@xloya @jerryshao

Implementing a relation model is a litter complex. I think we should keep the interface com.datastrato.gravitino.EntityStore not changed. For now, I think we can use type string, key string, value string instead of the key- value model.

In key-value mode, we store entities like this;

key                                                                    value
ta_{ml_id}_{ca_id}_{sc_id}_{table_id}    -----    table_info

In relational mode, we can store entities like this:

Type      key                                                       value
ta        {ml_id}_{ca_id}_{sc_id}_{table_id}   ----- table_info

The ListTable can be translated to a SQL like:

select value from xx where type = 'ta' and key like '{ml_id}_{ca_id}_{sc_id}%'

We can create a unique index on the type and key for better performance. What do you think about this?

xloya commented 4 months ago

@xloya @jerryshao

Implementing a relation model is a litter complex. I think we should keep the interface com.datastrato.gravitino.EntityStore not changed. For now, I think we can use type string, key string, value string instead of the key- value model.

In key-value mode, we store entities like this;

key                                                                    value
ta_{ml_id}_{ca_id}_{sc_id}_{table_id}    -----    table_info

In relational mode, we can store entities like this:

Type      key                                                       value
ta        {ml_id}_{ca_id}_{sc_id}_{table_id}   ----- table_info

The ListTable can be translated to a SQL like:

select value from xx where type = 'ta' and key like '{ml_id}_{ca_id}_{sc_id}%'

We can create a unique index on the type and key for better performance. What do you think about this?

I'm not sure if using a relational database to emulate a KV store is a good practice. There are currently no examples of such use in a production environment. I may need to look at other's opinions. In addition, here is the doc that I simply compiled to support relational entity store: https://docs.google.com/document/d/1egaCTa5MvF-6seKQ-mxoGAS-0tFPzvyBeCt1Wq6bCl8/edit. I personally think that using a relational database to store relational data may be more in line with the meaning of components, and the performance may be better than fuzzy query.

jerryshao commented 4 months ago

We don't have to emulate the behavior of kv storage. The design above is all about optimizing the key for kv storage. If you're using RDBMS, you can design a relational schema, don't need to stick to kv's storage layout.

shaofengshi commented 4 months ago

In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.

shaofengshi commented 4 months ago

In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.

We know that's not the right case to use a SQL database, but that's a compromise among the complexity, ease of maintainence and others. (the metadata of Kylin is complicated, many types, and each type may have many properties, and may change from version to version)

coolderli commented 3 months ago

In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.

We know that's not the right case to use a SQL database, but that's a compromise among the complexity, ease of maintainence and others. (the metadata of Kylin is complicated, many types, and each type may have many properties, and may change from version to version)

@jerryshao @justinmclean @yuqi1129 @xloya @YxAc @zhoukangcn

I think the suggestion from @shaofengshi is very valuable. There will be more entity types in the future. If we use a relation structure we have to modify the schema of entity store when changes occur. Another issue is that it is difficult to migrate from a relational structure to a key-value structure.

Before proceeding further with this work, I would like to hear your thoughts. Thanks.

jerryshao commented 3 months ago

I think we can bring this out to discussion in tomorrow's meeting. Both designs have pros and cons, maybe we can negotiate to find out a balance point.

xloya commented 3 months ago

In my opinion, the issues Kylin as an OLAP engine may face with metadata management are similar, but not necessarily identical, to metadata management systems such as Gravitino, Metacat, Hive Metastore, and Amoro. Internally, we use MySQL as the relational backend of the metadata management system, which has been used very maturely in the Hive Metastore production environment and Metacat. I think the most criticized thing about HMS is the impact of excessive transaction usage on read performance, not the issue of Schema changes. Regarding NetEase's Amoro project, I saw that they also use MySQL as a relational database. There doesn't seem to be any issue stating that they want to use MySQL as the KV store.

jerryshao commented 3 months ago

From my point, I think it is not the point that we use RDBMS or not as a metadata storage, the point is how do we design a SCHEMA that fits RDBMS well, shall we normalize all the fields, that's the thing we should discuss.

YxAc commented 3 months ago

It is true there are practices that use MySQL as KV, and in metadata scenarios, there are several reasons why I prefer to use relational tables:

1) HMS's large scale production practice conveys confidence. Futhermore, Ali Cloud DLF, the unified metadata underlying layer also uses relational databases.

2) The main concern for using KV database is that it is difficult to partition a hot spot of a large table. Using relational table with sharding can solve this problem to a certain extent.

3) No need to migrate to KV database in the future, the complexity of KV maintenance is high, and there is low-cost both in terms of high availability and maintenance for MySQL.

xloya commented 3 months ago

Post the design doc of the table schema to here after online discussion on the multi-version issue of the relational model: https://docs.google.com/document/d/1vSicaybSJc7wHq6sAk4JR0eBy_PoO8HORSOihF4Ui90