Closed xloya closed 6 months ago
Yes, it can be supported. Can you please draft a design doc about this, so that we can discuss on this?
Yes, it can be supported. Can you please draft a design doc about this, so that we can discuss on this?
Okay, I'll try to take the doc out this week
@xloya @jerryshao
Implementing a relation model is a litter complex. I think we should keep the interface com.datastrato.gravitino.EntityStore
not changed.
For now, I think we can use type string, key string, value string
instead of the key- value
model.
In key-value mode, we store entities like this;
key value
ta_{ml_id}_{ca_id}_{sc_id}_{table_id} ----- table_info
In relational mode, we can store entities like this:
Type key value
ta {ml_id}_{ca_id}_{sc_id}_{table_id} ----- table_info
The ListTable
can be translated to a SQL like:
select value from xx where type = 'ta' and key like '{ml_id}_{ca_id}_{sc_id}%'
We can create a unique index on the type
and key
for better performance.
What do you think about this?
@xloya @jerryshao
Implementing a relation model is a litter complex. I think we should keep the interface
com.datastrato.gravitino.EntityStore
not changed. For now, I think we can usetype string, key string, value string
instead of thekey- value
model.In key-value mode, we store entities like this;
key value ta_{ml_id}_{ca_id}_{sc_id}_{table_id} ----- table_info
In relational mode, we can store entities like this:
Type key value ta {ml_id}_{ca_id}_{sc_id}_{table_id} ----- table_info
The
ListTable
can be translated to a SQL like:select value from xx where type = 'ta' and key like '{ml_id}_{ca_id}_{sc_id}%'
We can create a unique index on the
type
andkey
for better performance. What do you think about this?
I'm not sure if using a relational database to emulate a KV store is a good practice. There are currently no examples of such use in a production environment. I may need to look at other's opinions. In addition, here is the doc that I simply compiled to support relational entity store: https://docs.google.com/document/d/1egaCTa5MvF-6seKQ-mxoGAS-0tFPzvyBeCt1Wq6bCl8/edit. I personally think that using a relational database to store relational data may be more in line with the meaning of components, and the performance may be better than fuzzy query.
We don't have to emulate the behavior of kv storage. The design above is all about optimizing the key for kv storage. If you're using RDBMS, you can design a relational schema, don't need to stick to kv's storage layout.
In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.
In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.
We know that's not the right case to use a SQL database, but that's a compromise among the complexity, ease of maintainence and others. (the metadata of Kylin is complicated, many types, and each type may have many properties, and may change from version to version)
In Apache Kylin early versions, it uses HBase to store the metadata, which is a key-value structure. Later, we have to support MySQL and other relational database as the metadata storage as many users don't want to deploy HBase. After doing some investigation, we came out a simple design, which is a string typed "key" column, with a binary typed "value" column. It has worked for several years.
We know that's not the right case to use a SQL database, but that's a compromise among the complexity, ease of maintainence and others. (the metadata of Kylin is complicated, many types, and each type may have many properties, and may change from version to version)
@jerryshao @justinmclean @yuqi1129 @xloya @YxAc @zhoukangcn
I think the suggestion from @shaofengshi is very valuable. There will be more entity types in the future. If we use a relation structure we have to modify the schema of entity store when changes occur. Another issue is that it is difficult to migrate from a relational structure to a key-value structure.
Before proceeding further with this work, I would like to hear your thoughts. Thanks.
I think we can bring this out to discussion in tomorrow's meeting. Both designs have pros and cons, maybe we can negotiate to find out a balance point.
In my opinion, the issues Kylin as an OLAP engine may face with metadata management are similar, but not necessarily identical, to metadata management systems such as Gravitino, Metacat, Hive Metastore, and Amoro. Internally, we use MySQL as the relational backend of the metadata management system, which has been used very maturely in the Hive Metastore production environment and Metacat. I think the most criticized thing about HMS is the impact of excessive transaction usage on read performance, not the issue of Schema changes. Regarding NetEase's Amoro project, I saw that they also use MySQL as a relational database. There doesn't seem to be any issue stating that they want to use MySQL as the KV store.
From my point, I think it is not the point that we use RDBMS or not as a metadata storage, the point is how do we design a SCHEMA that fits RDBMS well, shall we normalize all the fields, that's the thing we should discuss.
It is true there are practices that use MySQL as KV, and in metadata scenarios, there are several reasons why I prefer to use relational tables:
1) HMS's large scale production practice conveys confidence. Futhermore, Ali Cloud DLF, the unified metadata underlying layer also uses relational databases.
2) The main concern for using KV database is that it is difficult to partition a hot spot of a large table. Using relational table with sharding can solve this problem to a certain extent.
3) No need to migrate to KV database in the future, the complexity of KV maintenance is high, and there is low-cost both in terms of high availability and maintenance for MySQL.
Post the design doc of the table schema to here after online discussion on the multi-version issue of the relational model: https://docs.google.com/document/d/1vSicaybSJc7wHq6sAk4JR0eBy_PoO8HORSOihF4Ui90
Describe the feature
For some users, there may not be a suitable kv store for storing metadata. Should we consider supporting the relational entity store implementation, so that the user can use mysql or other relation database systems as the backend? FYI @jerryshao @coolderli @yuqi1129