apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[Improvement] clarify the semantics of drop metalake and catalog #3660

Closed mchades closed 1 week ago

mchades commented 4 months ago

What would you like to be improved?

Metalakes and catalogs are completely managed by Gravitino, but some of their drop behaviors need to be clarified:

How should we improve?

Answer the questions above.

mchades commented 4 months ago

I suggest clarifying the drop semantics of catalog and metalake as follows:

cc @jerryshao @shaofengshi @FANNG1

jerryshao commented 4 months ago

Hi @mchades, I think there are some points we need to think about:

  1. A metalake doesn't only include catalogs, but also system catalogs, like users, roles, tags, and metrics, so don't only think about catalogs. When deleting a metalake, do we need to delete all of this information?
  2. Metalake is a tenant-level concept that is very important to an organization, so do we need a hard delete or a soft unbinding? This should be thought through.
  3. For catalogs, do we need to support cascading deletion, or should we only allow deleting a catalog once no schema exists in it?
  4. Do we need to provide unbinding semantics instead of deletion semantics for catalogs and metalakes?
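
To make point 3 concrete, here is a minimal Python sketch of the two candidate semantics: a restricted drop that refuses a non-empty catalog, and a cascading drop. The in-memory `store` dict, the `Catalog` class, the `cascade` flag, and `NonEmptyCatalogError` are all hypothetical names used only for illustration; this is not Gravitino's API.

```python
class NonEmptyCatalogError(Exception):
    """Raised when a restricted drop hits a catalog that still has schemas."""


class Catalog:
    def __init__(self, name, schemas=None):
        self.name = name
        self.schemas = list(schemas or [])


def drop_catalog(store, name, cascade=False):
    """Drop a catalog from a hypothetical in-memory store (dict: name -> Catalog).

    cascade=False: only an empty catalog may be dropped.
    cascade=True:  the catalog and all of its schemas are dropped together.
    """
    catalog = store.get(name)
    if catalog is None:
        return False  # assumption: a missing catalog is reported, not raised
    if catalog.schemas and not cascade:
        raise NonEmptyCatalogError(
            f"catalog {name!r} still has {len(catalog.schemas)} schema(s)"
        )
    # Removing the catalog object also discards the schemas it holds.
    del store[name]
    return True
```

With `cascade=False` the caller has to clean up schemas first; with `cascade=True` a single call removes everything the store knows about the catalog.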
mchades commented 4 months ago

@jerryshao Thanks for your points!

Based on the above four points, I propose the following new drop rules for metalake:

  1. Add an in-use property to metalake with the default value of true.
  2. Only metalakes with in-use=false can be dropped.
  3. When a metalake is dropped, its associated sub-entities, such as catalogs, users, roles, tags, and metrics, will be dropped with it. (Note: we don't need a cascade option here because the in-use property serves the same purpose.)
  4. When in-use=false, all operations on the associated sub-entities of this metalake are rejected.
  5. Return false if the metalake does not exist.
  6. Return true if the drop succeeds.
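
The six rules above can be sketched roughly as follows. This is only an illustration under assumptions: `Metalake`, `MetalakeStore`, `add_sub_entity`, and both exception classes are hypothetical names, and the store is a plain in-memory dict rather than Gravitino's entity store.

```python
class MetalakeInUseError(Exception):
    """Raised when a drop is attempted while in-use is still true."""


class MetalakeDisabledError(Exception):
    """Raised when a sub-entity operation targets a metalake with in-use=false."""


class Metalake:
    def __init__(self, name):
        self.name = name
        self.in_use = True       # rule 1: defaults to true
        self.sub_entities = {}   # kind (catalog, user, role, ...) -> set of names

    def add_sub_entity(self, kind, entity_name):
        # rule 4: reject sub-entity operations once in-use=false
        if not self.in_use:
            raise MetalakeDisabledError(f"metalake {self.name!r} is not in use")
        self.sub_entities.setdefault(kind, set()).add(entity_name)


class MetalakeStore:
    def __init__(self):
        self.metalakes = {}

    def create(self, name):
        metalake = Metalake(name)
        self.metalakes[name] = metalake
        return metalake

    def drop(self, name):
        metalake = self.metalakes.get(name)
        if metalake is None:
            return False                      # rule 5
        if metalake.in_use:
            raise MetalakeInUseError(name)    # rule 2
        metalake.sub_entities.clear()         # rule 3: sub-entities go with it
        del self.metalakes[name]
        return True                           # rule 6
```

A caller would first set `in_use = False` (subject to whatever privilege check applies) and only then call `drop`, which is why no separate cascade flag appears in the sketch.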

For dropping catalog:

  1. Also add an in-use property to catalog with the default value of true.
  2. Only catalogs with in-use=false can be dropped.
  3. When a catalog is dropped, only its associated sub-entities in the Gravitino store, such as schemas and tables, will be dropped with it. (For example, when dropping a Hive catalog with in-use=false, its sub-entities will also be dropped, but the metadata in HMS won't.)
    • Why not drop external metadata? Because I think when we create a catalog, we just establish a connection (which can also be understood as a mapping relationship) between the external service and Gravitino, so when deleting, we just need to cut off this connection (or remove this mapping relationship).
  4. When in-use=false, all operations on the associated sub-entities of this catalog are rejected.
  5. Return false if the catalog does not exist.
  6. Return true if the drop succeeds.
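
A small sketch of rule 3 for catalogs, under the same caveats: `ExternalMetastore` is a stand-in for a service such as HMS, and `GravitinoStore`, `register_catalog`, and `set_in_use` are hypothetical names, not real Gravitino classes. The point it illustrates is that dropping the catalog removes only the mapping and the entities recorded on the Gravitino side.

```python
class CatalogInUseError(Exception):
    """Raised when a drop is attempted while in-use is still true."""


class ExternalMetastore:
    """Stand-in for an external service (for example HMS) that owns the real metadata."""

    def __init__(self, tables):
        self.tables = set(tables)


class GravitinoStore:
    def __init__(self):
        self.catalogs = {}  # catalog name -> mapping record

    def register_catalog(self, name, external):
        # Creating a catalog only records a mapping to the external service,
        # plus Gravitino-side copies of the entity names.
        self.catalogs[name] = {
            "in_use": True,
            "external": external,
            "entities": set(external.tables),
        }

    def set_in_use(self, name, in_use):
        self.catalogs[name]["in_use"] = in_use

    def drop_catalog(self, name):
        record = self.catalogs.get(name)
        if record is None:
            return False
        if record["in_use"]:
            raise CatalogInUseError(name)
        # Remove the mapping and the Gravitino-side entities only;
        # nothing is ever called on record["external"].
        del self.catalogs[name]
        return True
```

After `drop_catalog` returns true, the `ExternalMetastore` instance still holds every table it had before, which mirrors the "cut off the connection" reading above.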
jerryshao commented 4 months ago

Let me think a bit on this.

jerryshao commented 4 months ago

I have several questions:

  1. Is this in-use property set manually by the user?
  2. Is the default drop behavior a cascading drop?
  3. What about managed catalogs like the Hadoop catalog: are we going to delete everything when the catalog is dropped?

Can you please investigate the behavior of Unity Catalog? Unity Catalog has similar concepts: its metastore corresponds to our metalake, and its catalog maps to our catalog. Besides, you should also check Starburst's Gravity.

mchades commented 3 months ago

  1. Is this in-use property set manually by the user?

Yes, and the privilege system should determine who can set this value.

  2. Is the default drop behavior a cascading drop?

Yes. Once in-use=false, the drop is cascading, because I can't imagine a scenario where we need to delete a metalake or catalog and still keep its sub-entities.

  3. What about managed catalogs like the Hadoop catalog: are we going to delete everything when the catalog is dropped?

The behavior is the same as for other catalogs, but note that we will only delete the data in the Gravitino store, not the data in Hadoop.

Can you please investigate the behavior of Unity Catalog? Unity Catalog has similar concepts: its metastore corresponds to our metalake, and its catalog maps to our catalog. Besides, you should also check Starburst's Gravity.

See the investigation. The key conclusion is that when a catalog (or metalake) is deleted, external service data is not deleted.

mchades commented 1 week ago

The design is finished.