Closed liangjun-jiang closed 4 years ago
In GMA, we support soft-deletion of entities by setting the `removed` flag in the `Status` aspect to true. Looks like we do have the `Status` aspect as part of `DatasetAspect`, so simply updating this aspect should serve the purpose of marking a dataset as removed, but the record will still stay in the DB. Are you looking for hard-deletion of datasets instead?
In addition to what @jywadhwani said, it is worth noting some of the historical reasons why we soft delete. @mars-lan probably has the most history here, but afaik we too often had people asking us to undelete metadata, which is quite difficult if it was hard deleted. Undeleting something that is soft deleted, however, is as simple as flipping that flag back to false :)
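To make the flag flipping concrete, here is a minimal sketch in Python. It is not the GMA API: the `Status` dataclass, the `InMemoryAspectStore` class, and the example URN are all hypothetical stand-ins, and only the `removed` flag mentioned above is modeled.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class Status:
    """Hypothetical stand-in for the Status aspect; only `removed` is modeled."""
    removed: bool = False


class InMemoryAspectStore:
    """Toy store mapping an entity URN to its latest Status aspect (an assumption)."""

    def __init__(self) -> None:
        self._status: Dict[str, Status] = {}

    def get_status(self, urn: str) -> Status:
        # Entities without an explicit Status are treated as not removed.
        return self._status.get(urn, Status())

    def set_removed(self, urn: str, removed: bool) -> Status:
        # Build a new Status value instead of mutating the old one.
        updated = replace(self.get_status(urn), removed=removed)
        self._status[urn] = updated
        return updated


store = InMemoryAspectStore()
urn = "urn:li:dataset:example"  # hypothetical URN, for illustration only

store.set_removed(urn, True)    # soft delete: the record stays in the store
soft_deleted = store.get_status(urn).removed

store.set_removed(urn, False)   # undelete: flip the same flag back
restored = not store.get_status(urn).removed
```

Nothing is ever dropped from the store here, which is why undeleting is a single write rather than a restore from backup.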
I don't think we really want to expose hard deletion via an API. It would be too easy to accidentally hard delete rather than soft delete, and the distinction may not be clear to users of that API. At LinkedIn we have quite a few teams putting metadata on DataHub, and we want to scale that even more, so making our APIs safe for them is probably best. If we really need to hard delete something for some reason, we can just go into the DB / Elasticsearch / Neo4j ourselves and delete it manually.
I think that the OS DataHub UI may show removed Datasets in search results still; I think we recently made a change to stop that behavior, fyi.
Another advantage of soft deletion is that direct links to entities will always work. Say you create a dataset and put the DataHub link in some docs. Years later the doc is outdated and the dataset is deleted. The link will still work, but the UI will just show a big "REMOVED" tag on it.
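The search and direct-link behavior described above could be sketched roughly as follows. This is an illustration only, not the DataHub UI or search code: `Dataset`, `search`, and `render_page` are hypothetical names, and the substring match is a placeholder for a real search index.

```python
from typing import Dict, List, NamedTuple


class Dataset(NamedTuple):
    """Hypothetical, simplified view of a dataset entity."""
    urn: str
    name: str
    removed: bool


def search(datasets: Dict[str, Dataset], query: str) -> List[Dataset]:
    # Search hides soft-deleted datasets from the results.
    return [d for d in datasets.values()
            if query in d.name and not d.removed]


def render_page(datasets: Dict[str, Dataset], urn: str) -> str:
    # A direct link still resolves; removed entities just get a tag.
    dataset = datasets[urn]
    tag = "[REMOVED] " if dataset.removed else ""
    return f"{tag}{dataset.name}"


catalog = {
    "urn:a": Dataset("urn:a", "page_views", removed=False),
    "urn:b": Dataset("urn:b", "page_views_old", removed=True),
}

visible = search(catalog, "page_views")   # only the live dataset
old_page = render_page(catalog, "urn:b")  # still renders, with a tag
```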
@jywadhwani & @jplaisted have summarized it well. We purposely made all metadata aspects immutable and only soft delete entities. The idea is both to allow undoing accidental deletions and to keep a long-running audit trail. At one point we did think about implementing automatic garbage collection of entities that have been soft deleted for a long time, along with their aspects, to keep storage in check. However, as we plan to introduce support for NoSQL storage backends (e.g. MongoDB, Cassandra) in the near future, the need for GC becomes less pressing given their horizontal scalability.
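One way to picture immutable aspects with an audit trail is an append-only log of versions, where every write adds a new version and nothing is overwritten. The sketch below assumes that model; `AspectVersion`, `VersionedStatusLog`, and the `actor` field are hypothetical and are not taken from the GMA codebase.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AspectVersion:
    """One immutable version of a (hypothetical) Status aspect."""
    version: int
    removed: bool
    actor: str  # who made the change; an assumed audit field


class VersionedStatusLog:
    """Append-only log: writes add versions, old versions are never edited."""

    def __init__(self) -> None:
        self._log: Dict[str, List[AspectVersion]] = {}

    def write(self, urn: str, removed: bool, actor: str) -> AspectVersion:
        versions = self._log.setdefault(urn, [])
        entry = AspectVersion(len(versions), removed, actor)
        versions.append(entry)
        return entry

    def latest(self, urn: str) -> Optional[AspectVersion]:
        versions = self._log.get(urn)
        return versions[-1] if versions else None

    def history(self, urn: str) -> Tuple[AspectVersion, ...]:
        # Return a tuple so callers cannot mutate the trail.
        return tuple(self._log.get(urn, []))


log = VersionedStatusLog()
log.write("urn:a", removed=False, actor="alice")  # created
log.write("urn:a", removed=True, actor="bob")     # accidental soft delete
log.write("urn:a", removed=False, actor="alice")  # undo
trail = log.history("urn:a")
```

The undo is just another version on top, so the trail still records who removed the entity and when it came back.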
We should definitely capture this information somewhere in the doc though as others will probably start wondering the same in the future.
I guess it's sort of covered in https://github.com/linkedin/datahub/blob/master/docs/what/entity.md#what-is-an-entity, but without mentioning the `Status` aspect as the source of the `removed` flag or the rationale behind it.
Added a section here: https://github.com/linkedin/datahub/blob/master/docs/what/entity.md#how-to-delete-an-entity Hopefully this clears things up for you, @liangjun-jiang.
Is there any way we can delete our datasets easily? My understanding is that we don't have a set of deletion APIs that can delete records from the MySQL database, Elasticsearch, and Neo4j all together?