datahub-project / datahub

The Metadata Platform for your Data and AI Stack
https://datahubproject.io
Apache License 2.0

Deletion APIs #1761

Closed liangjun-jiang closed 4 years ago

liangjun-jiang commented 4 years ago

Is there any way we can delete our datasets easily? My understanding is that we don't have a set of deletion APIs that delete records from the MySQL database, Elasticsearch, and Neo4j all together. Is that right?

jywadhwani commented 4 years ago

In GMA, we support soft deletion of entities by setting the removed flag in the Status aspect to true. It looks like we do have the Status aspect as part of DatasetAspect, so simply updating this aspect should serve the purpose of marking a dataset as removed, but the record will still stay in the DB. Are you looking for hard deletion of datasets instead?
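To illustrate the idea, here is a minimal Python sketch of soft deletion. It is not the GMA API: `Status`, `InMemoryStore`, and the example URN are hypothetical stand-ins, and the only thing carried over from the discussion is the `removed` flag on a Status aspect.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Status:
    """Hypothetical stand-in for the Status aspect; only `removed` is modeled."""
    removed: bool = False


@dataclass
class InMemoryStore:
    """Hypothetical store mapping a dataset URN to its Status aspect."""
    statuses: Dict[str, Status] = field(default_factory=dict)

    def upsert_status(self, urn: str, status: Status) -> None:
        self.statuses[urn] = status

    def soft_delete(self, urn: str) -> None:
        # Soft deletion only updates the aspect; the record itself is kept.
        self.upsert_status(urn, Status(removed=True))

    def is_removed(self, urn: str) -> bool:
        return self.statuses[urn].removed


store = InMemoryStore()
urn = "urn:li:dataset:example"  # illustrative URN, not a real dataset
store.upsert_status(urn, Status())
store.soft_delete(urn)
```

After `soft_delete`, the URN is still present in `store.statuses`; only the flag has changed, which is the difference from a hard delete.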

jplaisted commented 4 years ago

In addition to what @jywadhwani said, it is worth noting some of the historical reasons why we soft delete. @mars-lan probably has the most history here, but afaik people would too often ask us to undelete metadata, which is quite difficult if it has been hard deleted. However, undeleting something that is soft deleted is as simple as flipping that flag back to false :)

I don't think we really want to expose hard deletion via an API. It would be too easy to accidentally hard delete rather than soft delete, and the distinction may not be clear to users of that API. At LinkedIn we have quite a few teams putting metadata on DataHub, and we want to scale it even more, so making our APIs safe for them is probably best. If we really need to hard delete something for some reason, we can just go into the DB / Elasticsearch / Neo4j ourselves and delete it manually.

I think the OS DataHub UI may still show removed datasets in search results, though I believe we recently made a change to stop that behavior, fyi.

Another advantage of soft deletion is that direct links to entities will always work. Say you make a dataset and provide the DataHub link in some docs. Years later the doc is outdated and the dataset has been deleted. The link will still work, but the UI will just show a big "REMOVED" tag on it.
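A rough Python sketch of that behavior, assuming a hypothetical in-memory index (`ENTITIES`, `search`, `resolve`, and `render_title` are made up for illustration and are not DataHub functions): search skips removed entities, while a direct lookup by URN still resolves and is labeled as removed.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory index: URN -> entity record with a `removed` flag.
ENTITIES: Dict[str, dict] = {
    "urn:li:dataset:orders": {"name": "orders", "removed": False},
    "urn:li:dataset:orders_legacy": {"name": "orders_legacy", "removed": True},
}


def search(query: str) -> List[str]:
    """Return URNs whose name contains the query, skipping removed entities."""
    return sorted(
        urn
        for urn, entity in ENTITIES.items()
        if query in entity["name"] and not entity["removed"]
    )


def resolve(urn: str) -> Optional[dict]:
    """Direct lookup ignores the removed flag, so old links keep resolving."""
    return ENTITIES.get(urn)


def render_title(urn: str) -> str:
    """Build the page title, tagging soft-deleted entities as removed."""
    entity = resolve(urn)
    if entity is None:
        return "NOT FOUND"
    if entity["removed"]:
        return f"{entity['name']} [REMOVED]"
    return entity["name"]
```

With a hard delete, `resolve` would return `None` for the legacy URN and the old link would break; with the flag, the page can still be rendered with a removed marker.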

mars-lan commented 4 years ago

@jywadhwani & @jplaisted have summarized it well. We purposely made all metadata aspects immutable and only soft delete entities. The idea is both to allow undoing of accidental deletions and to keep a long-running audit trail. At one point we did think about implementing automatic garbage collection of long soft-deleted entities and their aspects to keep storage in check. However, as we plan to introduce support for NoSQL storage backends (e.g. MongoDB, Cassandra, etc.) in the near future, the need for GC diminishes given their horizontal scalability.
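As a sketch of how immutable aspects give you both undo and an audit trail, the Python below appends a new version on every write instead of overwriting. The class names, the version numbering, and the `actor` field are assumptions made for illustration, not GMA's actual storage layout.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StatusVersion:
    """One immutable version of a hypothetical Status aspect."""
    version: int
    removed: bool
    actor: str


class StatusHistory:
    """Append-only history of Status versions, keyed by URN."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[StatusVersion]] = {}

    def write(self, urn: str, removed: bool, actor: str) -> StatusVersion:
        # Never overwrite: every change becomes a new version.
        history = self._versions.setdefault(urn, [])
        entry = StatusVersion(version=len(history), removed=removed, actor=actor)
        history.append(entry)
        return entry

    def latest(self, urn: str) -> StatusVersion:
        return self._versions[urn][-1]

    def audit_trail(self, urn: str) -> List[StatusVersion]:
        return list(self._versions[urn])


history = StatusHistory()
urn = "urn:li:dataset:example"  # illustrative URN
history.write(urn, removed=False, actor="alice")  # created
history.write(urn, removed=True, actor="bob")     # soft deleted
history.write(urn, removed=False, actor="alice")  # undeleted
```

The latest version says the dataset is live again, while `audit_trail` still shows who removed it and who restored it.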

mars-lan commented 4 years ago

We should definitely capture this information somewhere in the docs though, as others will probably start wondering the same thing in the future.

mars-lan commented 4 years ago

I guess it's sort of covered in https://github.com/linkedin/datahub/blob/master/docs/what/entity.md#what-is-an-entity but without mentioning the Status aspect as the source of the removed flag or the rationale behind it.

mars-lan commented 4 years ago

Added a section here: https://github.com/linkedin/datahub/blob/master/docs/what/entity.md#how-to-delete-an-entity Hopefully this clears things up for you, @liangjun-jiang.