FraunhoferISST / diva

Scalable data management system with an AI powered profiling and metadata enrichment
https://fraunhoferisst.github.io/diva-docs/
Apache License 2.0

Delete concept for Entities #72

Closed setaman closed 3 years ago

setaman commented 3 years ago

Is your feature request related to a problem? Please describe.

DIVA currently has no official concept for deleting or deactivating created entities (e.g. Resources, Assets, Users). The only way is to hard delete the entities directly, either through the corresponding service API or in the database.

Describe the solution you'd like

So we need to provide a concept for deletion/deactivation/archiving. Actually, you don't want to have a hard delete at all, because hard-deleted data is gone and cannot be recovered. And we certainly should not give such power to our users without proper role management. Soft delete would be the better alternative, but it comes with its own problems and opportunities.

  1. Simple soft delete option: Mark the entity as deleted and leave it in the original database/collection/index

    • Extend the entity schema with a deletedAt timestamp field that marks an entity as deleted
    • Instead of deleting, update the entity and set deletedAt
    • Create a special "deleted" history entry
    • Remove the entity from other databases/collections/entities (on event) and keep it only in the MongoDB collection
    • Can simply be restored at any time (just remove the deletion timestamp)

    Disadvantages:

  2. Soft archive option: Move the deleted entities to a separate archive
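
A minimal sketch of the simple soft delete option, assuming an in-memory dict stands in for the MongoDB collection. The function names and the sample entity are hypothetical and only illustrate the deletedAt idea; this is not DIVA's actual implementation.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the MongoDB entity collection.
entities = {
    "resource:1": {"id": "resource:1", "title": "Sales report"},
}


def soft_delete(entity_id):
    """Mark the entity as deleted instead of removing it."""
    entity = entities[entity_id]
    entity["deletedAt"] = datetime.now(timezone.utc).isoformat()
    return entity


def restore(entity_id):
    """Undo a soft delete by dropping the deletion timestamp."""
    entity = entities[entity_id]
    entity.pop("deletedAt", None)
    return entity


def list_active():
    """Return only the entities that carry no deletedAt timestamp."""
    return [e for e in entities.values() if "deletedAt" not in e]


soft_delete("resource:1")
print(list_active())       # the soft-deleted entity is filtered out
restore("resource:1")
print(len(list_active()))  # the entity is visible again
```

Every read path has to apply the same deletedAt filter, which is the main cost of keeping deleted entities in the original collection.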

Describe alternatives you've considered

The described flows are not mandatory; we can tweak them depending on concrete requirements and wishes. In addition to the soft archive, I would suggest a disable or read-only option that keeps the entity visible but blocks any kind of editing. So we should probably have two options:

  1. Make entity read-only - you can find and see it, but can't edit or profile it
  2. Delete entity - the entity will be soft archived as proposed above, but you can't search for it (at least not in a legitimate way). We can leave a blank details page that says the entity may have been deleted and shows the latest history entries, if any exist.

It would also make sense to keep the history entries, or at least the latest few.

mspiekermann commented 3 years ago

I would argue for the "simple soft delete option", as it seems to me to be a suitable solution with less effort. I would also expect fewer side effects with asset management etc.

Nevertheless, the archiving approach is also valid, but it could be added as an additional feature in the future (when use cases require it or we run into database sizes that couldn't be handled otherwise). I am thinking of a button for admins like "archive datasets"; we would then iterate over all data sources with a deletedAt stamp and transfer them to a separate archive database.
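
The "archive datasets" pass described above could look roughly like the following sketch. Two dicts stand in for the main database and the separate archive database, and `archive_deleted` is a hypothetical helper name; the sample records are made up for illustration.

```python
# Hypothetical stand-ins for the main database and a separate archive database.
main_db = {
    "resource:1": {"id": "resource:1", "title": "A", "deletedAt": "2021-05-01T10:00:00Z"},
    "resource:2": {"id": "resource:2", "title": "B"},
}
archive_db = {}


def archive_deleted(main, archive):
    """Move every entity carrying a deletedAt stamp into the archive."""
    # Collect the ids first so the dict is not mutated while iterating over it.
    to_move = [eid for eid, entity in main.items() if "deletedAt" in entity]
    for eid in to_move:
        archive[eid] = main.pop(eid)
    return to_move


moved = archive_deleted(main_db, archive_db)
print(moved)  # ids that were transferred to the archive
```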

setaman commented 3 years ago

Measured by implementation effort, both solutions are quite simple thanks to our technical infrastructure. The concepts differ mostly in their semantics. I'm not completely happy with option 1 (very common in the software world), as it slightly violates our API's semantics. It would call a PATCH instead of a DELETE, and I don't want to produce a "delete" event on a PATCH. The other way around, I don't want to make a DELETE request when nothing will actually be deleted. These are just the little things that drive software engineers crazy.

If we want a simple, ready-to-go solution with soft delete, I would propose the following:

  1. Give the users a "delete" button in the client. Ask for delete confirmation and state that the resource can be restored with its most important metadata
  2. The soft delete request calls the corresponding API PATCH route and patches the resource with { deleted: "<current date>"}. On PATCH, only a subset of important metadata (e.g. title, uniqueFingerprint, entityType etc.) is kept.
  3. The PATCH request produces a normal update event
  4. The entity page reports that the resource was deleted, shows a few fields like title, resourceType etc., maybe the latest history entries, and offers the possibility to restore the entity
  5. Other components react to the update event:
    • Search Assistant: processes the update event, checks for the presence of deleted, and hard removes the entity from the index -> it's not searchable anymore
    • DIVA Lake Adapter: nothing, leave the file in MinIO
    • Reviews Management: nothing, users should still be able to access their reviews
    • History Management: a) keep only the first and last entries, or b) nothing
    • Other entity management services: nothing, all entities linked by id stay linked. Currently we only have relations between an Asset and Resources/Assets/Users

      We should not just remove the entity from an Asset. It can lead to confusion when entities disappear just like that. Instead, we leave the link to the dead entity in place and let a human decide whether to remove it or not.

  6. GET/{id} returns the "deleted" entity as usual
  7. All system components have to deal with the deleted field on their own
  8. GET filters the deleted entity from response (??)
  9. PATCH the entity with { deleted: "<current date>"} to restore it
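
The PATCH-based flow in steps 2, 3 and 5 can be sketched as follows. This is only an illustration under assumptions: the dict `collection`, the set `search_index`, and the `publish`/subscriber mechanism are hypothetical stand-ins for the database, the search index and the event bus, and "id" in `KEPT_FIELDS` is an assumption added on top of the fields named in the list.

```python
from datetime import datetime, timezone

# Fields kept on a soft-deleted entity. title, uniqueFingerprint and entityType
# are the examples from the list above; keeping "id" is an assumption.
KEPT_FIELDS = {"id", "title", "uniqueFingerprint", "entityType"}

# Hypothetical stand-ins for the entity collection and the search index.
collection = {
    "resource:1": {
        "id": "resource:1",
        "title": "Sales report",
        "uniqueFingerprint": "abc123",
        "entityType": "resource",
        "description": "Quarterly numbers",
    },
}
search_index = {"resource:1"}
subscribers = []


def publish(event):
    """Deliver an event to every registered handler (stand-in for the event bus)."""
    for handler in subscribers:
        handler(event)


def patch_soft_delete(entity_id):
    """Reduce the entity to its key metadata, set 'deleted', emit an update event."""
    entity = collection[entity_id]
    reduced = {k: v for k, v in entity.items() if k in KEPT_FIELDS}
    reduced["deleted"] = datetime.now(timezone.utc).isoformat()
    collection[entity_id] = reduced
    publish({"type": "update", "id": entity_id})  # a normal update event, not "delete"


def search_handler(event):
    """Search side: on update, drop entities that carry the 'deleted' field."""
    entity = collection.get(event["id"])
    if event["type"] == "update" and entity is not None and "deleted" in entity:
        search_index.discard(event["id"])


subscribers.append(search_handler)
patch_soft_delete("resource:1")
print(sorted(collection["resource:1"]))  # remaining field names
print(search_index)                      # the entity is no longer indexed
```

Because the event is an ordinary update, every consumer has to inspect the deleted field itself, which is what step 7 implies.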

@mspiekermann @DaTebe feel free to post your thoughts if you have any objections. I can start on this issue next week.

DaTebe commented 3 years ago

The first thing we should do is adjust the management services to support possible deletion or archiving in a correct way. These changes should be easy to make.

Delete:

  1. We should call the DELETE route
  2. The management service should hard delete the entity
  3. The management service should produce a delete event

Archive:

  1. We should call our "soft delete" "archive"; then all engineers are happy with the terminology
  2. As already discussed, we can set an attribute (e.g. archivedAt) with a date
  3. What happens then needs to be discussed...
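
A minimal sketch of how the two operations could be kept apart in a management service. The dict `store` and the list `events` are hypothetical stand-ins for the database and the event stream, and the function names are illustrative only.

```python
from datetime import datetime, timezone

# Hypothetical stand-ins for the service's database and its outgoing events.
store = {
    "asset:1": {"id": "asset:1", "title": "Old asset"},
    "asset:2": {"id": "asset:2", "title": "Kept asset"},
}
events = []


def delete_entity(entity_id):
    """DELETE route: hard delete the entity, then produce a delete event."""
    del store[entity_id]
    events.append({"type": "delete", "id": entity_id})


def archive_entity(entity_id):
    """Archive: keep the entity and stamp it with archivedAt."""
    store[entity_id]["archivedAt"] = datetime.now(timezone.utc).isoformat()
    # What happens after this point is still open in the discussion,
    # so no event is produced in this sketch.


delete_entity("asset:1")
archive_entity("asset:2")
print(events)
print(sorted(store))
```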

Once we have done all of this, the more complex questions arise. Maybe we can already agree on the steps I described. If yes, we can go deeper into the rabbit hole.

setaman commented 3 years ago

@DaTebe all management services already support DELETE.

Hard delete is obviously the easiest solution and a good one, until an entity is accidentally deleted. Then the requirement for archiving will arise.

We can start with a normal hard delete and see what happens. But "Archive" or "Backup" is an important concept for a system like DIVA, and we should keep it in mind.

DaTebe commented 3 years ago

That sounds good. Don't get me wrong. We should implement both solutions in our backend. How we propagate it to our client needs to be discussed.

setaman commented 3 years ago

As discussed with @DaTebe, we start with hard delete and delete all possible traces of the entity (Histories, Search, Assets). Then we will incrementally add archive features as the need arises.

github-actions[bot] commented 3 years ago

Branch 72-Delete-concept-for-Entities created for issue: Delete concept for Entities and assigned to null

setaman commented 3 years ago

@DaTebe hard delete is implemented in #80

DaTebe commented 3 years ago

@setaman what info does the DSC adapter use to reference a resource?

setaman commented 3 years ago

@DaTebe the resource object holds the offerId, ruleId etc. under dsc.offer. That data is used to update offers on DSC. But all these problems, including the ones with MinIO, would disappear with the archive feature in the next iteration on this issue. So for now we can leave this as is.

DaTebe commented 3 years ago

No, we can not ignore it. How do we delete the unreferenced data in our second "archive" iteration?

setaman commented 3 years ago

We could send an "archive" event. On this event, the resource will be removed everywhere except from the original collection. Services would be able to read the required data either from the original collection or from the archive, depending on implementation details. After this, the hard delete would not be an issue.

For now, I can duplicate the DSC info to another collection together with the corresponding resource id, and then on delete read it by resource id.
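
The interim workaround could be sketched like this, assuming the DSC identifiers sit under dsc.offer as described earlier in the thread. The dicts `resources` and `dsc_refs`, the function names and the sample identifiers are hypothetical.

```python
# Hypothetical stand-ins for the resource collection and the extra collection
# that duplicates the DSC info, keyed by resource id.
resources = {
    "resource:1": {
        "id": "resource:1",
        "title": "A",
        "dsc": {"offer": {"offerId": "offer-42", "ruleId": "rule-7"}},
    },
}
dsc_refs = {}


def remember_dsc_info(resource_id):
    """Copy dsc.offer into the separate collection while the resource still exists."""
    offer = resources[resource_id].get("dsc", {}).get("offer")
    if offer is not None:
        dsc_refs[resource_id] = dict(offer)


def hard_delete(resource_id):
    """Remove the resource itself; the duplicated DSC info survives."""
    resources.pop(resource_id, None)


def on_delete_event(resource_id):
    """Adapter side: look up (and drop) the DSC identifiers for a deleted resource."""
    return dsc_refs.pop(resource_id, None)


remember_dsc_info("resource:1")
hard_delete("resource:1")
info = on_delete_event("resource:1")
print(info)  # identifiers the adapter would need to clean up the offer
```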

DaTebe commented 3 years ago

There was no way to store our uuid inside the dsc, correct?

setaman commented 3 years ago

correct

DaTebe commented 3 years ago

Another hacky solution would be to send the entire metadata as an event. But we need to remember the max message size of 1 MB...

Edit: we could also look into dedicated key-value databases that map from the uuid to whatever the service needs to identify the resource.

setaman commented 3 years ago

Here is another one: DSC allows us to attach some additional properties. We could put the resource id on the offer. But DSC's API does not provide a way to filter the offers, so one would have to go through all the offers to find the needed one.

These are all hacky solutions that we don't really want to implement.

DaTebe commented 3 years ago

I've tested our implementation according to our specification. Everything worked fine for me. Absolutely no hiccups.