GSA / data.gov

Main repository for the data.gov service
https://data.gov

disallow deletion of parent dataset within a Collection #4594

Open Jin-Sun-tts opened 9 months ago

Jin-Sun-tts commented 9 months ago

How to reproduce

Related #4553

In the local environment, find a collection and run ckan dataset purge <collection_package_id> or ckan dataset delete <collection_package_id>.

Expected behavior

Deleting the parent dataset of a collection should not be allowed while child datasets are still associated with it.

Actual behavior

Parent datasets are deleted even when other datasets still belong to the same collection.

Sketch

To address this issue, we need to implement a check at the delete stage. If a collection still contains child datasets, the system should prevent deletion of the parent dataset. This ensures that a parent dataset is retained for as long as it has associated child datasets.
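A minimal sketch of that delete-stage check, using an in-memory catalog rather than CKAN itself. The `Dataset` class, the `delete_dataset` helper, and the exception name are hypothetical; the `collection_package_id` field is assumed to hold the id of the parent dataset, and a real implementation would hook into CKAN's own delete and purge paths instead.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


class ParentHasChildrenError(Exception):
    """Raised when a collection parent still has child datasets."""


@dataclass
class Dataset:
    id: str
    # Assumption: a child stores the id of its collection parent here.
    collection_package_id: Optional[str] = None


def children_of(catalog: Dict[str, Dataset], parent_id: str) -> List[str]:
    """Return the ids of all datasets that point at parent_id."""
    return [d.id for d in catalog.values() if d.collection_package_id == parent_id]


def delete_dataset(catalog: Dict[str, Dataset], dataset_id: str) -> None:
    """Delete a dataset, refusing if it is a parent with remaining children."""
    children = children_of(catalog, dataset_id)
    if children:
        raise ParentHasChildrenError(
            f"{dataset_id} still has {len(children)} child dataset(s)"
        )
    del catalog[dataset_id]
```

With this guard, deleting a parent succeeds only after its last child has been removed.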

FuhuXia commented 9 months ago

One tricky situation to include in the tests: a parent dataset and all of its child datasets are deleted in the same harvest job. In this case we should allow deleting the parent.
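That case could be handled by evaluating the whole deletion batch at once instead of one dataset at a time. The sketch below assumes the harvest job can supply the full set of ids it intends to delete; `deletable_in_batch` and its arguments are hypothetical names, not part of the harvester.

```python
from typing import Dict, Set


def deletable_in_batch(
    parent_to_children: Dict[str, Set[str]], to_delete: Set[str]
) -> Set[str]:
    """Return the ids from to_delete that are safe to remove.

    A parent is safe only if every one of its children is also in the
    batch; datasets without children are always safe.
    """
    allowed = set()
    for dataset_id in to_delete:
        remaining = parent_to_children.get(dataset_id, set()) - to_delete
        if not remaining:
            allowed.add(dataset_id)
    return allowed
```

A parent whose children are all in the same batch passes the check, while a parent with even one surviving child is held back.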

jbrown-xentity commented 9 months ago

I would consider this differently. If a parent dataset is removed without its children, I see two scenarios for what a data provider would expect:

  1. The child datasets should be deleted as well
  2. The child datasets should have their reference to the parent removed, making them normal dataset records (and not associated with a non-existent parent)

Really this situation shouldn't exist if datasets are managed appropriately by data providers, but we can't rely on that. Every response seems like a hack. However, keeping the parent dataset when it's been removed from the source feels wrong.
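The two scenarios can be compared side by side in a small sketch. It models the catalog as a plain mapping from dataset id to parent id (or `None`); the function name and the `"cascade"` / `"detach"` strategy labels are made up for illustration and do not correspond to any existing CKAN option.

```python
from typing import Dict, List, Optional


def delete_parent(
    catalog: Dict[str, Optional[str]], parent_id: str, strategy: str
) -> List[str]:
    """Delete parent_id and handle its children according to strategy.

    catalog maps each dataset id to its parent id, or None.
    "cascade" deletes the children too (scenario 1).
    "detach" clears their parent reference (scenario 2).
    Returns the sorted ids of the affected children.
    """
    if strategy not in ("cascade", "detach"):
        raise ValueError(f"unknown strategy: {strategy}")
    children = sorted(k for k, p in catalog.items() if p == parent_id)
    for child_id in children:
        if strategy == "cascade":
            del catalog[child_id]
        else:
            catalog[child_id] = None
    del catalog[parent_id]
    return children
```

Either way, no dataset is left pointing at a parent that no longer exists.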

jbrown-xentity commented 9 months ago

Not the same, but for reference we do something similar in datagov-dedupe: when we need to remove a duplicate parent, we loop through all of its children and make sure they point to the correct new parent. Here we could just remove the reference to the parent if the parent is deleted...
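A rough illustration of that re-pointing loop, again over a plain id-to-parent mapping. This is not the datagov-dedupe code; `repoint_children` is a hypothetical helper that only shows the shape of the operation.

```python
from typing import Dict, List, Optional


def repoint_children(
    catalog: Dict[str, Optional[str]], old_parent: str, new_parent: str
) -> List[str]:
    """Move every child of old_parent to new_parent.

    catalog maps each dataset id to its parent id, or None.
    Returns the sorted ids of the children that were moved.
    """
    if new_parent not in catalog:
        raise KeyError(f"new parent {new_parent} is not in the catalog")
    moved = sorted(k for k, p in catalog.items() if p == old_parent)
    for child_id in moved:
        catalog[child_id] = new_parent
    return moved
```

Once the children have been moved, the duplicate parent has no dependents left and can be removed without orphaning anything.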

nickumia commented 9 months ago

> The child datasets should have their reference to the parent removed, making them normal dataset records (and not associated with a non-existent parent)

From an outsider's perspective, the second option seems more logical. Given that the child datasets still exist, it makes more sense to keep them and drop the relationship to the missing parent. If data providers intended to delete the child datasets, they should be on the hook for doing so when managing their metadata catalog.

From a data system perspective, I think it makes more sense to ensure that the agency source catalog matches what's in the data.gov catalog.

From a user perspective, I can't say I have a good answer.