ckan / ckanext-spatial

Geospatial extension for CKAN
http://docs.ckan.org/projects/ckanext-spatial

Move datasets to delete first in line #261

Open jbrown-xentity opened 2 years ago

jbrown-xentity commented 2 years ago

We have reports at data.gov of datasets that get re-harvested with an extra "1" in the URL, and we have confirmed these reports. The harvester seems to do the best it can to work out whether a dataset is new or not, but it still fails in some circumstances. This change probably won't fix the bug, but it will mitigate it. If the dataset removals run first, then in cases where the spatial harvester is essentially doing a "delete and add" when it should be replacing, the name of the new dataset won't collide with the one that is marked for deletion but still in the system. This keeps the URL the same and breaks fewer workflows.
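The ordering idea can be sketched in isolation. This is only an illustration, not the ckanext-spatial code: `HarvestOp` and `order_deletes_first` are hypothetical names, and the real gather stage creates harvest objects rather than tuples.

```python
from typing import NamedTuple


class HarvestOp(NamedTuple):
    """Hypothetical stand-in for a queued harvest object."""
    action: str  # "new", "change" or "delete"
    guid: str


def order_deletes_first(ops):
    """Return ops with every "delete" ahead of the rest.

    sorted() is stable, so the relative order inside each group is kept.
    """
    return sorted(ops, key=lambda op: op.action != "delete")


ops = [
    HarvestOp("new", "guid-b"),
    HarvestOp("delete", "guid-a"),
    HarvestOp("change", "guid-c"),
]
ordered = order_deletes_first(ops)
```

Reordering only changes when the removal happens; whether the name is actually free afterwards depends on how the removal is carried out.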

amercader commented 2 years ago

@jbrown-xentity It's been a long time since I worked on this, but IIRC the harvesters call package_delete to delete a dataset. That marks it as deleted but leaves it in the database (as opposed to a package_purge call), which means the dataset name can't be reused when creating a new one. Can you expand on why changing the order in which the "to delete" harvest objects are created helps in this case? (I'm sure the changes help, I just want to understand better.)
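The difference between the two kinds of removal can be shown with a small in-memory model. This is a sketch under stated assumptions: `DatasetStore` and its methods are hypothetical and do not call CKAN, and the "append a number" rule is a simplified guess at how a colliding name ends up with an extra "1".

```python
class DatasetStore:
    """Hypothetical in-memory model of dataset names, not CKAN's API."""

    def __init__(self):
        self._state = {}  # name -> "active" or "deleted"

    def create(self, wanted):
        """Create a dataset, appending a number if the name is taken."""
        name, n = wanted, 0
        while name in self._state:  # soft-deleted names still count
            n += 1
            name = f"{wanted}{n}"
        self._state[name] = "active"
        return name

    def soft_delete(self, name):
        """Model of a delete: the row stays, only its state changes."""
        self._state[name] = "deleted"

    def purge(self, name):
        """Model of a purge: the row is removed and the name is freed."""
        del self._state[name]


store = DatasetStore()
store.create("roads")
store.soft_delete("roads")
after_delete = store.create("roads")  # collides with the deleted row

store2 = DatasetStore()
store2.create("roads")
store2.purge("roads")
after_purge = store2.create("roads")  # name is free again
```

In this model `after_delete` is `"roads1"` while `after_purge` is `"roads"`, which is why ordering alone does not free the name after a soft delete.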

jbrown-xentity commented 2 years ago

@amercader No, I believe you're right: we would need to purge the dataset. I forgot about that functionality. I believe we actually should be purging; I don't see a likely scenario where a user would want to keep or "revive" a dataset that was harvested and has since been removed from the source... I updated the PR to use the "purge" command instead of "delete".

ccancellieri commented 2 years ago

I'm experiencing a problem after purging a harvested dataset. On the next harvest run the dataset is not harvested again, because the HarvestObject is still there tracking the date of last modification. As a result, you have to go into the DB (as I'm doing) and remove the harvest object by GUID.

I think the purge should eventually take care of the harvest object, or (since the core can't depend on an extension) we have to provide a purge for the harvest object table.
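A toy model of the problem, to make the failure mode concrete. Everything here is an assumption for illustration: `Harvester`, `run`, and `purge_dataset` are hypothetical names, and the real harvest object table holds more than a GUID and a date.

```python
class Harvester:
    """Hypothetical model: harvest objects keyed by GUID, not the real tables."""

    def __init__(self):
        self.datasets = {}         # guid -> dataset name
        self.harvest_objects = {}  # guid -> last-modified date seen

    def run(self, source):
        """Import records whose GUID is unknown or whose date changed."""
        imported = []
        for guid, modified in source.items():
            if self.harvest_objects.get(guid) == modified:
                continue  # tracked as unchanged, so skipped
            self.harvest_objects[guid] = modified
            self.datasets[guid] = f"dataset-{guid}"
            imported.append(guid)
        return imported

    def purge_dataset(self, guid, drop_harvest_object=False):
        """Remove the dataset; optionally drop its harvest object too."""
        self.datasets.pop(guid, None)
        if drop_harvest_object:
            self.harvest_objects.pop(guid, None)


source = {"abc": "2021-01-01"}

h = Harvester()
h.run(source)
h.purge_dataset("abc")
stuck = h.run(source)  # harvest object still says "unchanged"

h2 = Harvester()
h2.run(source)
h2.purge_dataset("abc", drop_harvest_object=True)
recovered = h2.run(source)
```

In the first case the second run imports nothing and the dataset stays missing; in the second case, dropping the harvest object along with the dataset lets the next run bring it back.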