Create Script to Purge Deleted Datasets from CKAN

btylerburton commented 3 months ago

Even when a dataset is "deleted" in CKAN it is still retained in the DB. Deleted datasets retain the URL they were given, thus forcing new harvests of that same source to post to a new URL. Datagovteam wants to create a script to purge a harvest source of all deleted datasets, thereby allowing us to clear a harvest source and, upon re-harvest, for the datasets to recover their original URLs.

How to reproduce

There are many ways that certain harvest sources have got in this state.

One being:

package is deleted by API
a new harvest is initiated
original URL is still retained by deleted dataset
dataset that is live is namespaced to account for origin url not being avaiable.

We can refine the steps as needed in future, but currently there is no single narrative of how this occurs nor does there need to be.

Expected behavior

When a dataset is removed and reharvested, then I can be assured that the original URL will be preserved upon reharvest.

Actual behavior

Datasets end up with integers or 5 character UUID's appended to their titles to account for the original URL being reserved.

Sketch

[ ] Create a script that accepts a harvest source id as an optional parameter
[ ] When that script is run with no parameters it will purge all deleted datasets from the db.
[ ] When that script is run with a harvest source id it will only purge all deleted datasets from the db.
[ ] This allows us to clear the harvest source and reharvest with the confidence that the original URLs will be preserved.

rshewitt commented 2 weeks ago

discussed with @FuhuXia and it seems like we have 2 options for freeing up datasets urls. i think this should be discussed as a group.

rename the dataset by appending its id to its name ( i think the url derives from the name but could be wrong ). then we would want to increase or remove the package name limit to get this to work then update the dataset state to "deleted". a benefit to this is that the deletion and freeing of the url is reversible.
purge the dataset completely. if we do this then we shouldn't run into a url-already-in-use error

btylerburton commented 2 weeks ago

I believe we'll want to go with Option 2. If in future we want to retain some history-saving feature we'd want to do that in the Flask app/Postgres DB.

FuhuXia commented 2 weeks ago

We can consider rewriting the story. The way it is written now means we can achieve the goal without purging deleted datasets. We just need to re-activate deleted datasets.

btylerburton commented 2 weeks ago

We can consider writing new Given When Then's. The story is written to only handle deletes, but if we want to consider logic to poll for the existence of previously deleted datasets first, that's fine too. I believe we'll still need a script to handle deletes as well though.

FuhuXia commented 2 weeks ago

Yes, a purge script will be nice to have. But its purpose is not working on a particular harvest source and free up the occupied names. Its purpose should be system-widely unclutter our DB to keep DB slim.

I can see there are three ways to address the original goal, i.e., deleted dataset should not cause url-already-in-use error.

As above mentioned, rename dataset before deleting them.
Never delete dataset. Always purge them.
Upon url-already-in-use, update the deleted dataset to activate it again, if they are from the same source.

Each has its pros and cons, but I think 3 is the most reasonable approach.

btylerburton commented 2 weeks ago

One more thing to find out: does the bulk delete API we're using to clear a harvest source, mark them as deleted, or does it purge?

FuhuXia commented 2 weeks ago

One more thing to find out: does the bulk delete API we're using to clear a harvest source, mark them as deleted, or does it purge?

No purge, just change datasets' state from active to deleted, and remove them from solr.

Jin-Sun-tts commented 2 weeks ago

One more thing to find out: does the bulk delete API we're using to clear a harvest source, mark them as deleted, or does it purge?

just mark them as 'deleted', FYI. here is the function: https://github.com/ckan/ckan/blob/master/ckan/logic/action/update.py#L1302

btylerburton commented 2 weeks ago

Is there some internal CKAN mechanism that compares metadata of incoming create object and determines it is same as deleted dataset and just undeletes it?

This is what I've observed:

harvest a datajson
dataset is created, exists at ex. /dataset/sample
perform harvest clear, which runs bulk delete under the hood
dataset is gone from UI, but visible with package show, state is 'deleted' as opposed to 'active'
harvest same datajson
dataset is revived under old URL

one would think that each time you harvest, you get the new combination of extra characters. logs say we are running create. why is old dataset getting revived?

FuhuXia commented 2 weeks ago

CKAN does not compare, or does not care. The compare function offered by harvester extension. Each harvester has a way to tell whether the new dataset is the same as a deleted dataset. For datajson type, it uses identifier. For WAF type, it uses the URL path. If the values are the same, the deleted one is updated with new info and and at the same time its state becomes active.

btylerburton commented 2 weeks ago

That's where I'm confused. Because our harvester considers these operations "creates", and is running package_create so how is it that the old datasets are getting revived instead of us seeing the new ones going live with 5 special characters?

On Wed, Nov 13, 2024 at 9:09 PM Fuhu Xia @.***> wrote:

CKAN does not compare, or does not care. The compare function offered by harvester extension. Each harvester has a way to tell whether the new dataset is the same as a deleted dataset. For datajson type, it uses identifier. For WAF type, it uses the URL path. If the values are the same, the deleted one is updated with new info and and at the same time its state becomes active.

— Reply to this email directly, view it on GitHub https://github.com/GSA/data.gov/issues/4852#issuecomment-2475321604, or unsubscribe https://github.com/notifications/unsubscribe-auth/AY45Y4J5MRYPTZPTE3EGLNL2AQH7RAVCNFSM6AAAAABM2X4GMSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINZVGMZDCNRQGQ . You are receiving this because you authored the thread.Message ID: @.***>

--

Tyler Burton Software Engineer Data.gov @.*** (612) 406-7653

FuhuXia commented 2 weeks ago

As of now, when harvester calls package_create and sees url-already-in-use, it adds 5 chars. No revival effort.

What should happen:

package_create to add dataset my-dataset, it sees url-already-in-use.
call package_show with id/name my-dataset
compare harvest source and identifier from the response with the payload. 3-1. if it is the same source and same identifier, call package_update with id/name my-dataset. This revive it. 3-2. else, adds 5 chars and call package_create again with name my-dataset-abcde.

GSA / data.gov