btylerburton opened 3 months ago
Discussed with @FuhuXia and it seems like we have 2 options for freeing up dataset URLs, i.e., avoiding the `url-already-in-use` error. I think this should be discussed as a group.

I believe we'll want to go with Option 2. If in the future we want to retain some history-saving feature, we'd want to do that in the Flask app/Postgres DB.
We can consider rewriting the story. The way it is written now means we can achieve the goal without purging deleted datasets. We just need to re-activate deleted datasets.
We can consider writing new Given/When/Thens. The story is written to only handle deletes, but if we want to consider logic to poll for the existence of previously deleted datasets first, that's fine too. I believe we'll still need a script to handle deletes as well, though.
Yes, a purge script would be nice to have. But its purpose is not to work on a particular harvest source and free up the occupied names. Its purpose should be to unclutter our DB system-wide and keep the DB slim.
I can see there are three ways to address the original goal, i.e., a deleted dataset should not cause the `url-already-in-use` error:

1. Purge deleted datasets so their names are freed up.
2. Re-activate deleted datasets ahead of harvesting.
3. When the harvester sees `url-already-in-use`, update the deleted dataset to activate it again, if they are from the same source.

Each has its pros and cons, but I think 3 is the most reasonable approach.
One more thing to find out: does the bulk delete API we're using to clear a harvest source mark datasets as deleted, or does it purge them?
> One more thing to find out: does the bulk delete API we're using to clear a harvest source mark datasets as deleted, or does it purge them?

No purge, just changes the datasets' state from active to deleted, and removes them from Solr.
> One more thing to find out: does the bulk delete API we're using to clear a harvest source mark datasets as deleted, or does it purge them?

Just marks them as 'deleted', FYI. Here is the function: https://github.com/ckan/ckan/blob/master/ckan/logic/action/update.py#L1302
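For illustration, a minimal sketch (assuming the `ckanapi` client and a sysadmin token; the dataset name is hypothetical) that confirms a bulk-deleted dataset is only state-flipped, not purged:

```python
# Minimal check that bulk delete only flips state and does not purge.
# Assumes the ckanapi client and a sysadmin token; names are illustrative.
from ckanapi import RemoteCKAN
from ckanapi.errors import NotFound

ckan = RemoteCKAN("https://catalog.data.gov", apikey="SYSADMIN_TOKEN")

try:
    pkg = ckan.action.package_show(id="my-dataset")  # hypothetical name
    print(pkg["state"])  # prints "deleted" -- the row is still in the DB
except NotFound:
    print("truly purged: no row left in the package table")
```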
Is there some internal CKAN mechanism that compares the metadata of an incoming create object, determines it is the same as a deleted dataset, and just undeletes it?
This is what I've observed: one would think that each time you harvest, you get a new combination of extra characters. The logs say we are running `create`, so why is the old dataset getting revived?
CKAN does not compare, or does not care. The compare function is offered by the harvester extension. Each harvester has a way to tell whether the new dataset is the same as a deleted dataset. For the datajson type, it uses the identifier. For the WAF type, it uses the URL path. If the values are the same, the deleted one is updated with new info, and at the same time its state becomes active.
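Schematically, that revive check looks something like the sketch below. This is an illustrative sketch of the idea, not the actual ckanext-harvest/ckanext-datajson code; the `identifier` extra is an assumption (for WAF it would be the URL path):

```python
# Illustrative sketch of the harvester-side compare described above; not
# the actual ckanext-harvest/ckanext-datajson code. For datajson the match
# key is the dataset identifier; for WAF it would be the URL path.
def find_revivable(incoming_identifier, existing_packages):
    """Return a deleted package whose match key equals the incoming one."""
    for pkg in existing_packages:
        extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
        if pkg["state"] == "deleted" and extras.get("identifier") == incoming_identifier:
            # Updating this package flips its state back to active.
            return pkg
    return None
```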
That's where I'm confused. Because our harvester considers these operations "creates", and is running `package_create`, so how is it that the old datasets are getting revived instead of us seeing the new ones going live with 5 special characters?
As of now, when the harvester calls `package_create` and sees `url-already-in-use`, it adds 5 chars. No revival effort.
What should happen (a sketch of this flow follows the list):

1. The harvester calls `package_create` to add dataset `my-dataset` and sees `url-already-in-use`.
2. It calls `package_show` with id/name `my-dataset` to inspect the existing dataset.
3-1. If that dataset is deleted and from the same source, call `package_update` with id/name `my-dataset`. This revives it.
3-2. Else, add 5 chars and call `package_create` again with name `my-dataset-abcde`.
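A minimal sketch of that flow, assuming the `ckanapi` client; the error-string check, the `same_source()` helper, and the token are assumptions for illustration, not the actual harvester code:

```python
# Sketch of the proposed create-or-revive flow, assuming the ckanapi client.
# The error-string check, same_source() helper, and token are illustrative.
import secrets
import string

from ckanapi import RemoteCKAN, ValidationError

ckan = RemoteCKAN("https://catalog.data.gov", apikey="HARVESTER_TOKEN")

def same_source(pkg, harvest_source_id):
    # Hypothetical check: compare a harvest-source extra on the package.
    extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
    return extras.get("harvest_source_id") == harvest_source_id

def create_or_revive(dataset, harvest_source_id):
    try:
        return ckan.action.package_create(**dataset)              # step 1
    except ValidationError as err:
        # Assumption: the name collision surfaces as CKAN's
        # "That URL is already in use." validation message.
        if "already in use" not in str(err).lower():
            raise
    existing = ckan.action.package_show(id=dataset["name"])      # step 2
    if existing["state"] == "deleted" and same_source(existing, harvest_source_id):
        dataset["id"] = existing["id"]
        dataset["state"] = "active"                               # step 3-1: revive
        return ckan.action.package_update(**dataset)
    suffix = "".join(secrets.choice(string.ascii_lowercase) for _ in range(5))
    dataset["name"] = f"{dataset['name']}-{suffix}"               # step 3-2
    return ckan.action.package_create(**dataset)
```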
Even when a dataset is "deleted" in CKAN it is still retained in the DB. Deleted datasets retain the URL they were given, thus forcing new harvests of that same source to post to a new URL. The data.gov team wants to create a script to purge a harvest source of all deleted datasets, thereby allowing us to clear a harvest source so that, upon re-harvest, the datasets recover their original URLs.
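A minimal sketch of such a purge script, assuming `ckanapi` and a sysadmin token. Since deleted datasets are removed from Solr, the list of ids is assumed to come from a direct DB query (not shown):

```python
# Sketch of a purge script for one harvest source, assuming ckanapi and a
# sysadmin token. Deleted datasets are dropped from Solr, so dataset_ids is
# assumed to come from a direct Postgres query (not shown here).
from ckanapi import RemoteCKAN

ckan = RemoteCKAN("https://catalog.data.gov", apikey="SYSADMIN_TOKEN")

def purge_deleted(dataset_ids):
    """dataset_purge removes the rows entirely, freeing the names/URLs."""
    for ds_id in dataset_ids:
        pkg = ckan.action.package_show(id=ds_id)
        if pkg["state"] == "deleted":      # never purge an active dataset
            ckan.action.dataset_purge(id=ds_id)
            print(f"purged {ds_id} ({pkg['name']})")
```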
How to reproduce
There are many ways that certain harvest sources have gotten into this state.
One being:
We can refine the steps as needed in the future, but currently there is no single narrative of how this occurs, nor does there need to be.
Expected behavior
Actual behavior
Datasets end up with integers or 5-character UUIDs appended to their titles to account for the original URL being reserved.
Sketch