Open jbrown-xentity opened 2 years ago
Need to look at this if we run into speed issues
Posting some new stats. 30% of datasets are in the deleted state. Purging them will reduce database size (by 30%?) and may slightly improve page load performance. I haven't run the CKAN CLI purge command before, but these steps should work; they were tested with a previous ticket.
deleted state to to_delete.
New stats:
catalog=# SELECT COUNT(*), state FROM package GROUP BY state;
 count  |  state
--------+---------
 373477 | active
 156490 | deleted
      1 | draft
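For reference, the "30%" figure can be checked against the counts above. This is just arithmetic on row counts; how much disk space a purge actually frees depends on related tables and is not something these numbers show.

```python
# Row counts copied from the query output above.
counts = {"active": 373477, "deleted": 156490, "draft": 1}

total = sum(counts.values())
deleted_share = counts["deleted"] / total

print(f"total packages: {total}")
print(f"deleted share of package rows: {deleted_share:.1%}")
```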
The purge command only works on a single dataset id, so scripting that for 156490 datasets seems painful. There is an open ticket in CKAN for a command to handle this, but it hasn't been worked on... https://github.com/ckan/ckan/issues/4398. Basically, it's a known problem with purging datasets at our scale.
Deleted packages remaining in the DB will slow down the harvesting process and may bring in unexpected issues, such as what we found in https://github.com/GSA/data.gov/issues/2989.
A side effect of purging deleted datasets is that the add/update/delete counts in previous harvest reports will change, making those reports meaningless. To overcome this, we can purge only deleted datasets that are 1+ years old and, in the meantime, hide harvest reports that are 1+ years old.
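A rough sketch of how the "1+ years old" filter could be selected. It assumes that package.metadata_modified is a reasonable stand-in for when a dataset was deleted, which has not been verified here. The demo runs against an in-memory SQLite table with made-up rows so it can be tried without touching the catalog DB; against PostgreSQL the placeholder style would differ (for example %s with psycopg2). The function name and sample data are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumption: metadata_modified approximates the deletion time of a package.
OLD_DELETED_SQL = (
    "SELECT id, name FROM package "
    "WHERE state = 'deleted' AND metadata_modified < ? "
    "ORDER BY metadata_modified"
)


def old_deleted_packages(conn, now, min_age_days=365):
    """Return (id, name) for deleted packages older than min_age_days."""
    cutoff = (now - timedelta(days=min_age_days)).isoformat(sep=" ")
    return conn.execute(OLD_DELETED_SQL, (cutoff,)).fetchall()


# Demo with a throwaway in-memory table and invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE package (id TEXT, name TEXT, state TEXT, metadata_modified TEXT)"
)
conn.executemany(
    "INSERT INTO package VALUES (?, ?, ?, ?)",
    [
        ("a1", "old-deleted", "deleted", "2020-01-15 00:00:00"),
        ("b2", "recent-deleted", "deleted", "2022-06-01 00:00:00"),
        ("c3", "old-active", "active", "2019-03-01 00:00:00"),
    ],
)

rows = old_deleted_packages(conn, now=datetime(2022, 7, 1))
print(rows)  # only the deleted package older than one year is selected
```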
User Story
In order to keep CKAN data load under control, data.gov admins want to purge removed datasets from the catalog.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
Background
CKAN knows this is an outstanding issue (and we've raised it in the past), but unfortunately no one has worked on it.
There are currently 100K datasets that need to be removed:
Security Considerations (required)
None
Sketch
The only way we know of to get the list of items that need to be removed is through the database, with something like:
SELECT name, id FROM package WHERE state = 'deleted';
We would want to get this list, and then call either the CKAN CLI purge command or the CKAN API (something like datagov-dedupe) for each dataset.
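A minimal sketch of the API route, assuming the ids have already been exported from the query above. It calls CKAN's dataset_purge action once per id, which needs a sysadmin API token. The function names, the injectable post argument (there so the loop can be dry-run or tested without a live CKAN) and the lack of retry or rate limiting are choices made for this sketch, not something taken from datagov-dedupe.

```python
import json
import urllib.request


def http_post_json(url, payload, api_token):
    """POST a JSON payload to a CKAN action endpoint and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_token},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def purge_datasets(dataset_ids, base_url, api_token, post=http_post_json):
    """Purge each dataset id via dataset_purge; return (purged, failed) id lists."""
    url = base_url.rstrip("/") + "/api/3/action/dataset_purge"
    purged, failed = [], []
    for dataset_id in dataset_ids:
        try:
            result = post(url, {"id": dataset_id}, api_token)
            ok = bool(result.get("success"))
        except Exception:
            # HTTP errors (e.g. 403/404) and network failures are recorded, not fatal.
            ok = False
        (purged if ok else failed).append(dataset_id)
    return purged, failed
```

Usage would be something like purge_datasets(ids, "https://catalog.example", token), where the URL and token are placeholders. Running it in small batches and keeping the failed list for a second pass is probably safer than one long run over 150K+ ids.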