GSA / data.gov

Main repository for the data.gov service
https://data.gov

Purge deleted datasets #3999

Open jbrown-xentity opened 2 years ago

jbrown-xentity commented 2 years ago

User Story

In order to keep the CKAN data load under control, a data.gov admin wants to purge removed datasets from the catalog.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

CKAN knows this is an outstanding issue (and we've raised it in the past), but unfortunately no one has worked on it.

There are currently over 100K datasets that need to be removed (138,275 in the deleted state):

SELECT COUNT(*), state FROM package GROUP BY state;
  count  |  state  
---------+---------
 1342011 | active
  138275 | deleted
       1 | draft

Security Considerations (required)

None

Sketch

The only way we know of to get the list of items that need to be removed is through the database, with a query such as SELECT name, id FROM package WHERE state = 'deleted'. We would get this list, and then call either the CKAN CLI purge command or the CKAN API (something like datagov-dedupe).
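
As a rough illustration of the sketch above, the following Python reads the deleted rows through a DB-API connection and posts each id to CKAN's `dataset_purge` API action. This is only a sketch under assumptions: the function names, the injectable `send` callable, and the base URL are hypothetical, the token is assumed to belong to a sysadmin, and nothing here reflects how datagov-dedupe is actually implemented.

```python
import json
import urllib.request

# Query from the issue: every package currently in the deleted state.
DELETED_SQL = "SELECT name, id FROM package WHERE state = 'deleted'"


def fetch_deleted(conn):
    """Return (name, id) pairs using any DB-API 2.0 connection."""
    cur = conn.cursor()
    cur.execute(DELETED_SQL)
    return cur.fetchall()


def build_purge_request(base_url, api_token, dataset_id):
    """Build (but do not send) a POST to the dataset_purge action."""
    body = json.dumps({"id": dataset_id}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/3/action/dataset_purge",
        data=body,
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


def http_send(request):
    """Default sender: perform the request and discard the response body."""
    with urllib.request.urlopen(request) as resp:
        resp.read()


def purge_all(rows, base_url, api_token, send=http_send):
    """Purge each dataset in turn; collect failures instead of stopping."""
    failed = []
    for name, dataset_id in rows:
        try:
            send(build_purge_request(base_url, api_token, dataset_id))
        except OSError as exc:  # urllib's URLError/HTTPError derive from OSError
            failed.append((name, dataset_id, str(exc)))
    return failed
```

Collecting failures rather than raising lets a long run continue past individual errors, and the returned list can be retried later. The `send` parameter exists only so the loop can be exercised without touching a live catalog.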

hkdctol commented 2 years ago

Need to look at this if we run into speed issues

FuhuXia commented 1 year ago

Posting some new stats: about 30% of datasets are in the deleted state. Purging them will reduce the database size (by 30%?) and may slightly improve page load performance. I haven't run the CKAN CLI purge command before, but these steps should work; they were tested in a previous ticket.

  1. Query all deleted harvest sources and clear them.
  2. Set the state of everything marked deleted to to_delete.
  3. Clear the harvest source in the sandbox test org.

New stats:

catalog=# SELECT COUNT(*), state FROM package GROUP BY state;
 count  |  state
--------+---------
 373477 | active
 156490 | deleted
      1 | draft
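
For reference, the "30%" figure can be checked directly from the counts above. This is a small arithmetic sketch; the `state_shares` helper is a hypothetical name, and the numbers are copied from the query output.

```python
# Counts copied from the query output above.
counts = {"active": 373477, "deleted": 156490, "draft": 1}


def state_shares(counts):
    """Return each state's fraction of the total package count."""
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


shares = state_shares(counts)
print(f"deleted: {shares['deleted']:.1%}")  # prints "deleted: 29.5%"
```

Note that this is the share of rows in the package table, not of database size, so the actual space saved by a purge may differ.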

jbrown-xentity commented 1 year ago

The purge command only works on a single dataset id, so scripting that for 156490 datasets seems painful. There is an open ticket in CKAN for a command to handle this, but it hasn't been worked on... https://github.com/ckan/ckan/issues/4398. Basically, it's a known problem with purging datasets at our scale.

FuhuXia commented 1 year ago

Deleted packages remaining in the DB will slow down the harvesting process and may cause unexpected issues, such as the one we found in https://github.com/GSA/data.gov/issues/2989.

FuhuXia commented 1 year ago

A side effect of purging deleted datasets is that the add/update/delete counts in previous harvest reports will change, making those reports meaningless. To work around this, we can purge only deleted datasets that are 1+ years old and, in the meantime, hide harvest reports that are 1+ years old.
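
The age cutoff could be applied as a simple filter before purging. The sketch below assumes each row carries a last-modified timestamp (for example the package table's `metadata_modified` column) and that this timestamp is a reasonable stand-in for when the dataset was deleted. That assumption is not confirmed in this thread, and the `purge_candidates` helper is a hypothetical name.

```python
from datetime import datetime, timedelta


def purge_candidates(rows, now, min_age=timedelta(days=365)):
    """Return ids of deleted datasets last modified at least min_age ago.

    rows: iterable of (dataset_id, state, last_modified) tuples.
    """
    cutoff = now - min_age
    return [
        dataset_id
        for dataset_id, state, last_modified in rows
        if state == "deleted" and last_modified <= cutoff
    ]


# Illustrative rows only; ids and dates are made up for the example.
sample = [
    ("old-deleted", "deleted", datetime(2022, 6, 1)),
    ("recent-deleted", "deleted", datetime(2023, 6, 1)),
    ("old-active", "active", datetime(2020, 1, 1)),
]
print(purge_candidates(sample, now=datetime(2024, 1, 1)))  # ['old-deleted']
```

Passing `now` explicitly keeps the cutoff reproducible across a long-running purge, so the candidate set does not shift while the job is in progress.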