GSA / data.gov

Main repository for the data.gov service
https://data.gov

CKAN purge job #2271

Open jbrown-xentity opened 4 years ago

jbrown-xentity commented 4 years ago

User Story

In order to free up resources from deleted datasets, data.gov team members want a regularly scheduled purge job to remove deleted resources.

Acceptance Criteria

Background

Purging as an administrator through the web UI is not working: the operation runs long enough that the request times out in gunicorn. A command line option exists to purge a specific dataset, as well as a CKAN action. ~The ckan jobs functionality added in 2.7 will probably need to be utilized to implement this appropriately.~

Security Considerations (required)

~Any data removal should first be confirmed by the data managers/owners, as in the department of education case~.

The SSP should be updated to state that deleted data is kept for X days (based on our configuration).

Sketch

Given that the purge command/action already exists on a per-dataset basis, we should use that. This is atomic and will ensure consistency.

The implementation could look like this:

# pseudo code
deleted_retention_days = config.get('deleted_dataset_retention_days', 90)
candidates_for_purge = ckan.package.find(state='deleted', last_modified=less_than(timespan(-deleted_retention_days)), limit=1000)
for dataset in candidates_for_purge:
  assert dataset.state == 'deleted'
  ckan.action.dataset_purge(dataset.id)

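If the script crashes or times out, the next run will pick up where it left off: datasets that were already purged no longer match the query, so only the remaining deleted datasets are selected.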
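To make the resume behavior concrete, here is a minimal runnable sketch of the same loop. It makes no real CKAN calls: `find_deleted` and `purge` are hypothetical callables standing in for whatever query and per-dataset purge action end up being used, and `Dataset` is a hypothetical stand-in for a package record. Unlike the pseudo code, it skips unexpected records instead of asserting, which is just one possible choice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, List, Optional


@dataclass
class Dataset:
    """Hypothetical stand-in for a CKAN package record."""
    id: str
    state: str
    last_modified: datetime


def purge_deleted_datasets(
    find_deleted: Callable[[datetime, int], Iterable[Dataset]],
    purge: Callable[[str], None],
    retention_days: int = 90,
    limit: int = 1000,
    now: Optional[datetime] = None,
) -> List[str]:
    """Purge datasets deleted more than `retention_days` ago.

    `find_deleted(cutoff, limit)` and `purge(dataset_id)` are assumptions:
    they are injected so the loop can run without a CKAN instance.
    Returns the ids purged during this run.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    purged: List[str] = []
    for dataset in find_deleted(cutoff, limit):
        # Re-check each record so a bad query result never purges
        # an active dataset or one still inside the retention window.
        if dataset.state != "deleted" or dataset.last_modified >= cutoff:
            continue
        purge(dataset.id)  # one dataset at a time, as in the sketch above
        purged.append(dataset.id)
    return purged
```

Since each run only asks for datasets that are still in the deleted state, running it again after a crash or timeout selects whatever was not purged yet. A second run right after a complete one purges nothing.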

So that it can be shared between Catalog and Inventory, I think this should go into a new extension, ckanext-maintenance, that is not data.gov specific. Any other maintenance jobs can go there as well.

adborden commented 4 years ago

Let's make sure there's an issue opened upstream for this. I don't see a reason why CKAN wouldn't want the purge action to be more robust.

jbrown-xentity commented 4 years ago

Good call. I commented on the upstream ticket.

adborden commented 3 years ago

I updated the ticket with a sketch. I don't think we want to use jobs here. I think those are more for the web process handing off long-running work. This is really a scheduled task that cleans out deleted datasets each day.

adborden commented 3 years ago

FYI, CKAN 2.3.5 also has the purge command in the CLI.