GSA / data.gov

Main repository for the data.gov service
https://data.gov
Other
547 stars 87 forks source link

Harvest2.0 DB cleaning #4774

Open jbrown-xentity opened 3 weeks ago

jbrown-xentity commented 3 weeks ago

User Story

In order to keep the harvesting2.0 DB light and not fill up indefinitely, data.gov admins want a script to cleanup unnecessary data from the DB without losing current information in use.

Acceptance Criteria

Background

Should not be necessary for the Harvesting MVP, but will be required at some point in the future.

Security Considerations (required)

None.

Sketch

These are the expected process for production. This may or may not should be made configurable for testing purposes and future flexibility:

The no longer being used by catalog is a hard requirement and will always be in effect, the 90 day mark should be configurable. Ideally this will be a task that can be run on a regular basis, something like weekly/monthly.

rshewitt commented 3 weeks ago

so i figure to clean the harvester db records we'd want the opposite of how we get the most recent set of records (source) but for all harvest sources and only those which are older than 90 days. so maybe something like...

SELECT * FROM harvest_records WHERE ( NOW() - date_created ) > 90 
EXCEPT ALL
SELECT * FROM (
  SELECT DISTINCT ON (identifier) *
  FROM harvest_record
  WHERE status = 'success'
  ORDER BY identifier, date_created DESC ) sq
  WHERE sq.action != 'delete';
  )