WPRDC / wprdc-ops

Codeless repo for organizing project-wide work and issues.

develop data migration plan #13

Closed: saylorsd closed this issue 1 year ago

saylorsd commented 1 year ago

Current thoughts on what will need to be done.

drw commented 1 year ago

This is the Google Sheet with the inventory of ETL jobs:

https://docs.google.com/spreadsheets/d/1amPsZ1RA9d9m41MXkMmMB8-VDmwAOxU72A35DlSWbT4/edit#gid=0

It sounds like we need an inventory of all CKAN resources and to identify which need to be manually copied and which could conceivably be handled by running ETL processes.

One question: What do we need to do to preserve package IDs, resource IDs, and metadata like creation date?
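
As a starting point for that inventory, here is a minimal sketch of pulling every package and its resources from the CKAN action API. It assumes curl and jq are available and that $CKAN_URL points at the source site; the output file name resource_inventory.tsv is just a placeholder, and pagination and private datasets are ignored for brevity:

# List every package, then emit one line per resource: package name, resource id, format, url.
for pkg in $(curl -s "$CKAN_URL/api/3/action/package_list" | jq -r '.result[]'); do
  curl -s "$CKAN_URL/api/3/action/package_show?id=$pkg" \
    | jq -r --arg pkg "$pkg" '.result.resources[] | [$pkg, .id, .format, .url] | @tsv'
done > resource_inventory.tsv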

saylorsd commented 1 year ago

Thanks! I imagine using the API will cause some hassle with automatically generated metadata and the IDs.

We can explore what is available through the API to handle modifying IDs.

We can also look into how metadata for datastore datasets is handled by the database. My hope is that it'll be simple enough that we can do the migration at the db level, which would allow for direct transfer of those values, especially considering that the GUIDs used for resources are the names of the datastore tables.
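
As a quick sanity check on that assumption, one could list the datastore tables whose names look like UUIDs and compare them against the resource IDs reported by the API. A rough sketch, assuming psql access to the source datastore database via $SOURCE_CONNECTION_STRING:

# List user tables in the datastore whose names look like UUIDs (i.e., resource IDs).
psql $SOURCE_CONNECTION_STRING -At -c \
  "SELECT tablename FROM pg_tables
   WHERE schemaname = 'public'
     AND tablename ~ '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$';" \
  > datastore_tables.txt
# Any resource ID from the API inventory that is missing here is not datastore-backed
# and would need to be copied another way (e.g., filestore copy or an ETL re-run).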

saylorsd commented 1 year ago

Currently having success with pg_dump piped into the new db. I'm running it ~1/16th at a time by filtering on the first character of the table names, which are all UUIDs.

e.g. for the final set:

pg_dump --table 'f*' --create --clean --if-exists $SOURCE_CONNECTION_STRING | psql $DESTINATION_CONNECTION_STRING
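
For reference, the same 1/16th-at-a-time pattern can be expressed as a loop over all sixteen hex prefixes. This is a sketch of the approach described above, not necessarily the exact script that was run:

# Copy the datastore one hex prefix at a time; table names are UUIDs,
# so the first character is always 0-9 or a-f.
for prefix in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  pg_dump --table "${prefix}*" --create --clean --if-exists $SOURCE_CONNECTION_STRING \
    | psql $DESTINATION_CONNECTION_STRING
done
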
saylorsd commented 1 year ago

To document further developments in migrating data: it seems that using --create disables --clean. This led to some tables getting extra data from a previous, botched full-database dump made before the aforementioned 1/16-at-a-time attempt.

Found some groups with a lot of duplicate data, so we ran a new, cleansing dump for those groups. That led to a big drop-off in table-size mismatches.

The next step is running the script again on the few remaining incorrectly-sized tables.
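
One rough way to spot those mismatches is to compare approximate row counts per table between the two databases. A hedged sketch (the actual comparison script isn't shown in this thread), assuming psql access to both connection strings:

# Dump estimated row counts per table from each database and diff them.
# reltuples is an estimate (it depends on ANALYZE), so this flags gross mismatches only.
QUERY="SELECT relname, reltuples::bigint FROM pg_class c
       JOIN pg_namespace n ON n.oid = c.relnamespace
       WHERE n.nspname = 'public' AND c.relkind = 'r'
       ORDER BY relname;"
psql $SOURCE_CONNECTION_STRING -At -c "$QUERY" > source_counts.txt
psql $DESTINATION_CONNECTION_STRING -At -c "$QUERY" > destination_counts.txt
diff source_counts.txt destination_counts.txt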

This seems to work correctly:

pg_dump --table 'f*' --clean --if-exists $SOURCE_CONNECTION_STRING | psql $DESTINATION_CONNECTION_STRING