People had expressed concerns about migrating large data dumps from FCS to cloud.gov. The simplest way to sidestep that would be to pipe directly from mysqldump -> gzip -> aws s3 cp, which people do all the time. The S3 credentials would be for a bucket that we provision in our management space; see the instructions for getting credentials for use outside of cloud.gov. Restore would work the same way from an application instance: bind the S3 bucket, then run aws s3 cp -> gzip -dc -> mysql.
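For illustration, a minimal sketch of that pipeline, assuming the bucket credentials are already exported as AWS_* environment variables; the object name and mysqldump flags here are placeholders, not part of the original plan:
# Dump straight into S3 without staging the file on disk
mysqldump --single-transaction "$DB_NAME" | gzip | aws s3 cp - "s3://${BUCKET_NAME}/${DB_NAME}.sql.gz"
# Restore from an application instance with the S3 bucket bound
aws s3 cp "s3://${BUCKET_NAME}/${DB_NAME}.sql.gz" - | gzip -dc | mysql "$DB_NAME"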
Plan to do dashboard, then inventory, then catalog, with static site being "whenever ready". Subject to change.
For reference, here's how the more general backup strategy will work.
Confirmed that we have CAA for letsencrypt.org at data.gov, which will be inherited for all subdomains (unless overridden).
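One way to spot-check that record from a workstation (just a sketch; any DNS lookup tool works):
dig +short data.gov CAA
# expected output should include: 0 issue "letsencrypt.org"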
Database Migration Commands:
# Create Temporary S3 Storage
S3_NAME=migrate-data
S3_KEY=md-key
cf create-service s3 basic $S3_NAME
# Create a service key so the credentials can be retrieved for use outside of cloud.gov
cf create-service-key "${S3_NAME}" "${S3_KEY}"
S3_CREDENTIALS=`cf service-key "${S3_NAME}" "${S3_KEY}" | tail -n +2`
export AWS_ACCESS_KEY_ID=`echo "${S3_CREDENTIALS}" | jq -r .credentials.access_key_id`
export AWS_SECRET_ACCESS_KEY=`echo "${S3_CREDENTIALS}" | jq -r .credentials.secret_access_key`
export BUCKET_NAME=`echo "${S3_CREDENTIALS}" | jq -r .credentials.bucket`
export AWS_DEFAULT_REGION=`echo "${S3_CREDENTIALS}" | jq -r '.credentials.region'`
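As a quick sanity check that the exported credentials work before piping a large dump (not part of the original steps):
aws s3 ls "s3://${BUCKET_NAME}"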
# Non-binary PSQL Dump
pg_dump --no-acl --no-owner --clean -T spatial_ref_sys -T layer -T topology ckan > ckan.dump
# Binary PSQL Dump
pg_dump --format=custom --no-acl --no-owner --clean -T spatial_ref_sys -T layer -T topology ckan > ckan.dump
# Pipe into S3
<pg_dump> | gzip | aws s3 cp - s3://${BUCKET_NAME}/<backup_name.sql.gz>
# Pipe out of S3
aws s3 cp s3://${BUCKET_NAME}/<backup_name.sql.gz> - | gzip -dc | <psql/pg_restore>
# Non-binary restore
PGPASSWORD=$DB_PASS psql -h $DB_HOST -U $DB_USER -p $DB_PORT $DB_NAME < <backup>
# Binary restore
PGPASSWORD=$DB_PASS pg_restore -h $DB_HOST -p $DB_PORT -U $DB_USER --no-owner --clean -d $DB_NAME < <backup>
# Local/Cloud.gov Restore
# Local values:
DB_USER=ckan
DB_PASS=ckan
DB_HOST=127.0.0.1
DB_PORT=5432
DB_NAME=ckan
# Cloud.gov values, taken from the database service key.
# If no key exists, create one first:
# cf create-service-key <db_name> <db_key>
DB_CREDENTIALS=`cf service-key <db_name> <db_key> | tail -n +2`
export DB_NAME=`echo "${DB_CREDENTIALS}" | jq -r .credentials.db_name`
export DB_HOST=`echo "${DB_CREDENTIALS}" | jq -r .credentials.host`
export DB_PORT=`echo "${DB_CREDENTIALS}" | jq -r .credentials.port`
export DB_USER=`echo "${DB_CREDENTIALS}" | jq -r .credentials.username`
export DB_PASS=`echo "${DB_CREDENTIALS}" | jq -r .credentials.password`
# Create a temporary database to connect to while the main one is recreated
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "create database ckan_temp;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "drop extension IF EXISTS postgis cascade;"
# Disconnect any sessions still attached to the main database
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "select pg_terminate_backend(pid) from pg_stat_activity where datname='ckan';"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "drop database $DB_NAME;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "create database $DB_NAME;"
# Re-enable PostGIS in the fresh database, then drop the temporary one
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "create extension postgis;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "drop database ckan_temp;"
# Binary or Non-binary restore from above,
PGPASSWORD=$DB_PASS pg_restore -h $DB_HOST -p $DB_PORT -U $DB_USER --no-owner --clean -d $DB_NAME < <binary_dump>
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME < <non_binary_dump>
# Locally (docker-compose):
docker-compose exec ckan /bin/bash -c "ckan db upgrade"
docker-compose exec ckan /bin/bash -c "ckan search-index rebuild"
# On cloud.gov:
cf run-task catalog -c "ckan db upgrade"
cf run-task catalog -c "ckan search-index rebuild"
Final DB Migration Script: https://gist.github.com/nickumia-reisys/8a5da2c3e33b9b7fb2ada263b9f9c52e
Steps to replicate:
Run the 01-fcs-catalog-db-migration.sh script to create a backup from FCS (needs to run on the FCS production server).
Run the 02-fcs-catalog-db-migration.sh script locally to start the restore process on cloud.gov.
@nickumia-reisys since we needed some collaboration, I moved the scripts over as "docs"/usage scripts for cf-backup-manager: https://github.com/GSA/cf-backup-manager/pull/18 (I also made some changes).
We finally got the ckan db upgrade command to work; it took 7.25 hours to complete. See catalog logs on 10/12 to confirm.
I kicked off the ckan search-index rebuild command just now, to see if it crashes and/or to get an estimate on how long it will take on the full DB (current best estimate is 5 days).
ckan search-index rebuild crashed in 9 minutes with error code 137 (out of memory). Next steps for this ticket: use rebuild_fast (docs) to speed up the process; a sketch of running it as a cf task follows the staging/prod comparison below.
The database does look like it is functional. Accessing package_show on staging shows data equivalent to the catalog FCS prod UI:
Staging api route: https://catalog-stage-datagov.app.cloud.gov/api/action/package_show?id=megapixel-mercury-cadmium-telluride-focal-plane-arrays-for-infrared-imaging-out-to-12-micr FCS Prod route: https://catalog.data.gov/dataset/megapixel-mercury-cadmium-telluride-focal-plane-arrays-for-infrared-imaging-out-to-12-micr
Staging api route: https://catalog-stage-datagov.app.cloud.gov/api/action/package_show?id=namma-lightning-zeus-data-v1 FCS Prod route: https://catalog.data.gov/dataset/namma-lightning-zeus-data-v1
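A hedged sketch of that faster reindex as a cloud.gov task; the exact subcommand name should be confirmed against the CKAN docs linked above, and the memory limit and task name are assumptions aimed at the earlier out-of-memory failure:
cf run-task catalog -c "ckan search-index rebuild_fast" -m 4G --name reindex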
Courtesy of @jbrown-xentity: to check how many collections have been indexed on catalog, go to https://catalog.data.gov/api/action/package_search?q=collection_metadata=true
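For a quick count from the command line (the jq path assumes the standard CKAN package_search response shape):
curl -s 'https://catalog.data.gov/api/action/package_search?q=collection_metadata=true&rows=0' | jq '.result.count'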
I'm proposing that we don't need to take a new database dump and just run all of the harvest jobs on catalog production since it has all of the data since December 2021.
Harvesting activity is stopped. catalog.final.20220322.prod.gz was saved on S3.
Database is restored and Solr registered 17k datasets to reindex. Solr reindex is currently running.
After database restore, we need to run an ANALYZE command to collect new statistics.
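For example, using the connection variables exported above (ANALYZE with no arguments covers every table in the current database):
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "ANALYZE;"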
Catalog DB (prod) backup/restore times
Putting catalog on cloud.gov into a safe mode before migrating to minimize web traffic and performance issues:
DatagovWAFCatalog rate limit lowered from 1200 per 5 min to 600 per 5 min
After migration we will re-evaluate and gradually revert these changes to bring catalog back to normal.
============================
Reverting the protection after migration now that things are OK.
DatagovWAFCatalog rate limit reverted from 600 per 5 min back to 1200 per 5 min
Pointed the current production CDN to the catalog-web app on cloud.gov. Catalog.data.gov is officially migrated to cloud.gov. Things are looking fine: UI speed is good, catalog-web instances are stable, ECS Solr memory is normal. Will watch the performance over the next few days, and gradually take catalog.data.gov out of 'safe mode' and turn harvesting back on.
Initial harvesting has been running for 4 days. The dataset count increased by 57k, which is abnormal. Investigating the data.json source duplication issue now.
ckanext-datajson duplicate issue identified and fixed. Refreshing catalog with last FCS DB backup and reindexing Solr.
Change requests for the staging and production saml2 app were submitted to login.gov. Hopefully they can be deployed this Thursday, but it might take up to 2 weeks.
https://zendesk.login.gov/hc/en-us/requests/1073
https://zendesk.login.gov/hc/en-us/requests/1074
The error we saw during pg_restore, ERROR: schema "public" already exists, is due to making the dump with CLI version 9 but restoring with CLI version 12, as discussed in this StackExchange thread. Two ways we can try to resolve it:
--schema=public -d target_db
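A sketch of that option applied to the binary restore command above (untested here; pg_restore's --schema flag limits the restore to the named schema):
PGPASSWORD=$DB_PASS pg_restore -h $DB_HOST -p $DB_PORT -U $DB_USER --no-owner --clean --schema=public -d $DB_NAME < <binary_dump>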
IdP promoted to login.gov production. Migration completed.
User Story
In order to stop maintaining the FCS deployment, the data.gov team wants production service to be directed to our deployment on cloud.gov.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
Background
[Any helpful contextual notes or links to artifacts/evidence, if needed]
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!] This change will migrate us away from our old environment, which is harder to maintain and requires us to look after more things ourselves. The new environment has already been pen-tested and ATO'd, so we think it's going to be a net win on attack surface overall.
Launch plan
Pre-launch
In the days leading up to the launch, these tasks should be completed:
cf create-private-domain gsa-datagov catalog.data.gov
cf map-route catalog catalog.data.gov
catalog.data.gov CNAME catalog.data.gov.external-domains-production.cloud.gov
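Once the DNS change is in place, it can be verified with something like:
dig +short catalog.data.gov CNAME
# expected: catalog.data.gov.external-domains-production.cloud.gov.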
Launch
Tasks to be completed at the time of launch.
harvest_source_title to evaluate what harvest sources have data that should probably be public
catalog.data.gov CNAME catalog.data.gov.external-domains-production.cloud.gov
catalog-prod-datagov.apps.cloud.gov
Rollback
In the event a rollback is necessary, apply these tasks.