Migrate catalog to cloud.gov

adborden commented 3 years ago

User Story

In order to stop maintaining the FCS deployment, the data.gov team wants production service to be directed to our deployment on cloud.gov.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

[x] WHEN we visit catalog.data.gov in our browser \ THEN we see the expected catalog output \ AND we see requests in the catalog app's logs on cloud.gov

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!] This change will migrate us away from our old environment, which is harder to maintain and for which there are more things that we have to look after. The new environment has already been pen-tested and ATOd so we think it's going to be net win on attack surface overall.

Launch plan

Pre-launch

In the days leading up to the launch, these tasks should be completed:

[x] Review application configuration and all values are correct
[x] Ensure CAA record for letsencrypt.org exists for domain
[x] Familiarize yourself with the external domain service
[x] Open a DNS ticket for GSA to add _acme-challenge records for DNS validation (RITM0904225)
[x] Open a DNS ticket to reduce TTL for domains to 300
[x] Create domain cf create-private-domain gsa-datagov catalog.data.gov
[x] Map the domain route to the application in the prod space cf map-route catalog catalog.data.gov
[x] Confirm the production-ready checklist is complete
- [x] Application has a manifest.yml and is deployed from CI
- [x] Production application is running with at least 2 instances
- [x] Datastores are running "dedicated" or production grade plans
- [x] cf ssh access is disabled in the production space
- [x] NR Synthetics alerts are configured for this domain and reporting to #datagov-alerts
[x] Script datastore migrations them for repeatability
[x] Rehearse the launch and rollback plans below in staging, document any commands verbatim
[ ] ~Open a ticket with GSA DNS to coordinate the DNS switch over and update the DNS catalog.data.gov CNAME catalog.data.gov.external-domains-production.cloud.gov~
[x] Prep PR to update SAML configuration to use mirrored prod catalog app on identitysandbox.gov

Launch

Tasks to be completed at the time of launch.

[x] Stop automatic restarts of catalog
[x] ~Update harvest sources set to private to be public (use logged in catalog api call and search on harvest_source_title to evaluate what harvest sources have data that should probably be public).~
- ~_Note_all of the private harvest sources are listed here, but some of these are clearly "test" harvests that shouldn't be set to public_~ Deal this in another ticket.
[x] Stop harvest services in cloud.gov
[x] Stop harvest services in FCS
[x] Migrate datastores using these scripts
[x] Re-index
[ ] ~Have GSA update the DNS catalog.data.gov CNAME catalog.data.gov.external-domains-production.cloud.gov~
[x] Start harvest services in cloud.gov
[x] Switch GSAOPPPROD Cloudfront from FCS to catalog-prod-datagov.apps.cloud.gov
[x] Deploy SAML change and verify SAML login works with identitysandbox.gov
[x] Request to promote mirrored prod catalog app

In the event a rollback is necessary, apply these tasks.

[ ] Revert DNS changes, should point to original values in FCS
[ ] Ensure application services are started and running in FCS

mogul commented 3 years ago

People had expressed concerns about migrating large data dumps from FCS to cloud.gov. The simplest way to sidestep that would be to pipe directly from mysqldump -> gzip -> sws s3 cp, which people do all the time. The S3 credentials would be for a bucket that we provision in our management space; see the instructions for getting credentials for use outside of cloud.gov. Restore would work the same way from an application instance... Bind the S3 bucket, then run aws s3 cp -> gzip -dc -> mysql.

jbrown-xentity commented 3 years ago

Plan to do dashboard, then inventory, then catalog, with static site being "whenever ready". Subject to change.

mogul commented 2 years ago

For reference, here's how the more general backup strategy will work.

adborden commented 2 years ago

Confirmed that we have CAA for letsencrypt.org at data.gov, which will be inherited for all subdomains (unless overridden).

nickumia-reisys commented 2 years ago

Database Migration Commands:

# Create Temporary S3 Storage
S3_NAME=migrate-data
S3_KEY=md-key
cf create-service s3 basic $S3_NAME

S3_CREDENTIALS=`cf service-key "${S3_NAME}" "${S3_KEY}" | tail -n +2`
export AWS_ACCESS_KEY_ID=`echo "${S3_CREDENTIALS}" | jq -r .credentials.access_key_id`
export AWS_SECRET_ACCESS_KEY=`echo "${S3_CREDENTIALS}" | jq -r .credentials.secret_access_key`
export BUCKET_NAME=`echo "${S3_CREDENTIALS}" | jq -r .credentials.bucket`
export AWS_DEFAULT_REGION=`echo "${S3_CREDENTIALS}" | jq -r '.credentials.region'`

# Non-binary PSQL Dump
pg_dump --no-acl --no-owner --clean -T spatial_ref_sys -T layer -T topology ckan > ckan.dump

# Binary PSQL Dump
pg_dump --format=custom --no-acl --no-owner --clean -T spatial_ref_sys -T layer -T topology ckan > ckan.dump

# Pipe into S3
<pg_dump> | gzip | aws s3 cp - s3://${BUCKET_NAME}/<backup_name.sql.gz>

# Pipe out of S3
aws s3 cp s3://${BUCKET_NAME}/<backup_name.sql.gz> - | gzip -dc | <psql/pg_restore>

# Non-binary restore
PGPASSWORD=$DB_PASS psql -h $DB_HOST -U $DB_USER -p $DB_PORT $DB_NAME < <backup>

# Binary restore
PGPASSWORD=$DB_PASS pg_restore -h $DB_HOST -p $DB_PORT -U $DB_USER --no-owner --clean -d $DB_NAME < <backup>

# Local/Cloud.gov Restore
DB_USER=ckan
DB_PASS=ckan
DB_HOST=127.0.0.1
DB_PORT=5432
DB_NAME=ckan

# If no key exists,
# cf create-service-key <db_name> <db_key>
DB_CREDENTIALS=`cf service-key <db_name> <db_key> | tail -n +2`
export DB_NAME=`echo "${DB_CREDENTIALS}" | jq -r .credentials.db_name`
export DB_HOST=`echo "${DB_CREDENTIALS}" | jq -r .credentials.host`
export DB_USER=`echo "${DB_CREDENTIALS}" | jq -r .credentials.username`
export DB_PASS=`echo "${DB_CREDENTIALS}" | jq -r .credentials.password`

PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -c "create database ckan_temp;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "drop extension IF EXISTS postgis cascade;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "select pg_terminate_backend(pid) from pg_stat_activity where datname='ckan';"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "drop database $DB_NAME;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d ckan_temp -c "create database $DB_NAME;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -c "create extension postgis;"
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -c "drop database ckan_temp;"

# Binary or Non-binary restore from above,
PGPASSWORD=$DB_PASS pg_restore -h $DB_HOST -p $DB_PORT -U $DB_USER --no-owner --clean -d $DB_NAME < <binary_dump>
PGPASSWORD=$DB_PASS psql -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME < <non_binary_dump>

docker-compose exec ckan /bin/bash -c "ckan db upgrade"
docker-compose exec ckan /bin/bash -c "ckan search-index rebuild"
cf run-task catalog -c "ckan db upgrade"
cf run-task catalog -c "ckan db search-index rebuild"

nickumia-reisys commented 2 years ago

Final DB Migration Script: https://gist.github.com/nickumia-reisys/8a5da2c3e33b9b7fb2ada263b9f9c52e

Steps to replicate:

Run 01-fcs-catalog-db-migration.sh script to create backup from FCS (needs to run on FCS production server).
Run 02-fcs-catalog-db-migration.sh script locally to start restore process on cloud.gov.

jbrown-xentity commented 2 years ago

@nickumia-reisys since we needed some collaboration, I moved the scripts as "docs" or usage scripts for cf-backup-manager: https://github.com/GSA/cf-backup-manager/pull/18 (I also made some changes). We finally got the ckan db upgrade command to work, it took 7.25 hours to complete. See catalog logs on 10/12 to confirm. I kicked off the ckan search-index rebuild command just now, to see if it crashes and/or to get an estimate on how long it will take on the full DB (current best estimate is 5 days).

jbrown-xentity commented 2 years ago

ckan search-index rebuild crashed in 9 minutes with error code 137 (out of memory). Next steps for this ticket:

[x] Verify dataset integrity by using ckan api (/api/action/package_show?id=name)
[x] Try bumping memory in run-task (-k default 1G, -m default 512) to avoid memory issues, track closely at the beginning
Consider using rebuild_fast (docs) to speed up process
Consider turning off harvesters on prod now, so that when we get the right configuration we can "go", and not repeat all the steps in waterfall
If bumping memory does not work, examine if we need more solr instances (should google to see if there is an estimate on solr instances needed per number of records)
We could try to get a database dump of the dataset names, and break them up into digestable files, and running a task to index each dataset individually and parallelize.

nickumia-reisys commented 2 years ago

Database does look like it is functional, Accessing package_show on staging shows data equivalent to catalog fcs prod UI,

Staging api route: https://catalog-stage-datagov.app.cloud.gov/api/action/package_show?id=megapixel-mercury-cadmium-telluride-focal-plane-arrays-for-infrared-imaging-out-to-12-micr FCS Prod route: https://catalog.data.gov/dataset/megapixel-mercury-cadmium-telluride-focal-plane-arrays-for-infrared-imaging-out-to-12-micr

Staging api route: https://catalog-stage-datagov.app.cloud.gov/api/action/package_show?id=namma-lightning-zeus-data-v1 FCS Prod route: https://catalog.data.gov/dataset/namma-lightning-zeus-data-v1

nickumia-reisys commented 2 years ago

Courtesy of @jbrown-xentity, To check how many collections have been indexed on catalog, go to https://catalog.data.gov/api/action/package_search?q=collection_metadata=true

nickumia-reisys commented 2 years ago

I'm proposing that we don't need to take a new database dump and just run all of the harvest jobs on catalog production since it has all of the data since December 2021.