cityofaustin / atd-data-tech

Austin Transportation Data & Technology Services

Create VZ Backup to S3 ETL #10378

Closed patrickm02L closed 5 months ago

patrickm02L commented 2 years ago

Turn this gist into an ETL orchestrated by Airflow. It should run daily, but the frequency can be adjusted later in the Airflow UI.

work with @frankhereford

frankhereford commented 2 years ago

@roseeichelmann I'm excited to pair with you and all other interested developers on the spin-up of the Dockerized Agent system, or to discuss other methods or frameworks you'd like to employ on this task. Please feel free to catch me on Slack, and we can work out a time that's good for you and others. Thanks!

frankhereford commented 1 year ago

I think that this can be done reliably from the database bastion host on ec2 with a command similar to this:

docker compose -f docker-compose.yml -f docker-compose-docker-volume.yml run --rm -e PGHOST=[snip].rds.amazonaws.com -e PGDATABASE=atd_vz_data -e PGUSER=username -e PGPASSWORD=supersecretpassword db-tools pg_dump --clean --create --no-owner --no-privileges --if-exists --exclude-table-data atd_txdot_change_log > vz_pg_dump.sql

I am setting this issue as a blocker to a new enhancement that I wrote up today, stemming from some frustrations I had earlier with the existing tooling, as seen here. I would be eager to combine the above script with a command line invocation to upload the output to S3, which is then ripe to be orchestrated with Prefect.

I'd be super happy to help with any of that if you'd like - thanks!
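
The two steps described above (dump the database, then upload the output to S3) could be combined roughly as follows. This is a minimal sketch, not the script the team actually ran. It assumes `pg_dump` and the AWS CLI are installed on the host with the `PG*` connection variables already exported, instead of going through the `docker compose ... db-tools` wrapper. The bucket name, the S3 key layout, and the function names `backup_key` and `run_backup` are hypothetical.

```shell
# Sketch only: bucket name, key layout, and function names are hypothetical.
# Assumes PGHOST, PGDATABASE, PGUSER, and PGPASSWORD are already exported,
# and that pg_dump and the AWS CLI are available on the PATH.

# Build the S3 object key for a given date stamp (YYYY-MM-DD).
backup_key() {
  printf 'vz/%s/vz_pg_dump.sql.gz' "$1"
}

# Dump the database, compress the dump, and upload it to S3.
# $1 is the destination bucket (a placeholder default is used if omitted).
run_backup() {
  bucket="${1:-example-vz-backups}"
  stamp="$(date -u +%Y-%m-%d)"
  workdir="$(mktemp -d)" || return 1
  dump="$workdir/vz_pg_dump.sql"

  # Same pg_dump flags as the command above: the change-log table's
  # schema is kept, but its row data is excluded from the dump.
  if pg_dump --clean --create --no-owner --no-privileges --if-exists \
       --exclude-table-data atd_txdot_change_log > "$dump" \
     && gzip "$dump" \
     && aws s3 cp "$dump.gz" "s3://$bucket/$(backup_key "$stamp")"; then
    status=0
  else
    status=1
  fi

  # Remove the temporary files whether or not the upload succeeded.
  rm -rf "$workdir"
  return "$status"
}
```

Sourcing this file and calling `run_backup some-bucket` from a daily scheduler (cron, Airflow, or Prefect) would cover the schedule discussed in this issue. Writing the dump to a file before compressing, instead of piping into `gzip`, means a failed `pg_dump` stops the chain before anything is uploaded.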

frankhereford commented 1 year ago

I have a script which backs up the database, compresses the export, and uploads two versions of it to S3. This is a reinvention of the old cron-based solution, running on the rds-bastion host in EC2. The use of cron will tide us over until this task is completed: https://github.com/cityofaustin/atd-data-tech/issues/11722. Using cron also unblocks this issue: https://github.com/cityofaustin/atd-data-tech/issues/11715.

johnclary commented 5 months ago

I am closing this issue. @frankhereford @roseeichelmann if y'all think there is a reasonable cost/benefit to configuring S3 backups, versus using RDS backups and putting our attention elsewhere, we can revisit this idea.