cityofaustin / atd-vz-data

The technology that powers the City of Austin's Vision Zero program
https://visionzero.austin.gov/viewer/

CRIS import and CR3 ETLs #1478

Closed johnclary closed 3 weeks ago

johnclary commented 1 month ago

Associated issues

This is the new CRIS import, complete with CR3 pdf processing. This is ready for review, but please keep in mind these follow-up todos which I intend to address in follow-up issues (pending your feedback + approval):

Testing

Setup

  1. Start your local Vision Zero stack (database + Hasura + editor) using a recent copy of production

  2. From the ./atd-vzd directory, apply migrations and metadata:

$ hasura migrate apply
$ hasura metadata apply

  3. Grab a copy of the environment file from our 1pass dev vault. The item is named "Env file for the Vision Zero new data model CRIS import ETL". Save it as .env in the ./atd-etl/data_model directory

  4. Build the Docker image. You only need to do this once

# from ./atd-etl/data_model
$ docker compose build

  5. Start the Docker container and drop into its shell

# from ./atd-etl/data_model
$ docker compose run cris_import

  6. It's time to run the CRIS import script! You're going to test a few different CLI options, described below.

End-to-end CRIS import

This will download each extract available in the S3 ./inbox, unzip it, load the CSV crash records into the database, crop crash diagrams out of the CR3 PDFs, and upload the CR3 PDFs and crash diagrams to the S3 bucket.

# from the cris_import container's shell
$ ./cris_import.py --csv --pdf --s3-download --s3-upload

-- SQL: inspect the import log
select * from cris_import_log order by id desc;

Local import

This will process the extract zips that were downloaded to your ./extracts directory during the previous step. CSVs will be loaded into the db, and crash diagrams will be extracted but not uploaded to S3.

$ ./cris_import.py --csv --pdf

Archive and un-archive zips

The script can archive the extract zips by moving them from ./inbox to ./archive once they have been processed. This is intended for the production deployment, where the ./inbox functions as a work queue.

# download zips from ./inbox, process csvs, and archive the zips when done
$ ./cris_import.py --csv --s3-download --s3-archive
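
S3 has no native move operation, so archiving an object generally means copying it to the new key and then deleting the original. Here is a minimal sketch of that pattern, assuming a boto3-style client and `inbox/` and `archive/` key prefixes. The function name, its parameters, and the prefixes are hypothetical and are not taken from `cris_import.py`.

```python
def archive_zip(s3_client, bucket, key, inbox_prefix="inbox/", archive_prefix="archive/"):
    """Move one extract zip from the inbox prefix to the archive prefix.

    Hypothetical helper: names and prefixes are assumptions for illustration.
    """
    if not key.startswith(inbox_prefix):
        raise ValueError(f"{key!r} is not under {inbox_prefix!r}")
    new_key = archive_prefix + key[len(inbox_prefix):]
    # S3 cannot rename an object: copy it to the new key first...
    s3_client.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": key},
        Key=new_key,
    )
    # ...then delete the original so the inbox keeps acting as a work queue
    s3_client.delete_object(Bucket=bucket, Key=key)
    return new_key
```

Copying before deleting means a failure partway through leaves the zip in the inbox, where it would simply be picked up again on the next run.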

Restore the zips to the ./inbox using the helper script:

$ python _restore_zips_from_archive.py
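
For reviewers who want a mental model of the restore step, here is a rough sketch of copying archived zips back to the inbox. It is not the code in `_restore_zips_from_archive.py`: the function name, the `delete_source` flag, and the key prefixes are assumptions, and it expects a boto3-style client plus a list of keys that the caller has already listed.

```python
def restore_zips(s3_client, bucket, archived_keys,
                 archive_prefix="archive/", inbox_prefix="inbox/",
                 delete_source=False):
    """Copy archived extract zips back to the inbox and return the new keys.

    Hypothetical helper: by default the archived copy is left in place.
    """
    restored = []
    for key in archived_keys:
        # ignore anything that is not a zip under the archive prefix
        if not key.startswith(archive_prefix) or not key.endswith(".zip"):
            continue
        inbox_key = inbox_prefix + key[len(archive_prefix):]
        s3_client.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=inbox_key,
        )
        if delete_source:
            # opt-in: turn the copy into a move
            s3_client.delete_object(Bucket=bucket, Key=key)
        restored.append(inbox_key)
    return restored
```

With `delete_source=False` the zip ends up in both places, so the archive still records that the extract was processed at least once.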

Other flags to test

# process any unzipped extract CSVs you have in your `./extracts` directory
$ ./cris_import.py --skip-unzip --csv

# process pdfs with more workers and in debug mode
$ ./cris_import.py --pdf --skip-unzip --s3-upload --workers 8 --verbose
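
To keep the flags straight while testing, here is a sketch of how a CLI with these options could be declared with `argparse`. Only the flag names come from the commands above. The help text paraphrases this description, the `--workers` default is an assumption, and the real parser in `cris_import.py` may be organized differently.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the flags exercised in this PR."""
    parser = argparse.ArgumentParser(description="Sketch of the CRIS import CLI")
    parser.add_argument("--csv", action="store_true",
                        help="load CSV crash records into the database")
    parser.add_argument("--pdf", action="store_true",
                        help="process CR3 PDFs and crop crash diagrams")
    parser.add_argument("--s3-download", action="store_true",
                        help="download extract zips from the S3 inbox")
    parser.add_argument("--s3-upload", action="store_true",
                        help="upload CR3 PDFs and crash diagrams to S3")
    parser.add_argument("--s3-archive", action="store_true",
                        help="move processed zips from the inbox to the archive")
    parser.add_argument("--skip-unzip", action="store_true",
                        help="use extracts that are already unzipped locally")
    # the default worker count here is an assumption, not the script's value
    parser.add_argument("--workers", type=int, default=4,
                        help="number of workers for PDF processing")
    parser.add_argument("--verbose", action="store_true",
                        help="enable debug output")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that `argparse` converts dashes to underscores, so `--s3-upload` is read back as `args.s3_upload`.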
johnclary commented 1 month ago

thanks again for the feedback. i am re-requesting review after having made the following changes:

i'm going to leave the _column_metadata improvements for a separate PR.

roseeichelmann commented 1 month ago

i notice when you restore from archive the extracts get copied over to the inbox but stay in the archived folder too. i guess that makes sense as expected behavior but just wanted to point it out in case y'all think it makes more sense to move them?

johnclary commented 3 weeks ago

> i notice when you restore from archive the extracts get copied over to the inbox but stay in the archived folder too. i guess that makes sense as expected behavior but just wanted to point it out in case y'all think it makes more sense to move them?

i did this partially out of laziness, but was also thinking it's probably nice to have that breadcrumb of seeing the extract in the ./archive and knowing it was processed. happy to change this though 🤷

johnclary commented 3 weeks ago

@roseeichelmann i am tracking a few follow-up todos for the cris_import_log. will not forget them! 🙏