CRIS import and CR3 ETLs

johnclary commented 1 month ago

Associated issues

This is the new CRIS import, complete with CR3 pdf processing. This is ready for review, but please keep in mind these follow-up todos which I intend to address in follow-up issues (pending your feedback + approval):

Archive the old ETLs and rename this ETL subdirectory
Expand readme
Consolidate / clean up our various S3 buckets
Revisit / extend the logging in the cris_import_log table, and probably rename that table to _cris_import_log
Github CI, 1pass integration, airflow deployment, etc.

Testing

Setup

Start your local Vision Zero stack (database + Hasura + editor) using a recent copy of production
From the ./atd-vzd directory, apply migrations and metadata:

$ hasura migrate apply
$ hasura metadata apply

Grab a copy of the environment file from our 1pass dev vault. The item is named Env file for the Vision Zero new data model CRIS import ETL. Save it as .env in the ./atd-etl/data_model directory
Build the Docker image (you only need to do this once)

# from ./atd-etl/data_model
$ docker compose build

Start the Docker container and drop into its shell

# from ./atd-etl/data_model
$ docker compose run cris_import

It's time to run the CRIS import script! You're going to test a few different CLI options, described below.

End-to-end CRIS import

This will download each extract available in the S3 ./inbox, unzip it, load the CSV crash records into the database, crop crash diagrams out of the CR3 PDFs, and upload the CR3 PDFs and crash diagrams to the S3 bucket.

# from the cris_import container's shell
$ ./cris_import.py --csv --pdf --s3-download --s3-upload

Open your VZE and verify that the crashes list loads normally
Open a crash details page and verify that the crash diagram renders properly
The CR3 download button should be enabled, but clicking it will result in an error. This is a known issue, because the CR3 API is pointed at a different bucket
Scroll down to the change log and observe that the change log includes updates to cr3_processed_at and cr3_stored_fl fields
Head to the AWS console and locate the PDFs and diagrams in the S3 bucket: vision-zero-new-data-model-dev/dev/cr3s. Observe that the crash diagrams and PDFs have Last modified timestamps that track with when you ran the import script.
Use your SQL client to query the cris_import_log table. Verify that there are new entries for each extract you processed

select * from cris_import_log order by id desc;

Local import

This will process the extract zips that were downloaded to your ./extracts directory during the previous step. CSVs will be loaded into the db, and crash diagrams will be extracted but not uploaded to S3.

$ ./cris_import.py --csv --pdf

Archive and un-archive zips

The script can archive the extract zips by moving them from ./inbox to ./archive once they have been processed. This is intended for the production deployment, where the ./inbox functions as a work queue.

# download zips from ./inbox, process csvs, and archive the zips when done
$ ./cris_import.py --csv --s3-download --s3-archive

Restore the zips to the ./inbox using the helper script

$ python _restore_zips_from_archive.py
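For reviewers who want a mental model of the archive/restore behavior, here is a minimal sketch. It is not the code in this PR: it uses local directories as stand-ins for the S3 ./inbox and ./archive prefixes, and the function names `archive_zip` and `restore_zips` are hypothetical. Archiving moves the zip out of the queue, while restoring copies it back and leaves the archived copy in place.

```python
import shutil
from pathlib import Path


def archive_zip(zip_path: Path, archive_dir: Path) -> Path:
    """Move a processed extract out of the inbox so it leaves the work queue."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    dest = archive_dir / zip_path.name
    shutil.move(str(zip_path), str(dest))
    return dest


def restore_zips(archive_dir: Path, inbox_dir: Path) -> list[Path]:
    """Copy every archived zip back into the inbox, leaving the archive intact."""
    inbox_dir.mkdir(parents=True, exist_ok=True)
    restored = []
    for zip_path in sorted(archive_dir.glob("*.zip")):
        # copy2 (not move) keeps the archived file as a record of processing
        restored.append(Path(shutil.copy2(zip_path, inbox_dir / zip_path.name)))
    return restored
```

Swapping `shutil.copy2` for `shutil.move` in `restore_zips` would give move semantics instead, at the cost of losing the archived copy.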

Other flags to test

# process any unzipped extract CSVs you have in your `./extracts` directory
$ ./cris_import.py --skip-unzip --csv

# process pdfs with more workers and in debug mode
$ ./cris_import.py --pdf --skip-unzip --s3-upload --workers 8 --verbose
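As a rough illustration of what the --workers flag implies, the sketch below fans per-PDF work out to a pool of workers. Everything here is an assumption rather than the PR's actual code: `process_pdf` is a placeholder, the default worker count is a guess, the exception type is arbitrary, and a thread pool is used only to keep the example simple (CPU-bound cropping may be better served by a process pool).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def process_pdf(pdf_path: Path) -> str:
    """Placeholder for the per-PDF work (crop the diagram, upload, etc.)."""
    return pdf_path.stem


def process_pdfs(pdf_paths: list[Path], workers: int = 4) -> list[str]:
    """Process PDFs concurrently and return results in input order."""
    if not pdf_paths:
        # fail loudly instead of silently doing nothing
        raise FileNotFoundError("no PDFs found to process")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs
        return list(pool.map(process_pdf, pdf_paths))
```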

johnclary commented 1 month ago

thanks again for the feedback. i am re-requesting review after having made the following changes:

modify CLI to require at least one of --csv or --pdf. running $ cris_import.py without any other args will throw an error 👍
expand the readme
raise an error if no pdfs are found
fix a bug where the created_by and updated_by columns were being removed and therefore not being set to cris
fix some typos
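For context on the first item, this is roughly how an "at least one of --csv or --pdf" rule can be enforced with argparse. It is a sketch, not the code in this PR: the flag names are taken from the commands in the description, while the function names, help text, and the --workers default are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the PR description."""
    parser = argparse.ArgumentParser(prog="cris_import.py")
    parser.add_argument("--csv", action="store_true", help="process extract CSVs")
    parser.add_argument("--pdf", action="store_true", help="process CR3 PDFs")
    parser.add_argument("--s3-download", action="store_true")
    parser.add_argument("--s3-upload", action="store_true")
    parser.add_argument("--s3-archive", action="store_true")
    parser.add_argument("--skip-unzip", action="store_true")
    parser.add_argument("--workers", type=int, default=4)  # default is a guess
    parser.add_argument("--verbose", action="store_true")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if not (args.csv or args.pdf):
        # parser.error prints usage to stderr and exits with status 2
        parser.error("at least one of --csv or --pdf is required")
    return args
```

A mutually exclusive group would not fit here because --csv and --pdf can be combined, so the check is done after parsing.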

i'm going to leave the _column_metadata improvements for a separate PR.

roseeichelmann commented 1 month ago

i notice when you restore from archive the extracts get copied over to the inbox but stay in the archived folder too. i guess that makes sense as expected behavior but just wanted to point it out in case yall think it makes more sense to move them?

johnclary commented 3 weeks ago

> i notice when you restore from archive the extracts get copied over to the inbox but stay in the archived folder too. i guess that makes sense as expected behavior but just wanted to point it out in case yall think it makes more sense to move them?

i did this partially out of laziness, but was also thinking it's probably nice to have that breadcrumb of seeing the extract in the ./archive and knowing it was processed. happy to change this though 🤷

johnclary commented 3 weeks ago

@roseeichelmann i am tracking a few follow-up todos for the cris_import_log. will not forget them! 🙏
