Data model: main collector branch

johnclary commented 3 months ago

This PR holds the main collector branch for the new data model.

Get it running

Start your local VZ using a production snapshot
from the ./atd-vzd directory, apply migrations and metadata:

$ hasura migrate apply
$ hasura metadata apply

Navigate to ./atd-etl/cris_import and run the CRIS import following the instruction in the ETL readme. There is an .env file available in 1password under Env file for the Vision Zero new data model CRIS import ETL

This command will process the extracts available in the ./dev directory in S3:

$ docker compose run cris_import
# from the cris_import container's shell
$ ./cris_import.py --s3-download --s3-upload --csv --pdf

To populate the VZ edits, navigate to ./atd-toolbox/data_model/python/populate_edits_sql and run the commands in each of their corresponding crashes, units, and people .sql files in order
Start the VZE and check it out
To run the Socrata export, locate the 1pass item called Env file for the Vision Zero new data model Socrata export ETL and save it in the ./atd-etl/socrata_export directory as .env.
Build and run the Socrata export. This may take up to 20 minutes.

# from ./atd-etl/socrata_export
$ docker compose run socrata_export
$ ./socrata_export.py --crashes --people

Start the VZV and check it out

johnclary commented 3 months ago

👋 Thanks for reviewing this PR. Heads-up that I will be working on down migrations right away, since this wasn't necessary up until just now, when I moved everything into the public schema.

For time being, please focus your review attention on the conventions, assumptions, and functionality being put forward by this branch that may be at odds with our vision for the data model, or will be difficult to undo down the road. That said, no detail is too small :)

johnclary commented 3 months ago

also—sorry for more noise—I do not think there is any need to review the files in ./atd-toolbox at this time. the atd-vzd and atd-vze changes are what matters :)

roseeichelmann commented 3 months ago

This is suchhhh amazing work!!! I still have some more to dig into but here are some notes I have so far:

I like your handling of peh in the cris_import script and dropping the names if not a fatality. What about for updates, where inj_sev is updated to a fatality, then do we want to carry over the name? Or is that getting too complicated?
So great that you cleaned up the injury counts and now they are handled automatically in the crashes_list view by summing up the counts of injury severities from the person tables instead of the vz team having to keep everything up to date in multiple places. Now they only have to QA in one place, under the injury severity column of the persons table.
What about prefixing column names with vz_ if they are custom so we can quickly know if something is not from cris? That would help sooo much imo and i know we talked about doing something like that

johnclary commented 3 months ago

thanks @roseeichelmann for your review!

I like your handling of peh in the cris_import script and dropping the names if not a fatality. What about for updates, where inj_sev is updated to a fatality, then do we want to carry over the name? Or is that getting too complicated?

My thinking was that if a VZ user edits a person injury severity to a fatality, they will need to input the person's name by manually copying it from the CR3 pdf. If the injury severity change comes via CRIS import, then the name will be picked up in the ETL and correctly added to the person record ✅

So great that you cleaned up the injury counts and now they are handled automatically in the crashes_list view by summing up the counts of injury severities from the person tables instead of the vz team having to keep everything up to date in multiple places. Now they only have to QA in one place, under the injury severity column of the persons table.

Yeah! There are still some open questions about the best way to approach some of the metrics. And for that we have this WIP PR.

What about prefixing column names with vz_ if they are custom so we can quickly know if something is not from cris? That would help sooo much imo and i know we talked about doing something like that

Yes, I haven't forgotten about it this idea. I think I'm a little on the fence about because I'd like a more foolproof way for us to maintain our list of CRIS vs VZ columns. Maybe i'll open a little side branch just so we can see what this looks like.

The alternative i think still prefer is having a column_metadata table in the db where we track this info, plus other CRIS import and VZE-related metadata (e.g. we could track which columns are in the public vs private CRIS extract). This table would basically replace the function of the current google sheet 🤔

roseeichelmann commented 3 months ago

@johnclary I definitely think a column_metadata table is something we should have, but when im just scrolling through the database or a table i think i would find it helpful to see a column and know that its custom based off the name; im defs a fan of descriptive column names. curious what everyone else thinks! yeah maybe a spin off branch to test out how it feels could be good

roseeichelmann commented 3 months ago

Something else ive run into that i wanna mention - i still feel like the crash_id should be carried into the edits tables. i was looking for a unit by the id in the units_edits table and i found it unhelpful that the crash_id column was null because i was trying to validate that i was editing the right unit in this specific crash.. idk maybe im still thinking about all this wrong and in a crash id-centric way?

johnclary commented 3 days ago

Thanks y'all, here we go....

cityofaustin / atd-vz-data

Data model: main collector branch #1440

Get it running