Closed frankhereford closed 2 months ago
Thanks for the extra day on this, y'all, after I realized how much better it could be done based on @johnclary's spreadsheet instead of in an overly general way. This new program unrolls a lot of the loops and allows for small conditions and handling of fiddly bits of data in the various tables. The four functions, one for each table being processed, are very similar, but differ in tiny ways to handle the idiosyncratic nature of each conversion.
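As a rough illustration of the "one function per table" shape described above (all table, column, and function names here are invented for the sketch, not taken from the actual script):

```python
# Illustrative sketch only: names and rules are hypothetical, not from this PR.
def migrate_persons(old_rows):
    """Per-table conversion with its own handling of fiddly bits."""
    out = []
    for row in old_rows:
        new = dict(row)
        # Example of a table-specific condition: treat an empty
        # death time as NULL rather than carrying the empty string over.
        if new.get("prsn_death_time") == "":
            new["prsn_death_time"] = None
        out.append(new)
    return out

print(migrate_persons([{"prsn_id": 1, "prsn_death_time": ""}]))
# -> [{'prsn_id': 1, 'prsn_death_time': None}]
```

Each of the four functions follows this pattern but carries its own small conditions, which is why unrolling the loops beats a single generic converter here.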
Thanks!
Ooops, I deleted this PR's base. Sorry! I have opened https://github.com/cityofaustin/atd-vz-data/pull/1476 as a replacement.
Associated issues
This PR hopes to close https://github.com/cityofaustin/atd-data-tech/issues/17207.
Items to touch on in data model sync
`prsn_death_date` and `prsn_death_time` fields ✅

A note on speed and memory
This program, just by the nature of the issue it addresses, is going to compare an enormous number of data points. Previously, all the subject data was loaded into memory, which offered a huge speedup by eliminating any DB processing and IO latency at the expense of memory usage. This ended up not being workable on all of our dev machines, so I have refactored the script to use the `shelve` library, which is part of the Python distribution. This library offers a dictionary-like, file-backed persistent object.

Testing
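A minimal sketch of how `shelve` behaves as a dictionary-like, file-backed store (the keys and values here are illustrative, not from the script):

```python
import os
import shelve
import tempfile

# Open (or create) a file-backed store; keys must be strings,
# and values are pickled to disk transparently.
path = os.path.join(tempfile.mkdtemp(), "records")
with shelve.open(path) as db:
    db["crash_123"] = {"prsn_death_date": "2024-01-01", "prsn_death_time": "03:15"}

# Reopening the shelf reads the data back from disk rather than
# holding the whole dataset in memory.
with shelve.open(path) as db:
    print(db["crash_123"]["prsn_death_time"])  # -> 03:15
```

The trade-off is exactly the one described above: lookups hit the filesystem instead of RAM, so the script runs slower than the all-in-memory version but fits on dev machines with less memory.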
Local testing for this one, but we'll be using real data. Please let me know if I can help with any of this. I've put some effort into trying to make these instructions as complete as possible, but it's always possible that I've overlooked something. It's reasonable to expect this whole process to take an hour or more, but much of it is waiting for long-running programs to complete.
Unmixing the paint
Hard-mode with all the PDFs, which we don't need (now):

```sh
find . -type f -name "*.csv" -exec cp {} /wherever_you_have_your_checkout/atd-vz-data/atd-toolbox/data_model/python/cris_csvs \;
```
Grab the latest production data and populate your local DB
Apply migrations & metadata
We're going to populate the CRIS side of the data model now with John's script
Next, copy the `env_template` to `env`. The "secret" in it is actually more of a configuration item, so you'll find the value is already populated and ready for local testing.

Let's get the docker image fired up
We've done all the prep work to run the program. Optionally, you can snap a backup of the DB right now, so you can come back to this point in your testing if you want.
Here's a good time to put in some artificial VZ data which we can look for later. You can make this or any other testing change you want with your DB client.
Now we're going to unmix the paint
And here we can look at the new data model to see if our changes show up where we expect them.
Additionally, here is a query you can use to inspect the various `_edits` tables to look for patterns in the overridden data. At first blush, it looks good, with the possible exception of how we're overriding a lot of names using data from the existing VZ data. I suspect that the program may need to be enhanced to build out some more nuanced rules around which names we're intending to pull in. You can modify the `select` query clauses (and/or the `join`s) to get a bead on what the "edited" data looks like in tabular format.

Ship list