cityofaustin / atd-data-tech

Austin Transportation Data & Technology Services
17 stars 2 forks source link

Backfill CRIS data #16824

Closed johnclary closed 4 months ago

johnclary commented 5 months ago

Since we fixed the CRIS import, we are missing a few days' worth of crash updates. We need to backfill the DB with crashes processed since ~4/12. The last successful extract we processed was dated 4/13/2024, but we are not 100% sure what date range that extract covered.

johnclary commented 5 months ago

This is blocked until we regain Chia's CRIS access.

@frankhereford is going to prep our bastion host for running the CRIS import backfill. @mddilley will be on point for running the CRIS import from the bastion host once we regain CRIS access.

frankhereford commented 5 months ago

TLDR

Here are some instructions on using the bastion as a host to run the CRIS import in dev mode.

Intent

I got us setup with a shared checkout on the bastion. I've done it so that we don't have to use the root account to use it. This is accomplished using groups, since we're all members of the wheel group. This is one of those invisible, best practices type thing, since we never invoke root's permissions (except for using docker.) Anyway - that's neither here nor there. You can:

Instructions

ssh <username>@rds-bastion.austinmobility.io;
cd /opt/vision-zero/atd-vz-data/atd-etl/cris_import/development_extracts;

# copy in the extract zip archive into this folder
wormhole receive <whatever extract you want>; # or use scp or what-have-you. wormhole is available.

cd ..;
docker compose run cris-import;

special stuff for this install

no root anything except docker✨. I also checked to make sure that Mike had docker permissions. I can add anyone else who wants it, or you can add yourself to the docker group and restart your shell.

I have hard-coded a single configuration in the ETL because we are on the bastion host, and this ETL uses the bastion as a ssh relay. This just configures this process out of some sort of inception-style situation. You can see the change in the un-checked-in changes with git diff.

I have set the environment to be production so it connects on to the right DB instance and correct graphql-engine endpoints.

I've tested that it can connect to the DB and that it can access secrets. I've put the IP in the graphql-engine's load balancer security group that it needs, but I have not made any writing changes to the DB, so I don't have a great way to test that. Heads up if it can't insert things into the conflict resolution system .. I would look here.

cc/ @johnclary @mddilley @chiaberry

mddilley commented 5 months ago

thanks, Frank! I ssh'd to the bastion and tested everything short of transferring files and running the script. All works great! 🙏

mddilley commented 5 months ago

for the full story see this thread

in summary:

Next steps are to run the threaded CR3 downloader (~in progress~ done ✅), run the thread CR3 metadata script (todo), and figure out what to do to catch up on OCR.

mddilley commented 5 months ago

I'm keeping this in progress while we figure out what is going on with the CRIS extract deliveries.

Slack threads: 4/17/2024 4/18/2024 4/19/2024 4/22/2024

In summary, the SFTP delivery pipeline seems broken so I'm requesting extracts using the Download option. We can download these locally and then either copy them to the SFTP folder and trigger the DAG or run them locally. I'm going to go the DAG route so we capture logs.

Cleanup tasks

mddilley commented 4 months ago

@johnclary To close this one out, I wrote up some basic docs about the CRIS request process, and @frankhereford gave some great feedback on how to transfer files to the SFTP when we need to run manually. Here is the thread, and it sounds like you've already been using this method. I like the idea a lot!