cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License

make clearinghouse load sooner #469

Closed JamesSLogan closed 1 week ago

JamesSLogan commented 1 week ago

Fixes #455 (mostly, at least). This change schedules the s3 load 2 hours before our nightly build, which cuts the daily lag from 3 days to 2. To get down to 1 day, we would need to load data as soon as it becomes available in clearinghouse.

Currently, "yesterday's" data arrives between 4:00 and 5:30 AM the next day. To load it to s3 with the current process, we would want to start at ~6-7 AM to be safe. That would push the nightly build to around 7-8 AM, which is arguably too late to guarantee data availability before people start work, at least on some days. I think it's worth keeping the current implementation, especially since the data relay server should improve this latency in the future.
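For illustration, the timing above can be sketched in a few lines of Python. This is only a rough model, not code from this PR: `build_start`, `ARRIVAL_WINDOW_END`, and `GAP` are hypothetical names, the date is a placeholder, and the 1 hour gap between load and build is an assumption inferred from the "load at ~6-7, build at ~7-8" figures.

```python
from datetime import datetime, timedelta

# Figure from the comment above; the date itself is an arbitrary placeholder.
ARRIVAL_WINDOW_END = datetime(2024, 1, 1, 5, 30)  # "yesterday's" data has landed by ~5:30 AM

# Assumption: 1 hour between starting the s3 load and starting the build,
# inferred from "load at ~6-7, build at ~7-8".
GAP = timedelta(hours=1)


def build_start(load_start: datetime, load_to_build_gap: timedelta) -> datetime:
    """Return when the nightly build would start for a given s3 load start."""
    if load_start < ARRIVAL_WINDOW_END:
        raise ValueError("load would start before the data has finished arriving")
    return load_start + load_to_build_gap


for hour in (6, 7):
    load = datetime(2024, 1, 1, hour, 0)
    print(load.strftime("%H:%M"), "->", build_start(load, GAP).strftime("%H:%M"))
# 06:00 -> 07:00
# 07:00 -> 08:00
```

Under these assumptions, any load start that is safely past 5:30 AM puts the build at 7 AM or later, which is the trade-off described above.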

@mmmiah fyi

JamesSLogan commented 1 week ago

Thanks for the feedback @ian-r-rose , it is now implemented. I believe the following manual actions will need to take place post-merge:

  1. Edit nightly build schedule (I can do this)
  2. Deploy dag to airflow (Ian?)
  3. Re-run dag for "yesterday" since we will be skipping a day when this is merged/deployed (Ian?)

ian-r-rose commented 1 week ago

Yep, agreed on all points! I just need to remember how to deploy the dag :)

ian-r-rose commented 1 week ago

Update here @JamesSLogan: this is now deployed to Airflow, and we are caught up (data from this morning is in Snowflake!)

Also, I made a mistake above: this script doesn't take 60 minutes, it takes 60 seconds. So I think it would be quite safe to schedule the "nightly" job for 6:30 AM.
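A quick back-of-the-envelope check of why the shorter runtime matters, written as a small Python sketch. It uses only the figures from this thread (data arriving by 5:30 AM, a 6:30 AM build, and a 60 minute versus 60 second load). The names and the placeholder date are hypothetical, and it assumes the load must finish before the build starts.

```python
from datetime import datetime, timedelta

DAY = datetime(2024, 1, 1)  # arbitrary placeholder date
ARRIVAL_WINDOW_END = DAY.replace(hour=5, minute=30)  # data has arrived by ~5:30 AM
BUILD_START = DAY.replace(hour=6, minute=30)         # proposed "nightly" job time


def margin_after_arrival(load_duration: timedelta) -> timedelta:
    """Slack between the end of the arrival window and the latest load start
    that still finishes before the build begins."""
    latest_load_start = BUILD_START - load_duration
    return latest_load_start - ARRIVAL_WINDOW_END


print(margin_after_arrival(timedelta(minutes=60)))  # 0:00:00
print(margin_after_arrival(timedelta(seconds=60)))  # 0:59:00
```

With a 60 minute load there would be no slack at all after the arrival window closes, while a 60 second load leaves close to an hour.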

JamesSLogan commented 1 week ago

Awesome, thank you! I've updated the dbt job again to run at 6:30 AM. 🤞 for tomorrow's run.