CityofToronto / bdit_flashcrow

Working repository for MOVE, a project to modernize transportation data systems at the City of Toronto.
MIT License

Update DAG scheduling to fix ordering issues, reduce staleness #1062

Closed (candu closed 2 years ago)

candu commented 3 years ago

Description

In this Notion page (internal-only), we document a recent review of our Airflow pipelines. As part of that, we discovered that some DAGs are running out of order, and that our DAG scheduling could be improved to reduce data staleness.

This task updates our schedules accordingly. In the process, you'll learn a bit about replicator and etl, as well as about all the data pipelines that we use to manage data in MOVE.

Acceptance Criteria

Additional Notes

See Job Dependencies and Scheduling (internal-only) for more details.

Note that replicator-local-CRASH and replicator_transfer_crash are two different things! The former runs in replicator, the latter runs on etl under Airflow. Make sure you're updating the right one!

Note also that replicator jobs use Windows Task Scheduler / PowerShell syntax for scheduling. See replicator-register-jobs.ps1 for how that works.

Note that as you turn DAGs on, the new schedule should immediately trigger a run. As such, turn them on one at a time, in the order listed on that Notion page, and wait for each run to complete before turning on the next. (This may take a while; it's good to have small tasks that you can complete while waiting!)
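For illustration only, the "one at a time, wait for each" procedure can be sketched as a small Python loop. Everything here is an assumption rather than code from bdit_move_etl: `enable_dags_in_order`, `unpause`, and `is_run_complete` are hypothetical names, and the two callables are stand-ins for however you actually turn a DAG on and check its run state (for example, through the Airflow UI or CLI for your Airflow version).

```python
import time
from typing import Callable, Iterable, List


def enable_dags_in_order(
    dag_ids: Iterable[str],
    unpause: Callable[[str], None],
    is_run_complete: Callable[[str], bool],
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Turn DAGs on one at a time, waiting for each triggered run to finish.

    `unpause` and `is_run_complete` are hypothetical hooks supplied by the
    caller; this function only enforces the ordering described above.
    """
    enabled: List[str] = []
    for dag_id in dag_ids:
        # Turning the DAG on is expected to trigger a run right away.
        unpause(dag_id)
        # Block until that run has finished before touching the next DAG.
        while not is_run_complete(dag_id):
            sleep(poll_seconds)
        enabled.append(dag_id)
    return enabled


if __name__ == "__main__":
    # Placeholder DAG names and fake hooks; the real order comes from the
    # Notion page referenced above.
    finished = enable_dags_in_order(
        ["first_dag", "second_dag"],
        unpause=lambda dag_id: print(f"turning on {dag_id}"),
        is_run_complete=lambda dag_id: True,
        poll_seconds=0,
    )
    print(finished)
```

Passing the hooks in as arguments keeps the ordering logic independent of the Airflow version, and makes it easy to dry-run the sequence with fakes before touching a real environment.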

Finally: note that this issue only covers the AWS dev etl upgrade. You'll also have to deploy those changes to QA, as well as to prod with the next release. (You can, however, mark this closed once the dev upgrade is complete.)

candu commented 3 years ago

Assigning @peter-lyons to this - as mentioned, it's a good task for getting the bdit_move_etl repo set up, for seeing what data pipelines we have right now, and for diving into the replicator part of MOVE.

mkewins commented 2 years ago

@peter-lyons did some work to update the DAG scheduling and dependencies, and introduce DAG task groups for Airflow 2.x (and make these backwards compatible, IIRC!). These changes are pending release, but closing out this issue since it's been migrated to our Notion (internal link only).