cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

break apart transform_warehouse DAG to better reflect cadence needs #3291

Open charlie-costanzo opened 6 months ago

charlie-costanzo commented 6 months ago

After Littlepay recently adjusted their publishing cadence to better suit our analytics needs, we found that the new publishing time was too late for our transform_warehouse DAG start time, which made the data stale. In #3290, we moved the transform_warehouse DAG start time 4 hours later (from 10:00 to 14:00 UTC) to improve data freshness, but this makes all data transformations happen later in the morning, which is not ideal.

We need to break apart the transform_warehouse DAG so that models that need to be run later in the morning (payments) are run at 14:00, and all of the other models run at the previous time (10:00 UTC).
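A rough sketch of what the split could look like, not a worked-out design. It assumes the payments models can be picked out with a dbt tag (called `payments` here, which is hypothetical), and the second DAG name `transform_warehouse_payments` is made up for illustration. Only the two start times come from this issue. It is plain Python rather than actual DAG code, so how this gets wired into Airflow is left open.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: payments models carry a dbt tag with this name.
PAYMENTS_TAG = "payments"


@dataclass(frozen=True)
class TransformBucket:
    dag_id: str                 # DAG name (the payments one is hypothetical)
    cron: str                   # daily schedule, in UTC
    dbt_args: Tuple[str, ...]   # node selection passed to `dbt run`


BUCKETS = (
    # Everything except payments goes back to the previous 10:00 UTC start.
    TransformBucket(
        dag_id="transform_warehouse",
        cron="0 10 * * *",
        dbt_args=("--exclude", f"tag:{PAYMENTS_TAG}"),
    ),
    # Payments models wait for the later Littlepay publish and run at 14:00 UTC.
    TransformBucket(
        dag_id="transform_warehouse_payments",
        cron="0 14 * * *",
        dbt_args=("--select", f"tag:{PAYMENTS_TAG}"),
    ),
)


def dbt_command(bucket: TransformBucket) -> List[str]:
    """Build the dbt invocation for one bucket."""
    return ["dbt", "run", *bucket.dbt_args]


def start_hour_utc(bucket: TransformBucket) -> int:
    """Read the hour field out of the bucket's five-field cron string."""
    _minute, hour, *_rest = bucket.cron.split()
    return int(hour)


if __name__ == "__main__":
    for bucket in BUCKETS:
        print(bucket.dag_id, f"{start_hour_utc(bucket):02d}:00 UTC", " ".join(dbt_command(bucket)))
```

Using `--exclude` for the 10:00 run and `--select` for the 14:00 run keeps the two buckets complementary, so a model is never run twice or dropped. Any non-payments models that depend on payments models would need to be checked separately.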

A notes doc for an initial meeting about this effort is available here, but the project was deprioritized in favor of handoff tasks following that first meeting.

vevetron commented 5 months ago

Larger Job overview: Break up jobs into buckets:

Harder

Breaking up the tasks is not a big deal -> create new transform models. Modify the daily DAG so that the run, on the whole, turns into multiple transform tasks -> then sequential rather than simultaneous.