cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Track / load remaining files included in GTFS schedule spec #277

Closed machow closed 2 years ago

machow commented 3 years ago

I'm not sure if these were missed when I was loading 6 months ago, or were recently added to the docs, but there are 3 files we currently don't load into the warehouse.

It looks like relatively few feeds use these files right now (but analyzing e.g. pathways will likely be increasingly important).


In order to load these, we need to...

Completing this would resolve https://github.com/cal-itp/data-infra/issues/180

machow commented 3 years ago

Feel free to move all the file-tracking tasks into their own DAG (one that depends on past).


Helpful notes from Chris: there are 4 jobs being done here (so there should also be 4 DAGs):

  1. put into cloud storage
  2. understand files (and when they change)
  3. load feeds with changed files into external tables
  4. understand what has changed in each table (e.g. build SCD tables)
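
A rough sketch of jobs 2 and 4, to make the split concrete. This is not the project's actual code: the function and class names are made up for illustration, plain dicts stand in for cloud storage and the warehouse, and comparing content hashes is only one assumed way to tell that a file changed.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


def file_hashes(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each file name in a feed snapshot to a hash of its contents."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in files.items()}


def changed_files(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Names of files that are new or whose hash differs from the previous snapshot."""
    return sorted(name for name, digest in current.items() if previous.get(name) != digest)


@dataclass
class ScdRow:
    """Hypothetical SCD-style row: one version of one file."""
    file_name: str
    digest: str
    valid_from: date
    valid_to: Optional[date] = None  # None means this version is still current


def update_scd(rows: List[ScdRow], current: Dict[str, str], today: date) -> List[ScdRow]:
    """Close rows for files that changed or disappeared; open rows for new versions.

    Mutates and returns `rows`.
    """
    open_rows = {r.file_name: r for r in rows if r.valid_to is None}
    for name, row in open_rows.items():
        if current.get(name) != row.digest:
            row.valid_to = today
    for name, digest in current.items():
        row = open_rows.get(name)
        if row is None or row.digest != digest:
            rows.append(ScdRow(name, digest, today))
    return rows


# Two made-up daily snapshots of one feed.
day1 = file_hashes({"stops.txt": b"a", "trips.txt": b"b"})
day2 = file_hashes({"stops.txt": b"a", "trips.txt": b"c", "pathways.txt": b"p"})

to_reload = changed_files(day1, day2)  # ['pathways.txt', 'trips.txt']

history = update_scd([], day1, date(2021, 1, 1))
history = update_scd(history, day2, date(2021, 1, 2))
```

Only the files in `to_reload` would need to go through job 3, and `history` keeps the old `trips.txt` version with a `valid_to` date instead of overwriting it.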

machow commented 2 years ago

@holly-g this would have gone to Chris C, so will have to wait until Jarvus brings in a data engineer (or I pick it up)

machow commented 2 years ago

To document: