cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

airflow: operator and dag/tasks to sync NTD data via DOT API and XLSX #3415

Closed charlie-costanzo closed 2 months ago

charlie-costanzo commented 3 months ago

Description

This PR adds new NTD data sources published by the federal Department of Transportation, both through its data API and as XLSX file downloads.

Two Airflow operators were necessary because, although a large number of NTD datasets are now available from the NTD API, some important datasets are still available only in XLSX format (monthly ridership, certain annual reports).

To accomplish this, this PR adds two new Airflow operators (scrape_ntd_api.py and scrape_ntd_xlsx.py), two associated dags (sync_ntd_data_api and sync_ntd_data_xlsx), and a selection of NTD table endpoints defined as dag tasks.
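As a rough illustration of how one operator per source type can serve many table endpoints, here is a minimal sketch in plain Python. It is not the code in scrape_ntd_api.py or scrape_ntd_xlsx.py: `NtdScrapeTask`, `scrape`, the `EXTENSIONS` mapping, and the placeholder URL are all hypothetical, and the download function is injected so the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumption: the output file extension depends only on the source type.
EXTENSIONS = {"api": "json", "xlsx": "xlsx"}


@dataclass(frozen=True)
class NtdScrapeTask:
    """Hypothetical stand-in for one dag task: one NTD table from one source type."""

    table: str   # e.g. "2022_capital_expenses_by_mode"
    source: str  # "api" or "xlsx"
    url: str     # placeholder only; real endpoints are not reproduced here


def scrape(task: NtdScrapeTask, fetch: Callable[[str], bytes]) -> Tuple[str, bytes]:
    """Download one table with the injected fetch function and name the result."""
    if task.source not in EXTENSIONS:
        raise ValueError(f"unknown source type: {task.source!r}")
    payload = fetch(task.url)
    return f"{task.table}.{EXTENSIONS[task.source]}", payload
```

In this sketch each endpoint is just another `NtdScrapeTask`, so adding a table means adding a task definition rather than new operator code.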

Both operators use the PartitionedGCSArtifact class pattern used elsewhere in the pipeline.
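For readers unfamiliar with that pattern, the general idea is that each extracted file is written under a path that encodes its partition values. The sketch below is a self-contained approximation, not the real PartitionedGCSArtifact class: the class name, field names, the `dt=` partition key, and the example bucket are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PartitionedArtifactSketch:
    """Hypothetical illustration of a date-partitioned GCS artifact."""

    bucket: str    # assumed to include the gs:// prefix
    table: str     # one NTD table per artifact
    filename: str
    dt: date       # assumed partition: the extraction date

    @property
    def path(self) -> str:
        # Hive-style key=value partition segment between table and filename.
        return f"{self.bucket}/{self.table}/dt={self.dt.isoformat()}/{self.filename}"


artifact = PartitionedArtifactSketch(
    bucket="gs://example-bucket",  # hypothetical bucket name
    table="2022_capital_expenses_by_mode",
    filename="data.jsonl.gz",
    dt=date(2024, 8, 1),
)
print(artifact.path)
```

Because each run writes to a new partition, re-downloading a table that has been updated upstream does not overwrite earlier extracts.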

NTD Data Sources scraped and stored in this PR include:

We discovered that these tables, including annual reports for previous years, are updated retroactively at a regular cadence, so the dags are scheduled to download from these endpoints on the first day of every month.
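A first-of-the-month schedule can be sketched as follows. The cron string is an assumption (the PR text specifies only the day of the month, not the time of day), and `next_run_date` is a hypothetical helper that just shows which dates such a schedule lands on.

```python
from datetime import date

# Assumption: midnight on day 1 of every month; only the day is stated in the PR.
MONTHLY_CRON = "0 0 1 * *"


def next_run_date(today: date) -> date:
    """Return the first day of the month strictly after `today`'s month."""
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)
```

For example, a run on any day in December 2024 would be followed by a run on 1 January 2025.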

Resolves #3402, part of Epic #3401

Type of change

How has this been tested?

Successful local Airflow runs, publishing to GCS buckets

Post-merge follow-ups

vevetron commented 2 months ago

It's probably okay, but I'm not entirely sure I understand why we are building operators and Airflow dags for some of these data entities, such as "2022_reporting/2022_capital_expenses_by_mode.yml". Is there an expectation that the data will change? Shouldn't it just be a one-off data pull?

Edit: Actually, I think they keep updating these 2022 datasets for some reason.