charlie-costanzo closed this 2 months ago
It's probably okay, but I'm not sure I understand why we're building operators and Airflow DAGs for some of these data entities, such as `2022_reporting/2022_capital_expenses_by_mode.yml`. Is there an expectation that the data will change? Shouldn't it just be a one-off data pull?
Edit: Actually, I think they keep updating these 2022 datasets for some reason.
Description
This PR introduces new NTD (National Transit Database) data sources published by the federal Department of Transportation, available both through their data API and as XLSX file downloads.
Two Airflow operators were necessary for this work because, although many NTD datasets are now available from the NTD API, some important datasets are still available only in XLSX format (monthly ridership, certain annual reports).
To accomplish this, two new Airflow operators (`scrape_ntd_api.py` and `scrape_ntd_xlsx.py`), two associated DAGs (`sync_ntd_data_api` and `sync_ntd_data_xlsx`), and a selection of NTD table endpoints as DAG tasks were created. Both operators use the `PartitionedGCSArtifact` class pattern used elsewhere in the pipeline.

NTD data sources scraped and stored in this PR include:
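As a rough illustration of the partitioned-artifact pattern mentioned above (the class, bucket, and field names below are hypothetical stand-ins, not the pipeline's actual `PartitionedGCSArtifact` implementation):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PartitionedArtifact:
    """Hypothetical sketch of a date-partitioned GCS artifact.

    The real pipeline's PartitionedGCSArtifact differs; this only
    illustrates the dt=<execution date> partitioning convention, so
    each monthly re-download lands in its own queryable partition.
    """

    bucket: str          # hypothetical bucket name below
    table: str
    execution_date: date
    filename: str

    @property
    def path(self) -> str:
        return (
            f"gs://{self.bucket}/{self.table}/"
            f"dt={self.execution_date.isoformat()}/{self.filename}"
        )


artifact = PartitionedArtifact(
    bucket="example-ntd-bucket",
    table="2022_capital_expenses_by_mode",
    execution_date=date(2024, 7, 1),
    filename="results.jsonl",
)
print(artifact.path)
# → gs://example-ntd-bucket/2022_capital_expenses_by_mode/dt=2024-07-01/results.jsonl
```

Because each scrape writes under a new `dt=` prefix rather than overwriting a single object, retroactive updates to a dataset are preserved as separate snapshots.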
We discovered that these tables are retroactively updated at a regular cadence, including annual reports for previous years, so a schedule has been configured to download from these endpoints on the first day of the month, every month.
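In Airflow terms, a first-of-the-month cadence is just a cron schedule on the DAG (`0 0 1 * *`). The helper below is only a sketch of what that cadence means, not code from this PR:

```python
from datetime import date

# Airflow cron schedule for "midnight on the first day of every month":
#   schedule_interval = "0 0 1 * *"
# This helper demonstrates the resulting run dates.


def next_monthly_run(today: date) -> date:
    """Return the next first-of-month date on or after `today`."""
    if today.day == 1:
        return today
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


print(next_monthly_run(date(2024, 7, 15)))  # → 2024-08-01
print(next_monthly_run(date(2024, 12, 2)))  # → 2025-01-01
```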
Resolves #3402, part of Epic #3401
Type of change
How has this been tested?
Successful local Airflow runs, publishing to GCS buckets
Post-merge follow-ups