Description

This PR introduces new NTD data sources published by the federal Department of Transportation (DOT), retrieved through the DOT data API as well as through XLSX file downloads.

Two Airflow operators were necessary for this work: although a large number of NTD datasets are now available from the NTD API, some important datasets are still available only in XLSX format (monthly ridership, certain annual reports).

This PR therefore adds two new Airflow operators (scrape_ntd_api.py and scrape_ntd_xlsx.py), two associated DAGs (sync_ntd_data_api and sync_ntd_data_xlsx), and a selection of NTD table endpoints configured as DAG tasks.

Both operators follow the PartitionedGCSArtifact class pattern used elsewhere in the pipeline.
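
As a rough illustration of the partitioned-artifact idea (not the actual PartitionedGCSArtifact API), the sketch below builds a Hive-style object path from partition values. The class name `NtdExtractSketch`, the bucket name, and the partition keys `dt` and `execution_ts` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NtdExtractSketch:
    """Hypothetical stand-in for a partitioned artifact; not the pipeline's real class."""

    bucket: str            # assumed bucket name, e.g. "gs://example-ntd-raw"
    table: str             # logical table name, e.g. "monthly_ridership"
    execution_ts: datetime  # timestamp of the run that produced the file
    filename: str

    @property
    def partitions(self) -> dict:
        # Partition keys are assumptions made for this example.
        return {
            "dt": self.execution_ts.date().isoformat(),
            "execution_ts": self.execution_ts.isoformat(),
        }

    @property
    def path(self) -> str:
        # Hive-style layout: bucket/table/key=value/.../filename
        parts = [f"{key}={value}" for key, value in self.partitions.items()]
        return "/".join([self.bucket, self.table, *parts, self.filename])


extract = NtdExtractSketch(
    bucket="gs://example-ntd-raw",
    table="monthly_ridership",
    execution_ts=datetime(2024, 7, 1, tzinfo=timezone.utc),
    filename="monthly_ridership.xlsx",
)
print(extract.path)
```

Keeping the run timestamp in the path means each monthly download lands in its own partition, so a retroactively updated table does not overwrite an earlier extract.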

NTD Data Sources scraped and stored in this PR include:

2022 Annual Reporting
Monthly Ridership Data
Safety, service, and security related data

We discovered that these tables, including annual reports for previous years, are retroactively updated at a regular cadence, so a schedule has been configured to download from these endpoints on the first day of every month.
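
The monthly cadence can be illustrated with a small sketch. The cron expression `0 0 1 * *` (midnight on the first day of each month) is an assumption about how such a schedule might be written; the value actually configured on the DAGs may differ. The helper `next_monthly_run` is hypothetical and only shows which dates a first-of-month schedule would fire on.

```python
from datetime import date

# Assumed cron expression: minute 0, hour 0, day-of-month 1, every month.
MONTHLY_SCHEDULE = "0 0 1 * *"


def next_monthly_run(today: date) -> date:
    """Return the first day of the month that falls strictly after `today`."""
    if today.month == 12:
        # Roll over into January of the following year.
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


print(next_monthly_run(date(2024, 7, 15)))
print(next_monthly_run(date(2024, 12, 1)))
```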

Resolves #3402, part of Epic #3401

Type of change

[x] New feature

How has this been tested?

Successful local Airflow runs, publishing to GCS buckets.

Post-merge follow-ups

[x] Environment variables need to be added to Composer
[x] DAGs need to be manually triggered
[ ] Observe DAG runs to verify expected behavior
[ ] Create a follow-on ticket for exception handling

cal-itp / data-infra

airflow: operator and dag/tasks to sync NTD data via DOT API and XLSX #3415
