cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

investigate loading validation (JSON) data into bigquery #59

Closed machow closed 3 years ago

machow commented 3 years ago

Right now the gtfs-validator returns a JSON payload. This notebook shows how it can be turned into a table, with 1 row per notice (see tidy_notice_details table), and either..

A big question though is how BigQuery likes to deal with these kinds of tables. AFAIK there are three potential options:

I'm guessing one of the first two makes most sense.
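
As a rough illustration of the "1 row per notice" idea above, here is a minimal Python sketch. The payload shape (`notices`, `code`, `severity`, `details`) and the `tidy_notices` helper are assumptions made for this example and may not match what the gtfs-validator actually emits. The last step writes newline-delimited JSON, a format BigQuery can load.

```python
import json

# Hypothetical payload shape; the real gtfs-validator output may differ.
payload = json.loads("""
{
  "notices": [
    {"code": "missing_required_field", "severity": "ERROR",
     "details": [{"filename": "stops.txt", "row": 4},
                 {"filename": "stops.txt", "row": 9}]},
    {"code": "unknown_column", "severity": "WARNING",
     "details": [{"filename": "trips.txt", "row": 1}]}
  ]
}
""")


def tidy_notices(payload):
    """Flatten the nested payload into one flat dict per notice detail."""
    rows = []
    for notice in payload.get("notices", []):
        for detail in notice.get("details", []):
            # Repeat the notice-level fields on every row so each row
            # stands on its own in a flat table.
            rows.append({
                "code": notice["code"],
                "severity": notice["severity"],
                **detail,
            })
    return rows


rows = tidy_notices(payload)

# One JSON object per line (newline-delimited JSON).
ndjson = "\n".join(json.dumps(row) for row in rows)
```

Repeating `code` and `severity` on every row keeps the table flat; keeping `details` as a nested or repeated field would be the alternative.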

hunterowens commented 3 years ago

it's also more important to track validator results over time

so I think we might want something like

{each file} -> {check whether the SHA hash of the zip matches the prior day's run} -> {if it's a new GTFS, save the validator output with a run ID}
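
A minimal sketch of that flow in Python, assuming SHA-256 as the hash and a random UUID as the run ID. The `validate_if_changed`, `fake_validator`, and `save` names are hypothetical stand-ins, not part of the actual pipeline.

```python
import hashlib
import uuid


def zip_hash(zip_bytes):
    """Hex SHA-256 digest of the raw zip contents."""
    return hashlib.sha256(zip_bytes).hexdigest()


def validate_if_changed(zip_bytes, prior_hash, run_validator, save_result):
    """Run and save the validator only when the feed differs from the prior run.

    Returns (current_hash, run_id); run_id is None when nothing changed.
    """
    current = zip_hash(zip_bytes)
    if current == prior_hash:
        return current, None
    run_id = uuid.uuid4().hex  # assumption: any unique ID scheme would do
    save_result(run_id, current, run_validator(zip_bytes))
    return current, run_id


# Demo with stand-in callables (hypothetical, for illustration only).
saved = {}


def fake_validator(zip_bytes):
    return {"notices": []}


def save(run_id, feed_hash, result):
    saved[run_id] = {"zip_hash": feed_hash, "result": result}


day1_hash, day1_run = validate_if_changed(b"feed v1", None, fake_validator, save)
day2_hash, day2_run = validate_if_changed(b"feed v1", day1_hash, fake_validator, save)
day3_hash, day3_run = validate_if_changed(b"feed v2", day2_hash, fake_validator, save)
```

In the demo, day 2 reuses the same bytes and is skipped, so only days 1 and 3 produce saved results.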

machow commented 3 years ago

Ah--that's helpful to hear! Maybe a place to start is a validator_latest table, and then a gtfs_schedule_change table with one row per agency x change in zip (that could later be unpacked into more details).

(I can run checks on data changes in notebooks first to get a feel for how often it's changing, etc.)
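
One way the "one row per agency x change in zip" table could be derived from daily hash snapshots is sketched below. The snapshot rows, column names, and the `schedule_changes` helper are made up for illustration, and an agency's first observed feed is not counted as a change here.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical daily snapshots: one row per agency per day.
snapshots = [
    {"agency": "agency_a", "date": "2021-04-01", "zip_hash": "aaa"},
    {"agency": "agency_a", "date": "2021-04-02", "zip_hash": "aaa"},
    {"agency": "agency_a", "date": "2021-04-03", "zip_hash": "bbb"},
    {"agency": "agency_b", "date": "2021-04-01", "zip_hash": "ccc"},
    {"agency": "agency_b", "date": "2021-04-02", "zip_hash": "ddd"},
]


def schedule_changes(snapshots):
    """Return one row per agency per day on which the zip hash changed."""
    # ISO dates sort correctly as strings, so a plain sort is enough.
    ordered = sorted(snapshots, key=itemgetter("agency", "date"))
    changes = []
    for agency, days in groupby(ordered, key=itemgetter("agency")):
        prev_hash = None
        for day in days:
            if prev_hash is not None and day["zip_hash"] != prev_hash:
                changes.append({
                    "agency": agency,
                    "date": day["date"],
                    "old_hash": prev_hash,
                    "new_hash": day["zip_hash"],
                })
            prev_hash = day["zip_hash"]
    return changes


changes = schedule_changes(snapshots)
```

Counting rows per agency in `changes` gives a first read on how often each feed is changing.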

machow commented 3 years ago

Alright, maybe there are two stages to do this in. Will edit in a bit more detail, but wanted to put here for now.

1. low-hanging fruit

Questions answered:

Technical:

2. full table history

Questions answered:

Technical:

hunterowens commented 3 years ago

@machow let's close this?