Open lauriemerrell opened 1 year ago
self-assigned per convo today w/ @tiffanychu90, going to prioritize this since it's been blocking us for a while
What distinct products are listed for GTFS generation? Do those products all seem reasonable? (Perhaps can run them by folks within Cal-ITP)
It appears that there are multiple component values relevant in some way to either the publishing of GTFS or GTFS-RT. Right now the table is set to manually code them in order to support a simplified presentation of schedule and RT vendors on the reports site. For these components, the listed products appear reasonable. For the site, I propose reporting and organizing on the vendor level rather than the product level, with the exception of in-house and pending product values.
CASE
-- manually coded to relevance to GTFS Schedule and RT
WHEN component_name IN ('GTFS generation', 'Scheduling (Fixed-route)')
THEN 'schedule_vendors'
WHEN component_name IN ('Real-time info', 'Arrival predictions',
'GTFS-rt vehicles/trips', 'GTFS Alerts Publication')
THEN 'rt_vendors'
END AS reports_vendor_type,
185 total organizations in March reports (date_start = '2023-02-01')
8 are missing at least one schedule vendor (or pending/in-house) 89 are missing at least one RT vendor (or pending/in-house)
In other words, we're missing schedule vendor info for about 4% of total orgs, and RT vendor info for 26% of orgs that have RT.
SQL for organizations with neither RT nor Schedule vendors:
SELECT * FROM
`cal-itp-data-infra-staging.eric_mart_gtfs_quality.idx_monthly_reports_site`
LEFT JOIN
`cal-itp-data-infra-staging.eric_mart_gtfs_quality.fct_monthly_reports_site_organization_gtfs_vendors`
USING (organization_name, organization_source_record_id, organization_itp_id, date_start)
WHERE date_start = '2023-02-01'
--GROUP BY rt_vendors IS NULL
AND (ARRAY_LENGTH(rt_vendors) = 0 OR rt_vendors IS NULL)
AND (ARRAY_LENGTH(schedule_vendors) = 0 OR schedule_vendors IS NULL)
LIMIT 1000
There appears to be no product/component data in Airtable for these. Since in-house is a value entered for some other service components, we shouldn't assume that its done in-house. The long term path would be for the Transit Data Quality team to connect with these agencies and determine how to correctly track.
How many many-to-many relationships are there between services & GTFS datasets, i.e., cases where one service is split across 2 datasets or one dataset contains multiple services? How many of these latter cases exist where the services in question are associated with different organizations?
Proposing we put this on hold for now since service definitions may be changing soon. This situation does appear rare, and by keeping vendor information fairly general and aggregated at the organization level I think we can move forward.
Also see warehouse work at https://github.com/cal-itp/data-infra/pull/2374
More thoughts to come on how to operationalize.
https://github.com/cal-itp/reports/issues/181 contemplates using vendor information to filter the monthly index on the reports site -- I think if an organization has multiple vendors then filtering the index to any combination of those vendors should show a link to their report
Before tackling #181 and #193, we need to investigate the state of the Airtable data model with respect to the fields that we want to incorporate into the reports site to get a sense for data completeness, up-to-dateness, and any risks of fanout (many-to-many relationships).
Specifically, we want to investigate:
dim_services
,dim_service_components
, anddim_components
tables)Submitting this ticket in the reports repo because it is directly associated with planned feature & data development on the reports site, but these questions are probably of broader interest as well.