cal-itp / reports

GTFS data quality reports for California transit providers
https://reports.calitp.org
GNU Affero General Public License v3.0
7 stars 0 forks source link

Exploratory data analysis for reports site data model: Vendor information & organizations/services/datasets #194

Open lauriemerrell opened 1 year ago

lauriemerrell commented 1 year ago

Before tackling #181 and #193, we need to investigate the state of the Airtable data model with respect to the fields that we want to incorporate into the reports site to get a sense for data completeness, up-to-dateness, and any risks of fanout (many-to-many relationships).

Specifically, we want to investigate:

Submitting this ticket in the reports repo because it is directly associated with planned feature & data development on the reports site, but these questions are probably of broader interest as well.

edasmalchi commented 1 year ago

self-assigned per convo today w/ @tiffanychu90, going to prioritize this since it's been blocking us for a while

edasmalchi commented 1 year ago

What distinct products are listed for GTFS generation? Do those products all seem reasonable? (Perhaps can run them by folks within Cal-ITP)

It appears that there are multiple component values relevant in some way to either the publishing of GTFS or GTFS-RT. Right now the table is set to manually code them in order to support a simplified presentation of schedule and RT vendors on the reports site. For these components, the listed products appear reasonable. For the site, I propose reporting and organizing on the vendor level rather than the product level, with the exception of in-house and pending product values.

        CASE
            -- manually coded to relevance to GTFS Schedule and RT
            WHEN component_name IN ('GTFS generation', 'Scheduling (Fixed-route)')
                THEN 'schedule_vendors'
            WHEN component_name IN ('Real-time info', 'Arrival predictions',
             'GTFS-rt vehicles/trips', 'GTFS Alerts Publication')
                THEN 'rt_vendors'
        END AS reports_vendor_type,

185 total organizations in March reports (date_start = '2023-02-01')

8 are missing at least one schedule vendor (or pending/in-house) 89 are missing at least one RT vendor (or pending/in-house)

In other words, we're missing schedule vendor info for about 4% of total orgs, and RT vendor info for 26% of orgs that have RT.

SQL for organizations with neither RT nor Schedule vendors:

SELECT * FROM
`cal-itp-data-infra-staging.eric_mart_gtfs_quality.idx_monthly_reports_site`
LEFT JOIN
`cal-itp-data-infra-staging.eric_mart_gtfs_quality.fct_monthly_reports_site_organization_gtfs_vendors`
USING (organization_name, organization_source_record_id, organization_itp_id, date_start)
WHERE date_start = '2023-02-01'
--GROUP BY rt_vendors IS NULL
AND (ARRAY_LENGTH(rt_vendors) = 0 OR rt_vendors IS NULL)
AND (ARRAY_LENGTH(schedule_vendors) = 0 OR schedule_vendors IS NULL)
LIMIT 1000
Screenshot 2023-03-10 at 13 09 34

There appears to be no product/component data in Airtable for these. Since in-house is a value entered for some other service components, we shouldn't assume that its done in-house. The long term path would be for the Transit Data Quality team to connect with these agencies and determine how to correctly track.

How many many-to-many relationships are there between services & GTFS datasets, i.e., cases where one service is split across 2 datasets or one dataset contains multiple services? How many of these latter cases exist where the services in question are associated with different organizations?

Proposing we put this on hold for now since service definitions may be changing soon. This situation does appear rare, and by keeping vendor information fairly general and aggregated at the organization level I think we can move forward.

Also see warehouse work at https://github.com/cal-itp/data-infra/pull/2374

More thoughts to come on how to operationalize.

edasmalchi commented 1 year ago

https://github.com/cal-itp/reports/issues/181 contemplates using vendor information to filter the monthly index on the reports site -- I think if an organization has multiple vendors then filtering the index to any combination of those vendors should show a link to their report