What problem are we trying to solve?
We want users to be able to easily search for data by location and provider.
How will we know when this is done?
The JSON schema is updated to match the following field breakdowns.
The CSV artifact is updated to match the fields.
Field Name
Required from users
Definition
MDB Source ID
No - system generated
Unique identifier following the structure: mdbsrc-provider-subdivisionname-countrycode-numericalid. 3 character minimum and 63 character maximum based on Google Cloud Storage.
Data Type
Yes
The data format that the source uses, e.g GTFS, GTFS-RT.
Country Code
Yes
ISO 3166-1 alpha-2 code designating the country where the system is located. For a list of valid codes see here.
Subdivision name
Yes
ISO 3166-2 subdivision name designating the subdivision (e.g province, state, region) where the system is located. For a list of valid names see here.
Municipality
Yes
Primary municipality in which the transit system is located.
Provider
Yes
Name of the transit provider.
Name
Optional
An optional description of the data source, e.g to specify if the data source is an aggregate of multiple providers, or which network is represented by the source.
Auto-Discovery URL
Yes
URL that automatically opens the source.
Latest dataset URL
No - system generated
A stable URL for the latest dataset of a source.
License URL
Optional
The transit provider’s license information.
Bounding box
No - system generated
This is the bounding box of the data source when it was first added to the catalog. It includes the date and timestamp the bounding box was extracted in UTC.
Considerations for data model (based on our experience in doing this for all of California)
Critical Items
Definition of "transit provider" name: we use the legal name of the parent organization...which is often a City/County, a JPA, or an independent transit district. Happy to provide you the list in CA!
Enumerated of "transit providers" to avoid duplication
Array of aliases for the transit provider (i.e. "SFMTA", "Muni".....or "LA Metro".....or "AC Transit" to enable searching by common and/or brand names.
DataType: should be further broken down from GTFS-RT into the type of GTFS-RT (TripUpdates, etc)
"Primary Municipality" is moot or ambiguous for quite a few systems...i.e. what is the primary municipality for Caltrain? Capitol Corridor? Amtrak? Don't require it or make it less ambiguous by having the "headquarters municipality or [census ]designated place"
Common template for URI's with API keys, etc.
Desired items
Dataset owner (Organization): who has the rights here?
Dataset publisher (Organization + primary individual): who is responsible for web/access issues?
Dataset maintainer (Organization + primary individual): who is responsible for data issues?
Enumerated list of Services contained in dataset (ideally identified by agency_name but...)
Designation of some sort of "priority" for this dataset. i.e. does the AC Transit GTFS Feed take priority over MTC 511 feed?
Referenced datasets: in particular for RT...which GTFS static source does it build on?
@e-lo Thank you for the in-depth feedback! Let me know if you have any additional questions or concerns based on this response:
Answers
Not sure why stable vs auto-discovery URL would be different? What use case does this satisfy?
You’re correct, they are the same thing. We used auto-discovery URL as a term based on using GBFS’ systems.csv as inspiration. However, upon review it’s clear that discovery isn’t a meaningful term in GTFS and it should be changed. Our plan with this issue is to modify the auto-discovery URL to be direct download URL. The main reason we don’t plan to use stable URL is that oftentimes the URL provided from data publishers isn’t in fact stable (time bound, not an official source, etc).
This is marked as done, but I don't see a PR attached?
Originally there wasn’t a PR because the prototype PR was extremely large and attached to another issue. This has been fixed.
Critical Items
Primary municipality: We’re going to make both municipality and subdivision optional based on this feedback and after looking more closely at different source examples, and seeing there are many aggregate feeds and larger transit systems for which neither apply.
DataType: Currently within each GTFS Realtime source, there are three fields for Trip Updates, Service Alerts, and Vehicle Positions. This ensures that the user can get all the information they want under one GTFS Realtime file. Previously, it was complex for one to search and collect GTFS Realtime information using Transitfeeds. Now everything will be under one single file.
Could you elaborate on what use case needs a common template for URIs with API keys? Is this to standardize how we indicate an API key is needed within a URL?
Transit provider definition, enumerated list, and array of aliases: Thanks for sharing a suggested structure for how we could provide a catalog of organizations and services in the working document. I’ve added a feature in the roadmap for expanding the catalogs that the community can vote on. (I’ve used some of the user stories you suggested for the search interface here since I believe this feature would address similar needs).
Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since
Agency can be defined as the agency name provided in agency.txt, and is discoverable within GTFS
Additional information needs to be provided alongside transit provider in order to enhance searchability (like aliases, or its associated services and brands, which likely make sense in a separate catalog)
For the purposes of launching V1 on the 23rd, we’ll be making this modification to agency in the
schema. We’ll consider making the other transit provider related changes as part of V2 in Q2, and ask about the community’s priorities during our technical presentation on April 13th.
Desired Items
A few clarifying questions/comments based on our internal team review:
Who would define the priority of the dataset in the case of different data publishers and aggregate sources?
Including primary individual may be difficult to keep up-to-date, but we could include generic contact information for the corresponding organization.
Referenced datasets in included in the current GTFS Realtime schema as “static reference”.
DataType: Currently within each GTFS Realtime source, there are three fields for Trip Updates, Service Alerts, and Vehicle Positions. This ensures that the user can get all the information they want under one GTFS Realtime file. Previously, it was complex for one to search and collect GTFS Realtime information using Transitfeeds. Now everything will be under one single file.
I agree that the user experience should be able to get all the realtime feeds with a single query, but that doesn't necessitate the data model do that as well. There are providers which have several realtime feeds of the same type (particularly for contracted service) and some which duplicative or enhanced feeds – so the desired user experience will still require the API (or whatever level of obfuscation) to query and assemble feeds from multiple entries.
Since the URLs are each optional, it effectively allows you to have an entry for each RT data type...but I do want to make sure the user experience isn't overly dependent on this structure.
Could you elaborate on what use case needs a common template for URIs with API keys? Is this to standardize how we indicate an API key is needed within a URL?
Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since
Agency can be defined as the agency name provided in agency.txt, and is discoverable within GTFS
I actually think that overlap with agency.txt is actually a good reason not to use agency. The definition of an agency in agency.txt is actually a brand not an actual agency. This is confusing enough to explain and correct to transit providers and GTFS users that I would really love for us not to misuse the term yet again in a different (but also not accurate) context.
Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since
Additional information needs to be provided alongside transit provider in order to enhance searchability (like aliases, or its associated services and brands, which likely make sense in a separate catalog)
You could alternatively use a "common name" as the "transit provider name" and then in a future catalog of transit providers add in "official organization name".
Who would define the priority of the dataset in the case of different data publishers and aggregate sources?
This is really a question about an overall governance model – but ideally any changes to this priority in a PR would flag staff at the transit provider to review and disagree with.
Re: GTFS Realtime: so you're suggesting a structure where each realtime link is its own source entry, and each gtfs realtime data type can be added to data type?
Thanks for clarifying. This week we'll add a standard API key structure for APIs that do authorization in their URL.
Based on this feedback, we'll proceed with transit provider rather than agency and use the "common name" definition until we add the providers catalog.
Re: GTFS Realtime: so you're suggesting a structure where each realtime link is its own source entry, and each gtfs realtime data type can be added to data type?
I think this has the maximum flexibility and search ability. Again - happy to hear reasoning for alternative that meet the needs/situations described above.
I mainly just don't want to oversimplify the data model and then have a bunch of technical debt if/when it needs to be updated based on cases we already know exist in some significant number...
@e-lo Thanks for clarifying. Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse. Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it? We're considering making the static reference field a list, and the URLs nested so multiple URLs of the same type could be included in the same source file.
Since these discussions are still ongoing, and we agree that we want to avoid considerable technical debt, we plan to delay importing the realtime data until later in Q2. The release plan will be reflected to include this update.
We're considering making the static reference field a list, and the URLs nested so multiple URLs of the same type could be included in the same source file.
If this happens, the list should be an object such that it can be individually queried/filtered for the following use cases (which could end up adding complexity depending on how implemented):
Multiple feed versions, only one works with Realtime
Transit Provider X has three published GTFS datasets, but only one "syncs" with their realtime feeds. In order to link their realtime feed with the correct static feed, I need to reference a specific schedule dataset.
There are lots of examples here (69 in our current data for California), including all Bay Area datasets, Victor Valley, Tulare, Thousand Oaks, Simi Valley, Santa Ynez, Ojai, Sacramento, Gold Coast, Glenn, etc.
In many (not all) of these cases this is caused because there is a CAD/AVL/Realtime service provider which needs to update the static dataset in order to publish a static dataset which is consistent with realtime –this most often occurs when there is a combination of services with the same realtime feed and naming conflicts need to be avoided, such as in the Bay Area and Ventura County which produce a single set of combined realtime feeds.
Multiple datasets from different published URLs come together to produce a complete schedule.
Transit providers sometimes need to publish different services in separate GTFS Schedule datasets for various reasons such as contracted service agreements (e.g.Visalia and V-Line) and feed size (e.g. LA Metro). In other cases, providing certain variables in a query to to a GTFS Schedule API will yield different services (e.g. Bay Area 511). In all cases, we likely need to know which combination of feeds produce the entirety of service.
Feeds which contain services represented in other feeds
In some cases transit providers publish data on supporting services which aren't directly managed by them and overlap with the transit provider's GTFS Schedule dataset which provides them. As a data user, I need to understand which parts of the dataset contain duplicates of service which should be screened out, deferring to a separate feed for the information that the transit provider which manages that service wants me to see.
For example, the Amtrak Schedule Dataset (whoot!) contains many supportive services such as the Altamont Corridor Express (ACE). ACE is also included in Bay Area 511 among other feeds. As a data consumer, I'd like to know which GTFS Schedule Dataset I should consume ACE information from, from the transit provider's perspective (if possible)
Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse.
From a transparency perspective, it would be great to have the use cases and discussion from the transit consumers here in this issue.
(Note: I'm definitely not doubting that there are very valid and important issues...I'd just prefer if we could all discuss in one place that is traceable)
Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it?
Offhand I can think of the following cases:
Republishing combined feeds
I think the biggest example is in the Bay Area, where most transit providers have their own realtime API but also have the combined Bay Area 511 API.
Adding coverage with a low-cost option
Another example would be for transit providers that we are trying to add route coverage for with our GTFS Realtime as a Service (GRaaS) product where not all of their current services have Realtime capabilities - so there is a separate URL for them. Some of these are small (Desert Roadrunner, Tulare, etc.) but others are big and important (Clean Air Express, Amtrak Thruway).
Contracted service
One of the biggest issues we've seen is that a portion of a transit provider's service may be operated by a contractor and is often not integrated into the transit provider's business processes/technology in the same way. We are working on daylighting the realtime data for all of these services, but the easiest path to this is thru a separate publishing process. Some examples (not live yet, but we hope will be eventually) include:
Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it?
HART is one such case here in Tampa, FL. They have a single GTFS dataset that covers their bus and streetcar. Bus originally had RT data (OrbCAD system, and we at USF built a GTFS Realtime exporter for it), but streetcar did not (streetcar was a separately managed system). RT was added to streetcar via Swiftly.
So the resulting system has a single GTFS, but two GTFS Realtime endpoints for TripUpdates.
To model these cases, my preference would be to see something like this (URLs aren't real here, as I'm not sure if the streetcar URL is public):
This allows us to model many attributes for each endpoint as needed, but still keeps the endpoints logically grouped under the same provider.
The authentication_type, authentication_info_url, and api_key_parameter_name parameters are taken from this discussion of extending GTFS with links to RT feeds:
https://github.com/google/transit/pull/93
Note the API key structure in the streetcar URL. This will be harder to model in a directory than a simple URL parameter because it's integrated into the URL itself, which is why I've assigned a "authentication_type": 1 (ad-hoc) based on the current definitions in https://github.com/google/transit/pull/93. We could try to model this with a placeholder value that could be defined, which the consumer could replace with the actual API key.
Something like:
authentication_type3 = A placeholder text value is provided within the URL, provided in the field api_key_placeholder_name. Consumers should replace the text api_key_placeholder_name
api_key_url_placeholder_name = A text value that appears in the url field that should be replaced by the consumer with the actual API key. Required if authentication_type is 3.
Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse.
From a transparency perspective, it would be great to have the use cases and discussion from the transit consumers here in this issue.
(Note: I'm definitely not doubting that there are very valid and important issues...I'd just prefer if we could all discuss in one place that is traceable)
@barbeau was who we were discussing this with previously so the relevant use cases so far have been mentioned now.
Thanks to both of you for the above use cases and suggested approach going forward. I'm going to share this with the MobilityData team internally over the next few weeks after our quarterly planning process and get back to you with any relevant changes and how it'll accommodate the use cases you've provided. Let me know if you have any questions or concerns.
Looking at the (very) lengthy filenames that are now in the catalog, I'm wondering about the use of subdivision name as opposed to subdivision code - both of which are defined in the ISO table. We use country code not name, why not be consistent?
This issue still hasn't been resolved in the JSON schema. There are many important feeds with multiple transit providers.
Agreed, this issue hasn’t been resolved. Until we provide a catalog of organizations and providers, it’s unclear on our side how we could best achieve this enumerated list. Is there a lighter weight solution you're envisioning?
Looking at the (very) lengthy filenames that are now in the catalog, I'm wondering about the use of subdivision name as opposed to subdivision code - both of which are defined in the ISO table. We use country code not name, why not be consistent?
The original rationale behind this was around ease of entry and search - we didn’t want to require users to input the subdivision code name or search for it in instances where it isn't commonly used. However, it would make sense for us to alter the implementation of the file name at the bare minimum so they’re less lengthy (issue added here).
I agree with @e-lo that these complex use cases of multiple RT feeds referring to one Schedule feed (and vice versa) has not been fully represented in the current schema. Perhaps one lightweight and interim approach to these challenging use cases is to add a note field that can be a place to explain these situations. I think it does make sense to discuss this no later than when the catalog of organizations and providers item is discussed.
I think the one RT feed to many static feeds could be represented by making the mdb_source_id and static_reference elements arrays instead of single values, like:
@barbeau I think the schema here you suggested is great.
Here, why would we need a mdb_source_id to be represented by an array instead of a single value? In the example, would it be that mdb_source_id = 100 is related to static_reference = 120 only for instance, but that mdb_source_id = 100 and mdb_source_id = 101 share the same provider?
You only need the mdb_source_id to be an array if you have the scenario where you need to map a GTFS RT feed to more than one source. So it really depends on your definition of "source". If you don't have this case, then a single value of mdb_source_id would be sufficient to map a GTFS RT feed back to a single source.
So, for example, if MTA Transit Bus is represented as one source with multiple GTFS static files (Brox, Brooklyn, Manhattan, Queens, Staten Island), then you could have a single GTFS RT record with a single mdb_source_id but multiple static_reference to link it back to the static sources.
If you wanted to treat MTA Brox as it's own source, then you'd need an array for mdb_source_id to reference multiple sources from the MTA Transit Bus GTFS RT feed record.
Since the goal is to make it easier for consumers to see which GTFS schedule sources are tied to a realtime source, we think keeping mdb_source_id as 1 unique value and associating several static_reference values will be sufficient and less confusing.
What problem are we trying to solve? We want users to be able to easily search for data by location and provider.
How will we know when this is done?
The JSON schema is updated to match the following field breakdowns.
The CSV artifact is updated to match the fields.
Considerations for data model (based on our experience in doing this for all of California)
Critical Items
Desired items
agency_name
but...)future_url
: for validating forthcoming dataset updates (i.e. https://gitlab.com/LACMTA/gtfs_bus/-/blob/future-service/gtfs_bus.zip)Questions
@e-lo Thank you for the in-depth feedback! Let me know if you have any additional questions or concerns based on this response:
Answers
Not sure why stable vs auto-discovery URL would be different? What use case does this satisfy?
You’re correct, they are the same thing. We used auto-discovery URL as a term based on using GBFS’ systems.csv as inspiration. However, upon review it’s clear that discovery isn’t a meaningful term in GTFS and it should be changed. Our plan with this issue is to modify the auto-discovery URL to be direct download URL. The main reason we don’t plan to use stable URL is that oftentimes the URL provided from data publishers isn’t in fact stable (time bound, not an official source, etc).
This is marked as done, but I don't see a PR attached?
Originally there wasn’t a PR because the prototype PR was extremely large and attached to another issue. This has been fixed.
Critical Items
Primary municipality: We’re going to make both municipality and subdivision optional based on this feedback and after looking more closely at different source examples, and seeing there are many aggregate feeds and larger transit systems for which neither apply.
DataType: Currently within each GTFS Realtime source, there are three fields for Trip Updates, Service Alerts, and Vehicle Positions. This ensures that the user can get all the information they want under one GTFS Realtime file. Previously, it was complex for one to search and collect GTFS Realtime information using Transitfeeds. Now everything will be under one single file.
Could you elaborate on what use case needs a common template for URIs with API keys? Is this to standardize how we indicate an API key is needed within a URL?
Transit provider definition, enumerated list, and array of aliases: Thanks for sharing a suggested structure for how we could provide a catalog of organizations and services in the working document. I’ve added a feature in the roadmap for expanding the catalogs that the community can vote on. (I’ve used some of the user stories you suggested for the search interface here since I believe this feature would address similar needs).
Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since
For the purposes of launching V1 on the 23rd, we’ll be making this modification to agency in the schema. We’ll consider making the other transit provider related changes as part of V2 in Q2, and ask about the community’s priorities during our technical presentation on April 13th.
Desired Items
A few clarifying questions/comments based on our internal team review:
I agree that the user experience should be able to get all the realtime feeds with a single query, but that doesn't necessitate the data model do that as well. There are providers which have several realtime feeds of the same type (particularly for contracted service) and some which duplicative or enhanced feeds – so the desired user experience will still require the API (or whatever level of obfuscation) to query and assemble feeds from multiple entries.
Since the URLs are each optional, it effectively allows you to have an entry for each RT data type...but I do want to make sure the user experience isn't overly dependent on this structure.
Exactly. ie.
https://www.myawesometransit.gov/gtfs/gtfs-alerts?key={API_KEY}
I actually think that overlap with
agency.txt
is actually a good reason not to use agency. The definition of an agency inagency.txt
is actually a brand not an actual agency. This is confusing enough to explain and correct to transit providers and GTFS users that I would really love for us not to misuse the term yet again in a different (but also not accurate) context.You could alternatively use a "common name" as the "transit provider name" and then in a future catalog of transit providers add in "official organization name".
This is really a question about an overall governance model – but ideally any changes to this priority in a PR would flag staff at the transit provider to review and disagree with.
🙌
@e-lo:
I think this has the maximum flexibility and search ability. Again - happy to hear reasoning for alternative that meet the needs/situations described above.
I mainly just don't want to oversimplify the data model and then have a bunch of technical debt if/when it needs to be updated based on cases we already know exist in some significant number...
@e-lo Thanks for clarifying. Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse. Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it? We're considering making the static reference field a list, and the URLs nested so multiple URLs of the same type could be included in the same source file.
Since these discussions are still ongoing, and we agree that we want to avoid considerable technical debt, we plan to delay importing the realtime data until later in Q2. The release plan will be reflected to include this update.
If this happens, the list should be an object such that it can be individually queried/filtered for the following use cases (which could end up adding complexity depending on how implemented):
Transit Provider X has three published GTFS datasets, but only one "syncs" with their realtime feeds. In order to link their realtime feed with the correct static feed, I need to reference a specific schedule dataset.
There are lots of examples here (69 in our current data for California), including all Bay Area datasets, Victor Valley, Tulare, Thousand Oaks, Simi Valley, Santa Ynez, Ojai, Sacramento, Gold Coast, Glenn, etc.
In many (not all) of these cases this is caused because there is a CAD/AVL/Realtime service provider which needs to update the static dataset in order to publish a static dataset which is consistent with realtime –this most often occurs when there is a combination of services with the same realtime feed and naming conflicts need to be avoided, such as in the Bay Area and Ventura County which produce a single set of combined realtime feeds.
Transit providers sometimes need to publish different services in separate GTFS Schedule datasets for various reasons such as contracted service agreements (e.g.Visalia and V-Line) and feed size (e.g. LA Metro). In other cases, providing certain variables in a query to to a GTFS Schedule API will yield different services (e.g. Bay Area 511). In all cases, we likely need to know which combination of feeds produce the entirety of service.
In some cases transit providers publish data on supporting services which aren't directly managed by them and overlap with the transit provider's GTFS Schedule dataset which provides them. As a data user, I need to understand which parts of the dataset contain duplicates of service which should be screened out, deferring to a separate feed for the information that the transit provider which manages that service wants me to see.
For example, the Amtrak Schedule Dataset (whoot!) contains many supportive services such as the Altamont Corridor Express (ACE). ACE is also included in Bay Area 511 among other feeds. As a data consumer, I'd like to know which GTFS Schedule Dataset I should consume ACE information from, from the transit provider's perspective (if possible)
From a transparency perspective, it would be great to have the use cases and discussion from the transit consumers here in this issue.
(Note: I'm definitely not doubting that there are very valid and important issues...I'd just prefer if we could all discuss in one place that is traceable)
Offhand I can think of the following cases:
Republishing combined feeds I think the biggest example is in the Bay Area, where most transit providers have their own realtime API but also have the combined Bay Area 511 API.
Adding coverage with a low-cost option Another example would be for transit providers that we are trying to add route coverage for with our GTFS Realtime as a Service (GRaaS) product where not all of their current services have Realtime capabilities - so there is a separate URL for them. Some of these are small (Desert Roadrunner, Tulare, etc.) but others are big and important (Clean Air Express, Amtrak Thruway).
Contracted service One of the biggest issues we've seen is that a portion of a transit provider's service may be operated by a contractor and is often not integrated into the transit provider's business processes/technology in the same way. We are working on daylighting the realtime data for all of these services, but the easiest path to this is thru a separate publishing process. Some examples (not live yet, but we hope will be eventually) include:
HART is one such case here in Tampa, FL. They have a single GTFS dataset that covers their bus and streetcar. Bus originally had RT data (OrbCAD system, and we at USF built a GTFS Realtime exporter for it), but streetcar did not (streetcar was a separately managed system). RT was added to streetcar via Swiftly.
So the resulting system has a single GTFS, but two GTFS Realtime endpoints for TripUpdates.
To model these cases, my preference would be to see something like this (URLs aren't real here, as I'm not sure if the streetcar URL is public):
This allows us to model many attributes for each endpoint as needed, but still keeps the endpoints logically grouped under the same provider.
The
authentication_type
,authentication_info_url
, andapi_key_parameter_name
parameters are taken from this discussion of extending GTFS with links to RT feeds: https://github.com/google/transit/pull/93Note the API key structure in the streetcar URL. This will be harder to model in a directory than a simple URL parameter because it's integrated into the URL itself, which is why I've assigned a
"authentication_type": 1
(ad-hoc) based on the current definitions in https://github.com/google/transit/pull/93. We could try to model this with a placeholder value that could be defined, which the consumer could replace with the actual API key.Something like:
- `authentication_type` = `3`: a placeholder text value is provided within the URL, in the field `api_key_url_placeholder_name`. Consumers should replace that text with the actual API key.
- `api_key_url_placeholder_name`: a text value that appears in the `url` field that should be replaced by the consumer with the actual API key. Required if `authentication_type` is `3`.

Actually, looking back at the GTFS linked datasets proposal, Swiftly commented there asking for another `authentication_type`: https://github.com/google/transit/pull/93#issuecomment-792891386

Not sure if the streetcar URL format is an older or newer API key format for them since that comment.
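As a sketch, a source record under this placeholder proposal might look like the following (the placeholder syntax, field spelling, and URL are hypothetical):

```json
{
  "url": "https://example.com/rt/{API_KEY}/tripupdates.pb",
  "authentication_type": 3,
  "api_key_url_placeholder_name": "{API_KEY}",
  "authentication_info_url": "https://example.com/request-a-key"
}
```

A consumer would substitute every occurrence of `{API_KEY}` in `url` with their own key before polling the feed.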
I would generally prefer handlebar-like syntax with expected values.
@barbeau is who we were discussing this with previously, so the relevant use cases so far have now been mentioned.
Thanks to both of you for the above use cases and suggested approach going forward. I'm going to share this with the MobilityData team internally over the next few weeks after our quarterly planning process and get back to you with any relevant changes and how it'll accommodate the use cases you've provided. Let me know if you have any questions or concerns.
This issue still hasn't been resolved in the JSON schema. There are many important feeds with multiple transit providers.
Looking at the (very) lengthy filenames that are now in the catalog, I'm wondering about the use of `subdivision name` as opposed to `subdivision code`, both of which are defined in the ISO table. We use the country code, not the country name; why not be consistent?

Agreed, this issue hasn't been resolved. Until we provide a catalog of organizations and providers, it's unclear on our side how we could best achieve this enumerated list. Is there a lighter-weight solution you're envisioning?

The original rationale behind this was around ease of entry and search: we didn't want to require users to input the subdivision code or search for it in instances where it isn't commonly used. However, it would make sense for us to alter the implementation of the file name at the bare minimum so that names are less lengthy (issue added here).
I agree with @e-lo that these complex use cases of multiple RT feeds referring to one Schedule feed (and vice versa) have not been fully represented in the current schema. Perhaps one lightweight, interim approach to these challenging use cases is to add a `note` field that can be a place to explain these situations. I think it does make sense to discuss this no later than when the catalog of organizations and providers item is discussed.

@evansiroky @e-lo Looking back at https://github.com/MobilityData/mobility-database-catalogs/issues/36#issuecomment-1076734944, I think the one use case I didn't illustrate there is one RT feed to many static feeds. Did I miss anything else?
I think the one-RT-feed-to-many-static-feeds case could be represented by making the `mdb_source_id` and `static_reference` elements arrays instead of single values.

Do you know of any cases this doesn't cover, or reasons why this wouldn't work?
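The example referenced above didn't survive in this thread; a minimal sketch of the array form described, with illustrative IDs only, might be:

```json
{
  "mdb_source_id": [100, 101],
  "data_type": "gtfs-rt",
  "static_reference": [120, 121]
}
```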
@barbeau I think this may cover most use cases that are present until the catalog of organizations and providers item is discussed.
@barbeau I think the schema you suggested here is great.
Here, why would we need a `mdb_source_id` to be represented by an array instead of a single value? In the example, would it be that `mdb_source_id = 100` is related to `static_reference = 120` only, for instance, but that `mdb_source_id = 100` and `mdb_source_id = 101` share the same provider?

You only need `mdb_source_id` to be an array if you have the scenario where you need to map a GTFS RT feed to more than one source, so it really depends on your definition of "source". If you don't have this case, then a single value of `mdb_source_id` would be sufficient to map a GTFS RT feed back to a single source. I think MTA is a good test for this model: http://web.mta.info/developers/developer-data-terms.html#data
So, for example, if MTA Transit Bus is represented as one source with multiple GTFS static files (Bronx, Brooklyn, Manhattan, Queens, Staten Island), then you could have a single GTFS RT record with a single `mdb_source_id` but multiple `static_reference` values to link it back to the static sources. If you wanted to treat MTA Bronx as its own source, then you'd need an array for `mdb_source_id` to reference multiple sources from the MTA Transit Bus GTFS RT feed record.

Since the goal is to make it easier for consumers to see which GTFS schedule sources are tied to a realtime source, we think keeping `mdb_source_id` as one unique value and associating several `static_reference` values will be sufficient and less confusing. I've opened a new issue specifically focused on realtime changes to track our progress in updating the schema, along with an associated PR. The only notable changes from @barbeau's original proposal are:
- `mdb_source_id`: one value instead of an array
- `license` renamed to `license_url`, to make it easier to discover information and match the GTFS schedule schema
- `url` renamed to `direct_download_url`, to match the GTFS schedule schema and be clearer about the URL's purpose
- a `note` field added, as requested by @evansiroky

Please feel free to take a look and comment on the PR.
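For a concrete picture of the one-RT-to-many-static case under these changes, an MTA-style realtime record might look something like this sketch (IDs, URL, and the `note` text are made up; field spellings follow the changes listed above but are not authoritative):

```json
{
  "mdb_source_id": 200,
  "data_type": "gtfs-rt",
  "static_reference": [120, 121, 122, 123, 124],
  "direct_download_url": "https://example.com/mta/bus/tripupdates.pb",
  "note": "One realtime feed covering five borough GTFS schedule sources."
}
```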
...deleted my earlier comment because I now realize that this was in reference to `mdb_source_id`, not the static one :-)
The realtime schema has been implemented! I've separated out the remainder of this big conversation into the following outstanding issues:
In order to make the discussion easier to follow in the future, I'm going to close this issue.