MobilityData / mobility-database-catalogs

The Catalogs of Sources of the Mobility Database.
Apache License 2.0

Finalize JSON schemas #36

Closed emmambd closed 2 years ago

emmambd commented 2 years ago

What problem are we trying to solve? We want users to be able to easily search for data by location and provider.

How will we know when this is done?

| Field Name | Required from users | Definition |
| --- | --- | --- |
| MDB Source ID | No - system generated | Unique identifier following the structure: mdbsrc-provider-subdivisionname-countrycode-numericalid. 3 character minimum and 63 character maximum based on Google Cloud Storage. |
| Data Type | Yes | The data format that the source uses, e.g. GTFS, GTFS-RT. |
| Country Code | Yes | ISO 3166-1 alpha-2 code designating the country where the system is located. For a list of valid codes see here. |
| Subdivision name | Yes | ISO 3166-2 subdivision name designating the subdivision (e.g. province, state, region) where the system is located. For a list of valid names see here. |
| Municipality | Yes | Primary municipality in which the transit system is located. |
| Provider | Yes | Name of the transit provider. |
| Name | Optional | An optional description of the data source, e.g. to specify if the data source is an aggregate of multiple providers, or which network is represented by the source. |
| Auto-Discovery URL | Yes | URL that automatically opens the source. |
| Latest dataset URL | No - system generated | A stable URL for the latest dataset of a source. |
| License URL | Optional | The transit provider’s license information. |
| Bounding box | No - system generated | This is the bounding box of the data source when it was first added to the catalog. It includes the date and timestamp the bounding box was extracted in UTC. |
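
As a rough illustration of the MDB Source ID structure described in the table, here is a minimal sketch that assembles an ID and checks the 3 to 63 character limit. The function names and the slug rule (lowercase, letters and digits only) are assumptions made for this example; the issue does not specify how the components are normalized.

```python
import re


def slugify(text: str) -> str:
    """Lowercase and keep only letters and digits (assumed rule, not from the issue)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_source_id(provider: str, subdivision_name: str,
                    country_code: str, numerical_id: int) -> str:
    """Assemble mdbsrc-provider-subdivisionname-countrycode-numericalid."""
    parts = ["mdbsrc", slugify(provider), slugify(subdivision_name),
             slugify(country_code), str(numerical_id)]
    source_id = "-".join(parts)
    # Length limits quoted in the table (based on Google Cloud Storage).
    if not 3 <= len(source_id) <= 63:
        raise ValueError(f"ID must be 3 to 63 characters, got {len(source_id)}")
    return source_id


print(build_source_id("AC Transit", "California", "US", 42))
```

Long provider or subdivision names can push the ID past the limit, which is one reason the filename length comes up later in this thread.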
e-lo commented 2 years ago

Considerations for data model (based on our experience in doing this for all of California)

Critical Items

  • Definition of "transit provider" name: we use the legal name of the parent organization...which is often a City/County, a JPA, or an independent transit district. Happy to provide you the list in CA!
  • Enumerated list of "transit providers" to avoid duplication
  • Array of aliases for the transit provider (e.g. "SFMTA", "Muni".....or "LA Metro".....or "AC Transit") to enable searching by common and/or brand names.
  • DataType: should be further broken down from GTFS-RT into the type of GTFS-RT (TripUpdates, etc)
  • "Primary Municipality" is moot or ambiguous for quite a few systems...i.e. what is the primary municipality for Caltrain? Capitol Corridor? Amtrak? Don't require it or make it less ambiguous by having the "headquarters municipality or [census ]designated place"
  • Common template for URI's with API keys, etc.
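
To make the alias idea concrete, here is a small sketch of searching providers by name or alias. The record layout (`name`, `aliases`), the function name, and the sample values are assumptions for illustration only; nothing here is part of the proposed schema.

```python
# Hypothetical provider records; the "name" values are illustrative only.
providers = [
    {"name": "San Francisco Municipal Transportation Agency",
     "aliases": ["SFMTA", "Muni"]},
    {"name": "Alameda-Contra Costa Transit District",
     "aliases": ["AC Transit"]},
]


def find_providers(query: str, providers: list[dict]) -> list[str]:
    """Return names of providers whose name or any alias equals the query."""
    needle = query.casefold()
    matches = []
    for provider in providers:
        candidates = [provider["name"], *provider["aliases"]]
        # Case-insensitive exact match against the name and every alias.
        if any(needle == candidate.casefold() for candidate in candidates):
            matches.append(provider["name"])
    return matches


print(find_providers("muni", providers))
```

Keeping one enumerated record per provider and matching on aliases avoids creating duplicate providers for each brand name.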

Desired items

  • Dataset owner (Organization): who has the rights here?
  • Dataset publisher (Organization + primary individual): who is responsible for web/access issues?
  • Dataset maintainer (Organization + primary individual): who is responsible for data issues?
  • Enumerated list of Services contained in dataset (ideally identified by agency_name but...)
  • Designation of some sort of "priority" for this dataset. i.e. does the AC Transit GTFS Feed take priority over MTC 511 feed?
  • Referenced datasets: in particular for RT...which GTFS static source does it build on?
  • future_url: for validating forthcoming dataset updates (i.e. https://gitlab.com/LACMTA/gtfs_bus/-/blob/future-service/gtfs_bus.zip)

Questions

  • Not sure why stable vs auto-discovery URL would be different? What use case does this satisfy?
  • This is marked as done, but I don't see a PR attached?
emmambd commented 2 years ago

@e-lo Thank you for the in-depth feedback! Let me know if you have any additional questions or concerns based on this response:

Answers

Not sure why stable vs auto-discovery URL would be different? What use case does this satisfy?

You’re correct, they are the same thing. We used auto-discovery URL as a term based on using GBFS’ systems.csv as inspiration. However, upon review it’s clear that discovery isn’t a meaningful term in GTFS and it should be changed. Our plan with this issue is to modify the auto-discovery URL to be direct download URL. The main reason we don’t plan to use stable URL is that oftentimes the URL provided from data publishers isn’t in fact stable (time bound, not an official source, etc).

This is marked as done, but I don't see a PR attached?

Originally there wasn’t a PR because the prototype PR was extremely large and attached to another issue. This has been fixed.

Critical Items

Primary municipality: We’re going to make both municipality and subdivision optional based on this feedback and after looking more closely at different source examples, and seeing there are many aggregate feeds and larger transit systems for which neither apply.

DataType: Currently within each GTFS Realtime source, there are three fields for Trip Updates, Service Alerts, and Vehicle Positions. This ensures that the user can get all the information they want under one GTFS Realtime file. Previously, it was complex for one to search and collect GTFS Realtime information using Transitfeeds. Now everything will be under one single file.

Could you elaborate on what use case needs a common template for URIs with API keys? Is this to standardize how we indicate an API key is needed within a URL?

Transit provider definition, enumerated list, and array of aliases: Thanks for sharing a suggested structure for how we could provide a catalog of organizations and services in the working document. I’ve added a feature in the roadmap for expanding the catalogs that the community can vote on. (I’ve used some of the user stories you suggested for the search interface here since I believe this feature would address similar needs).

Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since

  • Agency can be defined as the agency name provided in agency.txt, and is discoverable within GTFS
  • Additional information needs to be provided alongside transit provider in order to enhance searchability (like aliases, or its associated services and brands, which likely make sense in a separate catalog)

For the purposes of launching V1 on the 23rd, we’ll be making this modification to agency in the schema. We’ll consider making the other transit provider related changes as part of V2 in Q2, and ask about the community’s priorities during our technical presentation on April 13th.

Desired Items

A few clarifying questions/comments based on our internal team review:

  • Who would define the priority of the dataset in the case of different data publishers and aggregate sources?
  • Including primary individual may be difficult to keep up-to-date, but we could include generic contact information for the corresponding organization.
  • Referenced datasets are included in the current GTFS Realtime schema as “static reference”.
e-lo commented 2 years ago

DataType: Currently within each GTFS Realtime source, there are three fields for Trip Updates, Service Alerts, and Vehicle Positions. This ensures that the user can get all the information they want under one GTFS Realtime file. Previously, it was complex for one to search and collect GTFS Realtime information using Transitfeeds. Now everything will be under one single file.

I agree that the user should be able to get all the realtime feeds with a single query, but that doesn't necessitate that the data model do that as well. There are providers which have several realtime feeds of the same type (particularly for contracted service) and some which have duplicative or enhanced feeds – so the desired user experience will still require the API (or whatever level of obfuscation) to query and assemble feeds from multiple entries.

Since the URLs are each optional, it effectively allows you to have an entry for each RT data type...but I do want to make sure the user experience isn't overly dependent on this structure.
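
A minimal sketch of the "assemble feeds from multiple entries" step, assuming one catalog entry per realtime URL. The entry fields, URLs, and function name are hypothetical and only illustrate that a per-URL data model can still yield a single grouped answer per provider.

```python
from collections import defaultdict

# Hypothetical catalog entries, one per realtime URL.
entries = [
    {"provider": "Provider X", "data_type": "trip_updates",
     "url": "https://example.com/x/main/trip-updates.pb"},
    {"provider": "Provider X", "data_type": "trip_updates",
     "url": "https://example.com/x/contracted/trip-updates.pb"},
    {"provider": "Provider X", "data_type": "service_alerts",
     "url": "https://example.com/x/main/alerts.pb"},
]


def feeds_by_provider(entries: list[dict]) -> dict:
    """Group URLs by provider, then by realtime data type."""
    grouped = defaultdict(lambda: defaultdict(list))
    for entry in entries:
        grouped[entry["provider"]][entry["data_type"]].append(entry["url"])
    # Convert nested defaultdicts to plain dicts for the caller.
    return {provider: dict(types) for provider, types in grouped.items()}


result = feeds_by_provider(entries)
print(len(result["Provider X"]["trip_updates"]))
```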

e-lo commented 2 years ago

Could you elaborate on what use case needs a common template for URIs with API keys? Is this to standardize how we indicate an API key is needed within a URL?

Exactly, e.g. https://www.myawesometransit.gov/gtfs/gtfs-alerts?key={API_KEY}
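
One way a consumer might handle such a template is sketched below. It assumes placeholders are uppercase names in curly braces, as in the example URL; the regular expression and function names are illustrative, not an agreed convention.

```python
import re

# Assumed placeholder shape: {UPPER_CASE_NAME}
PLACEHOLDER = re.compile(r"\{([A-Z_][A-Z0-9_]*)\}")


def required_values(url_template: str) -> list[str]:
    """List the placeholder names a consumer must supply."""
    return PLACEHOLDER.findall(url_template)


def fill_template(url_template: str, values: dict[str, str]) -> str:
    """Substitute every placeholder, failing loudly if a value is missing."""
    missing = [name for name in required_values(url_template) if name not in values]
    if missing:
        raise KeyError(f"missing values for: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], url_template)


template = "https://www.myawesometransit.gov/gtfs/gtfs-alerts?key={API_KEY}"
print(required_values(template))
print(fill_template(template, {"API_KEY": "abc123"}))
```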

e-lo commented 2 years ago

Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since

  • Agency can be defined as the agency name provided in agency.txt, and is discoverable within GTFS

I think that overlap with agency.txt is actually a good reason not to use agency. An agency as defined in agency.txt is really a brand, not an actual agency. This is confusing enough to explain and correct to transit providers and GTFS users that I would really love for us not to misuse the term yet again in a different (but also not accurate) context.

e-lo commented 2 years ago

Upon further consideration, we think it makes sense to use “agency” as a starting point rather than transit provider, since

  • Additional information needs to be provided alongside transit provider in order to enhance searchability (like aliases, or its associated services and brands, which likely make sense in a separate catalog)

You could alternatively use a "common name" as the "transit provider name" and then in a future catalog of transit providers add in "official organization name".

e-lo commented 2 years ago

Who would define the priority of the dataset in the case of different data publishers and aggregate sources?

This is really a question about an overall governance model – but ideally any changes to this priority in a PR would flag staff at the transit provider to review and disagree with.

e-lo commented 2 years ago

Referenced datasets are included in the current GTFS Realtime schema as “static reference”.

🙌

emmambd commented 2 years ago

@e-lo:

  • Re: GTFS Realtime: so you're suggesting a structure where each realtime link is its own source entry, and each gtfs realtime data type can be added to data type?
  • Thanks for clarifying. This week we'll add a standard API key structure for APIs that do authorization in their URL.
  • Based on this feedback, we'll proceed with transit provider rather than agency and use the "common name" definition until we add the providers catalog.
e-lo commented 2 years ago

Re: GTFS Realtime: so you're suggesting a structure where each realtime link is its own source entry, and each gtfs realtime data type can be added to data type?

I think this has the maximum flexibility and searchability. Again, happy to hear reasoning for alternatives that meet the needs/situations described above.

I mainly just don't want to oversimplify the data model and then have a bunch of technical debt if/when it needs to be updated based on cases we already know exist in some significant number...

emmambd commented 2 years ago

@e-lo Thanks for clarifying. Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse. Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it? We're considering making the static reference field a list, and the URLs nested so multiple URLs of the same type could be included in the same source file.

Since these discussions are still ongoing, and we agree that we want to avoid considerable technical debt, we plan to delay importing the realtime data until later in Q2. The release plan will be updated to reflect this change.

e-lo commented 2 years ago

We're considering making the static reference field a list, and the URLs nested so multiple URLs of the same type could be included in the same source file.

If this happens, the list should be an object such that it can be individually queried/filtered for the following use cases (which could end up adding complexity depending on how implemented):

  1. Multiple feed versions, only one works with Realtime

Transit Provider X has three published GTFS datasets, but only one "syncs" with their realtime feeds. In order to link their realtime feed with the correct static feed, I need to reference a specific schedule dataset.

There are lots of examples here (69 in our current data for California), including all Bay Area datasets, Victor Valley, Tulare, Thousand Oaks, Simi Valley, Santa Ynez, Ojai, Sacramento, Gold Coast, Glenn, etc.

In many (not all) of these cases this happens because a CAD/AVL/Realtime service provider needs to update the static dataset in order to publish one which is consistent with realtime. This most often occurs when there is a combination of services with the same realtime feed and naming conflicts need to be avoided, such as in the Bay Area and Ventura County, which produce a single set of combined realtime feeds.

  2. Multiple datasets from different published URLs come together to produce a complete schedule.

Transit providers sometimes need to publish different services in separate GTFS Schedule datasets for various reasons such as contracted service agreements (e.g. Visalia and V-Line) and feed size (e.g. LA Metro). In other cases, providing certain variables in a query to a GTFS Schedule API will yield different services (e.g. Bay Area 511). In all cases, we likely need to know which combination of feeds produces the entirety of service.

  3. Feeds which contain services represented in other feeds

In some cases transit providers publish data on supporting services which aren't directly managed by them and overlap with the transit provider's GTFS Schedule dataset which provides them. As a data user, I need to understand which parts of the dataset contain duplicates of service which should be screened out, deferring to a separate feed for the information that the transit provider which manages that service wants me to see.

For example, the Amtrak Schedule Dataset (whoot!) contains many supportive services such as the Altamont Corridor Express (ACE). ACE is also included in Bay Area 511 among other feeds. As a data consumer, I'd like to know which GTFS Schedule Dataset I should consume ACE information from, from the transit provider's perspective (if possible)

e-lo commented 2 years ago

Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse.

From a transparency perspective, it would be great to have the use cases and discussion from the transit consumers here in this issue.

(Note: I'm definitely not doubting that there are very valid and important issues...I'd just prefer if we could all discuss in one place that is traceable)

e-lo commented 2 years ago

Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it?

Offhand I can think of the following cases:

  1. Republishing combined feeds: I think the biggest example is in the Bay Area, where most transit providers have their own realtime API but also have the combined Bay Area 511 API.

  2. Adding coverage with a low-cost option: Another example would be for transit providers that we are trying to add route coverage for with our GTFS Realtime as a Service (GRaaS) product where not all of their current services have Realtime capabilities - so there is a separate URL for them. Some of these are small (Desert Roadrunner, Tulare, etc.) but others are big and important (Clean Air Express, Amtrak Thruway).

  3. Contracted service: One of the biggest issues we've seen is that a portion of a transit provider's service may be operated by a contractor and is often not integrated into the transit provider's business processes/technology in the same way. We are working on daylighting the realtime data for all of these services, but the easiest path to this is thru a separate publishing process. Some examples (not live yet, but we hope will be eventually) include:

    • Visalia Transit // V-LINE
    • Almost all of the demand-responsive microtransit
barbeau commented 2 years ago

Could you provide 2-3 examples of this use case with multiple realtime feeds of the same type so we could consider how to model it?

HART is one such case here in Tampa, FL. They have a single GTFS dataset that covers their bus and streetcar. Bus originally had RT data (OrbCAD system, and we at USF built a GTFS Realtime exporter for it), but streetcar did not (streetcar was a separately managed system). RT was added to streetcar via Swiftly.

So the resulting system has a single GTFS, but two GTFS Realtime endpoints for TripUpdates.

To model these cases, my preference would be to see something like this (URLs aren't real here, as I'm not sure if the streetcar URL is public):

{
    "mdb_source_id": 100,
    "data_type": "gtfs_rt",
    "provider": "Hillsborough Area Regional Transit",
    "name": "Hillsborough Area Regional Transit GTFS Realtime",
    "static_reference": 120,
    "real_time_feeds": {
        "vehicle_positions": [
            {
                "url": "https://www.hart.org/bus/bus-vehicle-positions.pb",
                "license": "LicenseA",
                "authentication_info_url": "https://www.hart.org/developer_info",
                "authentication_type": 2,
                "api_key_parameter_name": "key"
            },
            {
                "url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-vehicle-positions.pb",
                "license": "LicenseB",
                "authentication_info_url": "https://www.hart.org/developer_info",
                "authentication_type": 1
            }
        ],
        "trip_updates": [
            {
                "url": "https://www.hart.org/bus/bus-trip-updates.pb",
                "license": "LicenseA",
                "authentication_info_url": "https://www.hart.org/developer_info",
                "authentication_type": 2,
                "api_key_parameter_name": "key"
            },
            {
                "url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-trip-updates.pb",
                "license": "LicenseB",
                "authentication_info_url": "https://www.hart.org/developer_info",
                "authentication_type": 1
            }
        ]
    }
}

This allows us to model many attributes for each endpoint as needed, but still keeps the endpoints logically grouped under the same provider.
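
As a sketch of how a consumer might read a record shaped like the one above, the snippet below parses a trimmed copy of it and lists every endpoint with its feed type. The function name is an assumption, and the record is shortened to the fields the example needs.

```python
import json

# Trimmed copy of the proposed record; URLs are the non-real ones from the example.
source = json.loads("""
{
    "mdb_source_id": 100,
    "static_reference": 120,
    "real_time_feeds": {
        "trip_updates": [
            {"url": "https://www.hart.org/bus/bus-trip-updates.pb",
             "authentication_type": 2},
            {"url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-trip-updates.pb",
             "authentication_type": 1}
        ]
    }
}
""")


def list_endpoints(source: dict) -> list[tuple[str, str]]:
    """Flatten real_time_feeds into (feed_type, url) pairs."""
    return [
        (feed_type, endpoint["url"])
        for feed_type, endpoints in source["real_time_feeds"].items()
        for endpoint in endpoints
    ]


for feed_type, url in list_endpoints(source):
    print(feed_type, url)
```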

The authentication_type, authentication_info_url, and api_key_parameter_name parameters are taken from this discussion of extending GTFS with links to RT feeds: https://github.com/google/transit/pull/93

Note the API key structure in the streetcar URL. This will be harder to model in a directory than a simple URL parameter because it's integrated into the URL itself, which is why I've assigned an "authentication_type": 1 (ad-hoc) based on the current definitions in https://github.com/google/transit/pull/93. We could try to model this with a placeholder value that could be defined, which the consumer could replace with the actual API key.

Something like:

  • authentication_type 3 = A placeholder text value is provided within the URL and named in the field api_key_url_placeholder_name. Consumers should replace that placeholder text with the actual API key.
  • api_key_url_placeholder_name = A text value that appears in the url field that should be replaced by the consumer with the actual API key. Required if authentication_type is 3.

{
    "url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-trip-updates.pb",
    "license": "LicenseB",
    "authentication_info_url": "https://www.hart.org/developer_info",
    "authentication_type": 3,
    "api_key_url_placeholder_name": "API_KEY"
}
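
A sketch of how a consumer could turn an endpoint entry plus an API key into a usable URL under this proposal. It follows the numbering used in this comment only: type 2 is treated as "API key passed as a URL parameter named by api_key_parameter_name" (an assumption based on the examples above), and type 3 is the placeholder proposal. The function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def authorized_url(endpoint: dict, api_key: str) -> str:
    """Build the request URL for an endpoint entry (types 2 and 3 only)."""
    auth_type = endpoint["authentication_type"]
    url = endpoint["url"]
    if auth_type == 2:
        # Assumed meaning: append the key as a query parameter.
        parts = urlsplit(url)
        query = parse_qsl(parts.query) + [(endpoint["api_key_parameter_name"], api_key)]
        return urlunsplit(parts._replace(query=urlencode(query)))
    if auth_type == 3:
        # Proposed meaning: swap the named placeholder text for the key.
        placeholder = endpoint["api_key_url_placeholder_name"]
        if placeholder not in url:
            raise ValueError(f"placeholder {placeholder!r} not found in url")
        return url.replace(placeholder, api_key)
    # Type 1 (ad-hoc) cannot be handled generically.
    raise ValueError(f"cannot build URL for authentication_type {auth_type}")


bus = {"url": "https://www.hart.org/bus/bus-trip-updates.pb",
       "authentication_type": 2, "api_key_parameter_name": "key"}
streetcar = {"url": "https://www.hart.org/streetcar/v1/key/API_KEY/streetcar-trip-updates.pb",
             "authentication_type": 3, "api_key_url_placeholder_name": "API_KEY"}
print(authorized_url(bus, "SECRET"))
print(authorized_url(streetcar, "SECRET"))
```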
barbeau commented 2 years ago

Actually, looking back at the GTFS linked datasets proposal, Swiftly commented here asking for another authentication_type: https://github.com/google/transit/pull/93#issuecomment-792891386

For authentication_type, could we also add the following?

  • 3: The authentication requires an HTTP header, which should be passed as the value of the header api_key_parameter_name in the HTTP request.

We (Swiftly) generally prefer for consumers of real-time feeds to use this instead of a URL parameter to help protect the value of the API key.

Not sure if the streetcar URL format is an older or newer API key format for them since that comment.

e-lo commented 2 years ago

I would generally prefer handlebar-like syntax with expected values.

{
    "url": "https://www.hart.org/streetcar/v1/key/{API_KEY}/streetcar-trip-updates.pb",
    "license": "LicenseB",
    "authentication_info_url": "https://www.hart.org/developer_info",
    "authentication_type": 3
}
emmambd commented 2 years ago

Over the past week, we've heard some concern from a consumer perspective with the realtime feed information for one provider living in multiple sources, making the info more difficult to search and parse.

From a transparency perspective, it would be great to have the use cases and discussion from the transit consumers here in this issue.

(Note: I'm definitely not doubting that there are very valid and important issues...I'd just prefer if we could all discuss in one place that is traceable)

@barbeau was who we were discussing this with previously so the relevant use cases so far have been mentioned now.

Thanks to both of you for the above use cases and suggested approach going forward. I'm going to share this with the MobilityData team internally over the next few weeks after our quarterly planning process and get back to you with any relevant changes and how it'll accommodate the use cases you've provided. Let me know if you have any questions or concerns.

e-lo commented 2 years ago

Enumerated list of "transit providers" to avoid duplication

This issue still hasn't been resolved in the JSON schema. There are many important feeds with multiple transit providers.

e-lo commented 2 years ago

Looking at the (very) lengthy filenames that are now in the catalog, I'm wondering about the use of subdivision name as opposed to subdivision code - both of which are defined in the ISO table. We use country code not name, why not be consistent?

emmambd commented 2 years ago

This issue still hasn't been resolved in the JSON schema. There are many important feeds with multiple transit providers.

Agreed, this issue hasn’t been resolved. Until we provide a catalog of organizations and providers, it’s unclear on our side how we could best achieve this enumerated list. Is there a lighter weight solution you're envisioning?

Looking at the (very) lengthy filenames that are now in the catalog, I'm wondering about the use of subdivision name as opposed to subdivision code - both of which are defined in the ISO table. We use country code not name, why not be consistent?

The original rationale behind this was around ease of entry and search - we didn’t want to require users to input the subdivision code name or search for it in instances where it isn't commonly used. However, it would make sense for us to alter the implementation of the file name at the bare minimum so they’re less lengthy (issue added here).

evansiroky commented 2 years ago

I agree with @e-lo that these complex use cases of multiple RT feeds referring to one Schedule feed (and vice versa) have not been fully represented in the current schema. Perhaps one lightweight and interim approach to these challenging use cases is to add a note field that can be a place to explain these situations. I think it does make sense to discuss this no later than when the catalog of organizations and providers item is discussed.

barbeau commented 2 years ago

@evansiroky @e-lo Looking back at https://github.com/MobilityData/mobility-database-catalogs/issues/36#issuecomment-1076734944, I think the one use case I didn't illustrate there is one RT feed to many static feeds - did I miss anything else?

I think the one RT feed to many static feeds could be represented by making the mdb_source_id and static_reference elements arrays instead of single values, like:

{
    "mdb_source_id": [100, 101],
    "data_type": "gtfs_rt",
    "provider": "Hillsborough Area Regional Transit",
    "name": "Hillsborough Area Regional Transit GTFS Realtime",
    "static_reference": [120, 121],
    "real_time_feeds": {    
        ...

Do you know of any cases this doesn't cover, or reasons why this wouldn't work?

evansiroky commented 2 years ago

@barbeau I think this may cover most use cases that are present until the catalog of organizations and providers item is discussed.

maximearmstrong commented 2 years ago

@barbeau I think the schema here you suggested is great.

Here, why would we need a mdb_source_id to be represented by an array instead of a single value? In the example, would it be that mdb_source_id = 100 is related to static_reference = 120 only for instance, but that mdb_source_id = 100 and mdb_source_id = 101 share the same provider?

barbeau commented 2 years ago

You only need the mdb_source_id to be an array if you have the scenario where you need to map a GTFS RT feed to more than one source. So it really depends on your definition of "source". If you don't have this case, then a single value of mdb_source_id would be sufficient to map a GTFS RT feed back to a single source.

I think MTA is a good test for this model: http://web.mta.info/developers/developer-data-terms.html#data

So, for example, if MTA Transit Bus is represented as one source with multiple GTFS static files (Bronx, Brooklyn, Manhattan, Queens, Staten Island), then you could have a single GTFS RT record with a single mdb_source_id but multiple static_reference values to link it back to the static sources.

If you wanted to treat MTA Bronx as its own source, then you'd need an array for mdb_source_id to reference multiple sources from the MTA Transit Bus GTFS RT feed record.

emmambd commented 2 years ago

Since the goal is to make it easier for consumers to see which GTFS schedule sources are tied to a realtime source, we think keeping mdb_source_id as 1 unique value and associating several static_reference values will be sufficient and less confusing.
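
A minimal sketch of that relationship: one realtime record with a single mdb_source_id and a list of static_reference values, resolved against a lookup of schedule sources. The IDs, names, and the in-memory lookup are hypothetical; the catalog itself stores sources as separate files.

```python
# Hypothetical schedule sources keyed by their mdb_source_id.
schedule_sources = {
    120: {"mdb_source_id": 120, "name": "MTA Bus Bronx"},
    121: {"mdb_source_id": 121, "name": "MTA Bus Brooklyn"},
}

# One realtime record pointing at several schedule sources.
realtime_source = {"mdb_source_id": 100, "static_reference": [120, 121]}


def resolve_static_references(realtime: dict, schedules: dict) -> list[dict]:
    """Return the schedule sources a realtime record refers to."""
    references = realtime["static_reference"]
    # Tolerate the earlier single-value form as well as a list.
    if not isinstance(references, list):
        references = [references]
    missing = [ref for ref in references if ref not in schedules]
    if missing:
        raise KeyError(f"unknown static_reference values: {missing}")
    return [schedules[ref] for ref in references]


for schedule in resolve_static_references(realtime_source, schedule_sources):
    print(schedule["name"])
```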

I've opened a new issue specifically focused on realtime changes to track our progress in updating the schema and an associated PR. The only notable changes from @barbeau's original proposal are:

  • Making mdb_source_id 1 value instead of an array
  • Changing license to license_url to make it easier to discover information and match to GTFS schedule schema
  • Changing url to direct_download_url to match GTFS schedule schema and be clearer about URL purpose
  • Adding note field as requested by @evansiroky
  • Including URLs with API key placeholder text provided as a fourth authentication type

Please feel free to take a look and comment on the PR.
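
For anyone holding records in the shape of the original proposal, the two field renames listed above could be applied with something like the sketch below. The function name is hypothetical, and the note field and the fourth authentication type are not covered because their exact placement is defined in the PR rather than in this thread.

```python
# Field renames listed above: old name -> new name.
RENAMES = {"license": "license_url", "url": "direct_download_url"}


def rename_endpoint_fields(endpoint: dict) -> dict:
    """Return a copy of an endpoint entry using the revised field names."""
    return {RENAMES.get(key, key): value for key, value in endpoint.items()}


old_endpoint = {
    "url": "https://www.hart.org/bus/bus-trip-updates.pb",
    "license": "LicenseA",
    "authentication_type": 2,
    "api_key_parameter_name": "key",
}
print(rename_endpoint_fields(old_endpoint))
```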

e-lo commented 2 years ago

...deleted earlier one b/c now I realize that this was in reference to mdb_source_id, not the static one :-)

emmambd commented 2 years ago

The realtime schema has been implemented! I've separated out the remainder of this big conversation into the following outstanding issues:

In order to make the discussion easier to follow in the future, I'm going to close this issue.