cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Reorganize GTFS "catalog" as many/many relationship of transit providers and GTFS datasets #21

Closed e-lo closed 3 years ago

e-lo commented 3 years ago

Whereas some GTFS datasets have multiple transit providers (e.g. MTC's) and some transit providers have multiple datasets (e.g. LACMTA), the GTFS data catalog needs to be formatted and maintained to acknowledge this relationship.

Options

1. List of feeds by transit provider

     LACMTA: [LACMTA Rail link..., LACMTA bus link...]
     BART: [MTC regional feed link...]

2. List of feeds by tuple of transit providers included

     (LACMTA) : [LACMTA Rail link..., LACMTA bus link...]
     (BART, Caltrain, SFMTA, Santa Rosa Citybus) : [MTC regional feed link...]

Other?
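
Both options can be derived from one underlying set of (provider, feed) pairs, so the choice is mostly about which view is easier to maintain. A minimal Python sketch, assuming placeholder feed names (`lacmta_rail`, `lacmta_bus`, `mtc_regional`) rather than real links:

```python
from collections import defaultdict

# Hypothetical (provider, feed) pairs; the feed names are placeholders, not real URLs.
PAIRS = [
    ("LACMTA", "lacmta_rail"),
    ("LACMTA", "lacmta_bus"),
    ("BART", "mtc_regional"),
    ("Caltrain", "mtc_regional"),
    ("SFMTA", "mtc_regional"),
    ("Santa Rosa Citybus", "mtc_regional"),
]


def feeds_by_provider(pairs):
    """Option 1: provider -> list of feeds."""
    out = defaultdict(list)
    for provider, feed in pairs:
        out[provider].append(feed)
    return dict(out)


def feeds_by_provider_tuple(pairs):
    """Option 2: tuple of providers covered by a feed -> feeds with that provider set."""
    providers_by_feed = defaultdict(list)
    for provider, feed in pairs:
        providers_by_feed[feed].append(provider)
    out = defaultdict(list)
    for feed, providers in providers_by_feed.items():
        out[tuple(providers)].append(feed)
    return dict(out)
```

Storing the pairs and generating either view on demand would avoid committing to one layout up front.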

hunterowens commented 3 years ago

Not to plug yml, which is gonna be harder for the Trillium folks to edit, but we could do something like this. Seems like the MTC feed has individual feeds per operator that you can get.

agency_1: 
  url: 
  - item 

agency_2:
  url: 
  - item
  - item2 

agency_mtc:
  url: 
  - their_feed_subset
  - regional_feed

hunterowens commented 3 years ago

(this might be just a restating of 2)

e-lo commented 3 years ago

(Totally fine with yml, it can parse out to same thing)

I think what you wrote is same as 1, which I think is the preferable answer - you'll just want to track feeds you've already validated if they are duped.

Interesting about feed subsets. It might be that we want to do the validation and grading on both the subset and the regional feed b/c the subset will have some agency-specific stuff in there.
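
One way to track duplicated feeds is to invert the catalog so each URL is validated once and the result is shared by every agency listing it. A rough Python sketch, assuming a per-agency `url` list; the agency and feed names here are hypothetical placeholders:

```python
def unique_feed_urls(catalog):
    """Return {url: [agencies listing it]}, with URLs in first-seen order."""
    agencies_by_url = {}
    for agency, entry in catalog.items():
        for url in entry.get("url", []):
            agencies_by_url.setdefault(url, []).append(agency)
    return agencies_by_url


# Hypothetical catalog: two agencies share one regional feed.
catalog = {
    "agency_a": {"url": ["a_subset", "regional_feed"]},
    "agency_b": {"url": ["b_subset", "regional_feed"]},
}

for url, agencies in unique_feed_urls(catalog).items():
    # Validate `url` once here, then attach the result to every agency in `agencies`.
    print(url, agencies)
```

Subset feeds stay as separate entries, so they would still be validated and graded on their own alongside the regional feed.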

hunterowens commented 3 years ago

@antrim brought up yesterday on the call the idea of using DMFR (Distributed Mobility Feed Registry), which is currently used by the new transitland-atlas.

DMFR is fairly new but does capture the NTD ID (at least in the transitland-atlas, as a "tag"). Monitoring changes to the static_current key would allow us to track changes in the static URL, which are part of the CA GTFS guidelines.
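
Monitoring `static_current` could be as simple as diffing two snapshots of the same registry file. A minimal Python sketch, assuming the file has a top-level `feeds` list whose entries carry `id`, `spec`, and `urls.static_current`; the function name is hypothetical:

```python
def static_url_changes(old_doc, new_doc):
    """Compare two parsed registry documents.

    Returns {feed_id: (old_url, new_url)} for every static feed whose
    static_current URL was added, removed, or changed.
    """
    def static_urls(doc):
        return {
            feed["id"]: feed.get("urls", {}).get("static_current")
            for feed in doc.get("feeds", [])
            if feed.get("spec") == "gtfs"
        }

    old, new = static_urls(old_doc), static_urls(new_doc)
    return {
        feed_id: (old.get(feed_id), new.get(feed_id))
        for feed_id in sorted(set(old) | set(new))
        if old.get(feed_id) != new.get(feed_id)
    }
```

The two inputs would typically be `json.load` results from consecutive commits of the same file.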

I think the two most viable options for moving the list away from a Google Sheet are either to capture and create a repository full of DMFR files for each CA agency, or do a simpler yml- or csv-based version tracking the following pieces of information

and store it in Github.

The MTC situation is a little messy in this format, but I think if we stick with roughly an option 1)-style layout, based on @e-lo's thoughts above, it will be most compatible with MobilityDatabase in the future even if it requires a bit of custom code.

Should we store any agency metadata aside from itp_id in the GitHub-based list (i.e., agency_name etc.), or should we just join on itp_id with the Google Sheet as needed?

Here's our friends MST represented in the DMFR format, fwiw.

{
  "onestop_id": "o-9q9-monterey~salinastransit",
  "tags": {
    "us_ntd_id": "90062"
  },
  "name": "Monterey-Salinas Transit",
  "short_name": "MST",
  "associated_feeds": [
    {
      "feed_onestop_id": "f-9q9-monterey~salinastransit",
      "gtfs_agency_id": ""
    },
    {
      "feed_onestop_id": "f-mst~rt",
      "gtfs_agency_id": ""
    }
  ]
}
{
  "$schema": "https://dmfr.transit.land/json-schema/dmfr.schema-v0.3.0.json",
  "feeds": [
    {
      "spec": "gtfs",
      "id": "f-9q9-monterey~salinastransit",
      "urls": {
        "static_current": "https://www.mst.org/google/google_transit.zip"
      },
      "feed_namespace_id": "o-9q9-monterey~salinastransit",
      "license": {
        "url": "https://mst.org/about-mst/developer-resources/"
      }
    },
    {
      "spec": "gtfs-rt",
      "id": "f-mst~rt",
      "urls": {
        "realtime_alerts": "http://206.128.158.191/TMGTFSRealTimeWebService/Alert/Alerts.pb",
        "realtime_trip_updates": "http://206.128.158.191/TMGTFSRealTimeWebService/TripUpdate/TripUpdates.pb",
        "realtime_vehicle_positions": "http://206.128.158.191/TMGTFSRealTimeWebService/Vehicle/VehiclePositions.pb"
      },
      "feed_namespace_id": "o-9q9-monterey~salinastransit",
      "license": {
        "url": "https://mst.org/about-mst/developer-resources/"
      },
      "associated_feeds": [
        "f-9q9-monterey~salinastransit"
      ]
    }
  ],
  "license_spdx_identifier": "CDLA-Permissive-1.0"
}
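
The operator record and the feed records are joined on the feed ID. A short Python sketch of that lookup, using trimmed copies of the two MST objects above; the helper name is hypothetical:

```python
def feeds_for_operator(operator, dmfr):
    """Return the feed records referenced by an operator's associated_feeds."""
    feeds_by_id = {feed["id"]: feed for feed in dmfr["feeds"]}
    return [
        feeds_by_id[ref["feed_onestop_id"]]
        for ref in operator["associated_feeds"]
        if ref["feed_onestop_id"] in feeds_by_id
    ]


# Trimmed copies of the MST records; only the keys needed for the join are kept.
operator = {
    "onestop_id": "o-9q9-monterey~salinastransit",
    "associated_feeds": [
        {"feed_onestop_id": "f-9q9-monterey~salinastransit", "gtfs_agency_id": ""},
        {"feed_onestop_id": "f-mst~rt", "gtfs_agency_id": ""},
    ],
}
dmfr = {
    "feeds": [
        {"spec": "gtfs", "id": "f-9q9-monterey~salinastransit"},
        {"spec": "gtfs-rt", "id": "f-mst~rt"},
    ]
}

for feed in feeds_for_operator(operator, dmfr):
    print(feed["spec"], feed["id"])
```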

hunterowens commented 3 years ago

putting inline the list of agencies where the link under the GTFS column in the sheet either 404s or doesn't return a valid ZIP file.

['Santa Rosa CityBus', 'County Connection', 'Amador Regional Transit System', 'Anaheim Resort Transportation', 'Avalon Transit', 'Banning Pass Transit', 'Beaumont Pass Transit', 'Calaveras Transit', 'Caltrain', 'Camarillo Area Transit', 'Lawndale Beat', 'Clovis Transit System', 'Commerce Municipal Bus Lines', 'Corona Cruiser', 'Redwood Coast Transit', 'East Los Angeles Shuttle', 'Sunshine Bus(South Whittier)', 'the Link Florence-Firestone/Walnut Park', 'the Link-Athens', 'the Link Lennox', 'the Link Willowbrook', 'East Valinda Shuttle', 'Avocado Heights/Bassett/West Valinda Shuttle', 'the Link King Medical Center', 'Duarte Transit', 'Eastern Sierra Transit Authority', 'Mammoth Lakes Transit System', 'El Dorado Transit', 'El Monte Transportation Division', 'Emery Go-Round', 'Fairfield and Suisun Transit', 'GTrans', 'Humboldt Transit Authority', 'Arcata and Mad River Transit System', 'Eureka Transit Service', 'Blue Lake Rancheria', 'Kern Transit', 'Laguna Beach Municipal Transit', 'Tahoe Transportation', 'Tahoe Truckee Area Regional Transportation', 'Lake Transit', 'Madera County Connection', 'Mendocino Transit Authority', 'Merced The Bus', 'Mission Bay TMA', 'Spirit Bus', 'Moorpark City Transit', 'Morongo Basin Transit Authority', 'MVGO', 'Needles Area Transit', 'Norwalk Transit System', 'Desert Roadrunner', 'Petaluma Transit', 'Placer County Transit', 'Lincoln Transit', 'Plumas Transit Systems', 'Palos Verdes Peninsula Transit Authority', 'Redding Area Bus Authority', 'Burney Express', 'Rio Vista Delta Breeze', 'Sage Stage', 'County Express', 'San Francisco Bay Ferry', 'Simi Valley Transit', 'Siskiyou Transit and General Express', 'Sonoma-Marin Area Rail Transit', 'Santa Maria Area Transit', 'SolTrans', 'SolanoExpress', 'Sonoma County Transit', 'Cloverdale Transit', 'South County Transit Link', 'Stanislaus Regional Transit', 'Turlock Transit', 'Ceres Area Transit', 'Tehama Rural Area eXpress', 'Lassen Transit Service Agency', 'Susanville Indian Rancheria Public Transportation Program', 'Thousand Oaks Transit', 'Tideline', 'Trinity Transit', 'Vacaville City Coach', 'Ventura County Transportation Commission', 'Victor Valley Transit', 'Vine Transit', 'WestCAT', 'Yosemite Area Regional Transportation System', 'Yuba-Sutter Transit Authority', 'Porterville Transit', 'Burbank Bus', 'Big Blue Bus', 'Folsom Stage Line', 'Roseville Transit', 'Sacramento Regional Transit District', 'Unitrans', 'Yolobus', 'DASH', 'Commuter Express', 'Marin Transit', 'Morro Bay Transit', 'Santa Ynez Valley Transit', 'San Joaquin Regional Transit District', 'Santa Barbara Metropolitan Transit District', 'Santa Cruz Metropolitan Transit District', 'Capitol Corridor', 'Clean Air Express', 'Gold Coast Transit', 'North County Transit District', 'Monterey-Salinas Transit', 'OmniTrans', 'SamTrans', 'Fresno Area Express', 'MUNI', 'Long Beach Transit', 'Orange County Transportation Authority', 'Irvine Shuttle', 'Golden Gate Bridge Highway and Transportation District', 'Marguerite Shuttle', 'Bay Area Rapid Transit', 'Menlo Park Shuttles', 'Metrolink', 'Modesto Area Express', 'Riverside Transit Agency', 'San Diego Metropolitan Transit System', 'SunLine Transit Agency', 'Yuma County Area Transit', 'Madera Area Express', 'Bear Transit', 'Montebello Bus Lines', 'Carson Circuit', 'Huntington Park Express', 'DowneyLINK', 'Bell Gardens', 'Cudahy Area Rapid Transit', 'Baldwin Park Transit', 'Calabasas Transit System', 'Compton Renaissance Transit Service', 'Rosemead Explorer', 'Bellflower Bus', 'Go West Shuttle', 'Arcadia Transit', 'La Campana', 'Glendora Transportation Division', 'Delano Area Rapid Transit', 'Guadalupe Flyer', 'Arvin Transit', 'Auburn Transit', 'Blossom Express', 'Ridgecrest Transit', 'San Juan Capistrano Free Weekend Trolley', 'Alhambra Community Transit', 'Union City Transit']
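
For reference, a check of this kind can be sketched with the Python standard library alone. This is an illustrative version, not necessarily the script that produced the list above; the function names and the three status labels are assumptions:

```python
import io
import urllib.error
import urllib.request
import zipfile


def is_valid_zip(payload: bytes) -> bool:
    """True if the bytes look like a readable ZIP archive."""
    return zipfile.is_zipfile(io.BytesIO(payload))


def check_feed_url(url: str, timeout: float = 30.0) -> str:
    """Classify a feed URL as 'ok', 'not_zip', or 'unreachable' (includes 404s)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = response.read()
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP errors, DNS failures, timeouts, and malformed URLs all land here.
        return "unreachable"
    return "ok" if is_valid_zip(payload) else "not_zip"
```

Keeping `is_valid_zip` separate from the download makes it possible to distinguish a dead link from a link that returns an HTML page instead of a feed.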

hunterowens commented 3 years ago

Here's my thought on format. Would treat ITP ID and name string as the metadata.

to my knowledge, nobody has multiple GTFS-RT feeds, but could adopt the {list(url)} structure if needed.

agency_1: 
  itp_id: {num}
  name_string: {"some string"}
  gtfs_schedule_url: 
  - item
  gtfs_rt: 
    trip_updates: {url}
    vehicle_locations: {url}
    alerts: {url}

agency_2:
  itp_id: {num}
  name_string: {"some string"}
  gtfs_schedule_url:
  - item
  - item2 

agency_mtc:
  itp_id: {num}
  name_string: {"some string"}
  gtfs_schedule_url:
  - their_feed_subset
  - regional_feed

hunterowens commented 3 years ago

cc @antrim @e-lo

antrim commented 3 years ago

Do we need a way of relating static and real-time feed URLs if there are multiple static URLs? Something like so?

agency_mtc:
  itp_id: {num}
  name_string: {"some string"}
  gtfs_url:
  - their_feed_subset:
      static: {url}
      gtfs_rt:
        trip_updates: {url}
        vehicle_locations: {url}
        alerts: {url}
  - regional_feed

hunterowens commented 3 years ago

I think inside the gtfs_url object, static should be a list of URLs, to handle the case where one agency has many static download URLs.

Given that MTC is a big portion of the state, we can either ignore the regional feed or code it inside a second name object, which is the approach above.

Here's what I think an MTC agency should look like:

ac_transit:
  itp_id: 
  name_string: "AC Transit"
  gtfs_schedule_url:
    - https://api.actransit.org/transit/gtfs/download?token=2512B81107A09D2DC44895CDDC650D47
    - http://api.511.org/transit/datafeeds?api_key=[your_key]&operator_id=[AC_TRANSIT_ID] 
  gtfs_rt:
    trip_updates: 
      - http://api.actransit.org/transit/Help/Api/GET-gtfsrt-tripupdates
      - http://api.511.org/transit/tripupdates?api_key=[your_key]&agency=[AC_TRANSIT_ID] 
    # ... (and so on for each of the three GTFS-RT feeds)

Essentially, each URL key should have a list of URL values, I think. Note, those URLs I posted above should be equivalent in content but... we should monitor and find out.
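
If every URL key holds a list, a loader can normalize entries so downstream code never has to branch on "one URL or many". A minimal Python sketch, assuming the key names `gtfs_schedule_url` and `gtfs_rt` from the proposal; the helper names are hypothetical:

```python
def as_url_list(value):
    """Coerce a missing value, a single URL, or a list of URLs into a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def normalize_agency(entry):
    """Return a copy of an agency entry in which every URL key holds a list."""
    normalized = dict(entry)
    normalized["gtfs_schedule_url"] = as_url_list(entry.get("gtfs_schedule_url"))
    normalized["gtfs_rt"] = {
        key: as_url_list(url) for key, url in (entry.get("gtfs_rt") or {}).items()
    }
    return normalized
```

With this in place, an agency with no GTFS-RT feeds simply ends up with an empty `gtfs_rt` mapping.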

I can take a first pass at getting a PR for this ready today b/c I have a pending ask from @mcplanner.

antrim commented 3 years ago

The way I understand it, this would depend on linked_datasets.txt to associate the GTFS (static) and GTFS-realtime feeds. That seems like a potential issue, given that it's not yet officially adopted and widespread use would be a ways out.

This would be used internally by Cal-ITP, yes? Would it ever be published externally? If so, I see an issue storing API keys in the URL. It might be useful to separate out some of the API information. linked_datasets.txt provides inspiration: https://github.com/google/transit/pull/93/files
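
One way to separate out the API information is to split secret-bearing query parameters from the URL before it is stored. A rough Python sketch, assuming the secrets travel as `api_key` or `token` query parameters; that parameter set and the function name are assumptions to adjust per provider:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names of secret-bearing query parameters; adjust per provider.
SECRET_PARAMS = {"api_key", "token"}


def split_secrets(url):
    """Return (url_without_secrets, {param: value}) so secrets can be stored separately."""
    parts = urlsplit(url)
    public, secrets = [], {}
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key in SECRET_PARAMS:
            secrets[key] = value
        else:
            public.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(public))), secrets
```

The public half could go in the published list, with the secrets kept in a private store keyed by agency.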

Also, would it be useful to have a URL with terms/license/API key info?

hunterowens commented 3 years ago

I think they are all linked in that they are in a shared object?

antrim commented 3 years ago

In the above example though, how are https://api.actransit.org/transit/gtfs/download?token=2512B81107A09D2DC44895CDDC650D47 and http://api.actransit.org/transit/Help/Api/GET-gtfsrt-tripupdates explicitly linked?

On Mon, Mar 8, 2021 at 1:01 PM Hunter Owens notifications@github.com wrote:

I think they are all linked in that they are in a shared object?



e-lo commented 3 years ago

The way I understand, this would depend on linked_datasets.txt to associate the GTFS (static) and GTFS-realtime feeds.

Wouldn't an associated transit provider in MobilityDatabase for both the realtime and static be sufficient?

e-lo commented 3 years ago

Essentially, each key URL key should have a list of URL values, I think. Note, those urls I posted above should be equivalent in content but... we should monitor and find out.

In some cases, but not always. Case in point: LACMTA, or providers which contract out part of their service. In any case, we should clarify how we document the spanning of the dataset, noting that MobilityDatabase will be doing the same thing, so we should be consistent if possible. Strawperson for static:

Then a rule for hierarchy in the case of conflict e.g. most preferable sources at top, or last published, etc.
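
A hierarchy rule like this can be made explicit in code so the choice is reproducible. A small Python sketch, assuming each source is a dict with hypothetical `url` and `published` (ISO date string) fields and that list order encodes preference:

```python
def preferred_source(sources, rule="listed_first"):
    """Pick one source under a stated rule; `sources` is an ordered list of dicts."""
    if not sources:
        return None
    if rule == "listed_first":
        # Most preferable sources are assumed to be listed at the top.
        return sources[0]
    if rule == "last_published":
        # ISO 8601 date strings (YYYY-MM-DD) compare correctly as plain strings.
        return max(sources, key=lambda source: source["published"])
    raise ValueError(f"unknown rule: {rule}")
```

Whichever rule is chosen, recording it alongside the list keeps it consistent with how MobilityDatabase ends up resolving the same conflicts.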

hunterowens commented 3 years ago

fyi, first draft of this PR is now live in #23

hunterowens commented 3 years ago

This is complete, or at least, mostly done and can be reopened w/ new sub-issues as we use the new file!