cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

GTFS Data: reconcile URL differences between airtable and agencies.yml #1178

Closed evansiroky closed 2 years ago

evansiroky commented 2 years ago

There are currently a lot of URLs that are present in agencies.yml that are not present in the airtable gtfs datasets table and also many URLs present in airtable that are not present in agencies.yml. This is detailed in the agencies.yml URLs vs airtable URIs dashboard. We should reconcile each of these URLs to make sure these data sources agree with each other.
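As a rough illustration of the comparison the dashboard performs, here is a minimal sketch that assumes both URL lists have already been loaded into plain Python lists. The function name and sample URLs are hypothetical; this is not the pipeline's actual query.

```python
def diff_urls(agencies_yml_urls, airtable_urls):
    """Return (only_in_yml, only_in_airtable) as sorted lists."""
    yml = set(agencies_yml_urls)
    air = set(airtable_urls)
    return sorted(yml - air), sorted(air - yml)


# Hypothetical sample data, not real feed URLs.
yml_urls = ["https://example.com/a/gtfs.zip", "https://example.com/b/gtfs.zip"]
airtable_urls = ["https://example.com/b/gtfs.zip", "https://example.com/c/gtfs.zip"]

only_yml, only_airtable = diff_urls(yml_urls, airtable_urls)
```

An exact string comparison like this flags every cosmetic difference (casing, encoding, templated keys), which is why several of the mismatches discussed below are not real changes.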

e-lo commented 2 years ago

Several of these are differences in how we store the URIs rather than actual changes, e.g. pulling AC Transit from the 511 feed with an API key vs. without one, and using {{MTC_511_API_KEY}} vs. inserting the actual API key.
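One possible way to keep the templated and literal-key forms from showing up as mismatches is to mask the key before comparing. This is only a sketch: the helper name, the example URL, and the key value are hypothetical, and only the `{{MTC_511_API_KEY}}` placeholder comes from this thread.

```python
def mask_api_key(url, api_key, placeholder="{{MTC_511_API_KEY}}"):
    """Replace a literal API key in a URL with the template placeholder."""
    if not api_key:
        # Nothing to mask; return the URL unchanged.
        return url
    return url.replace(api_key, placeholder)


# Hypothetical URL and key, for illustration only.
literal = "https://api.example.org/feed?api_key=abc123&operator_id=AC"
templated = "https://api.example.org/feed?api_key={{MTC_511_API_KEY}}&operator_id=AC"

masked = mask_api_key(literal, "abc123")
```

After masking, the two forms compare equal, so only genuine URL differences would remain in the report.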

e-lo commented 2 years ago

~I'm not sure there are any differences other than these? Looking at the dashboard you made it looks like it is listing "everything" rather than the ones just with differences. I did a few spot checks of various other listings and they all seemed OK in both locations but I'd love to know if/when there are differences that we should care about!~

evansiroky commented 2 years ago

We believe that @e-lo happened to look at this dashboard during the time when the pipeline was still processing data and therefore the data was not correct. This bug is noted in #1064.

e-lo commented 2 years ago

  2. Unicode vs ASCII Substitutions

Another discrepancy is unicode (Airtable) vs ASCII substitutions (in agencies.yml). Can we standardize around unicode, which is a valid URL, or is there a reason we wouldn't want to do that?

Example (these both work and go to same place):

https://www.avta.com/userfiles/files/AVTA%20GTFS.zip

vs

https://www.avta.com/userfiles/files/AVTA_GTFS.zip
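Worth noting: `%20` is the percent-encoding of a space, not of an underscore, so these are two distinct URLs that the server happens to resolve to the same file. A minimal sketch using Python's standard library shows that decoding alone would not make them match:

```python
from urllib.parse import unquote

encoded = "https://www.avta.com/userfiles/files/AVTA%20GTFS.zip"
underscored = "https://www.avta.com/userfiles/files/AVTA_GTFS.zip"

# unquote() turns %20 back into a literal space.
decoded = unquote(encoded)

# The decoded URL contains "AVTA GTFS.zip", which is still not the
# underscore form, so a string comparison keeps reporting a mismatch.
same_after_decoding = decoded == underscored
```

So reconciling this pair means picking one of the two URLs in both places rather than normalizing the encoding.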

e-lo commented 2 years ago

  3. Case discrepancies which both work.

Sometimes URIs are case sensitive. Sometimes not.

I think we can either (a) standardize around lowercase, which would require editing a lot of URIs in both agencies.yml and airtable, or (b) resolve the discrepancies that currently exist and not worry about it until there is a problem (my preference, given that it isn't super important).

Example (both of these work)

http://www.cleanairexpress.com/GTFS/GTFS.zip

vs

http://www.cleanairexpress.com/gtfs/gtfs.zip
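For option (b), a small helper could list the pairs that differ only by case so they can be fixed by hand. This is a sketch with a hypothetical function name; lowercasing is used only to detect candidates, since URL paths can be case sensitive and should not be rewritten blindly.

```python
from collections import defaultdict


def case_only_mismatches(yml_urls, airtable_urls):
    """Return groups of URLs that are identical except for letter case."""
    by_lower = defaultdict(set)
    for url in list(yml_urls) + list(airtable_urls):
        by_lower[url.lower()].add(url)
    # Keep only groups with more than one distinct spelling.
    return sorted(sorted(group) for group in by_lower.values() if len(group) > 1)


mismatches = case_only_mismatches(
    ["http://www.cleanairexpress.com/GTFS/GTFS.zip"],
    ["http://www.cleanairexpress.com/gtfs/gtfs.zip"],
)
```

Each returned group still needs a manual check that both spellings actually resolve before one is chosen.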

e-lo commented 2 years ago

Bay Area Ferries Schedule is same in airtable and agencies.yml - not sure why this is coming up.

e-lo commented 2 years ago

Santa Maria Area Transit: They let us know that they aren't using Trillium anymore (the feed hasn't been updated since October) so we deleted it from Airtable...but haven't identified a new feed source. I believe there were some meetings on the calendar to get more info from them. @o-ram were you part of meeting with them? Otherwise it was GRAAS team. Will investigate if/when olivia responds.

e-lo commented 2 years ago

I just updated SolTrans (keeping the Trillium feed as an archive b/c it is far superior w.r.t. data) and Tuolumne (which pointed to same google place...just using different URI schemes).

Last schedule one remaining is Tulare - which I'll address with overall tcag fix

e-lo commented 2 years ago

RT: Update YoloBus

e-lo commented 2 years ago

For OCTA: Airtable uses the OCTA domain - which is preferable to the swiftly one. I would advocate for changing it in agencies.yml to do this also unless there is a reason we don't have it that way there?

e-lo commented 2 years ago

Updated/added SJRTD URIs

e-lo commented 2 years ago

Outstanding (will address when I get back to my computer later this PM)

evansiroky commented 2 years ago

I'm thinking the best idea around points 1-3 is to modify the URLs in airtable to match those in agencies.yml. I can try to find some time to do that.

o-ram commented 2 years ago

@e-lo regarding Santa Maria, I have been trying to figure out what they are doing with GTFS. I was in a meeting with their relatively new transit manager back in Jan. about their interest in contactless payments and learned

Since then, I downloaded TripShot myself and was able to confirm that they do appear within the App. There also appears to be some mechanism for providing RT info. I haven't been able to locate a feed URL though or get one from Santa Maria.

I'm happy to reach back out to SM and ask. The person I met with was supposed to confirm the TripShot info with their IT team and get back to me anyway and never did, so I have a good reason to ask.

e-lo commented 2 years ago

> I'm happy to reach back out to SM and ask. The person I met with was supposed to confirm the TripShot info with their IT team and get back to me anyway and never did, so I have a good reason to ask.

That would be awesome

e-lo commented 2 years ago

@evansiroky said:

> I'm thinking the best idea around points 1-3 is to modify the URLs in airtable to match those in agencies.yml. I can try to find some time to do that.

I already did 3 (fix casing)

evansiroky commented 2 years ago

> I already did 3 (fix casing)

Cool, thanks. I'm going to get started on some more.

> • I'm meh on changing unicode to ASCII and would prefer to do the reverse - do we have to do that for some reason in our pipeline? When we or an agency advertises a feed, we would do so with an underscore _ not a %20

I also like the _ better, but am more interested in just getting this done quickly, so I'm going to update airtable to have the annoying encoded characters.

> • Re Bay Area feeds, I think I want to keep the regional feed together and aggregate the services b/c it keeps the fares intact. I don't see value in updating to do agency-specific ones in airtable?

I also don't think we need to add each disaggregated service (to airtable).

evansiroky commented 2 years ago

I just went through the remaining URLs in agencies.yml that weren't in airtable.

Here are my responses to some of your comments:

> Bay Area Ferries Schedule is same in airtable and agencies.yml - not sure why this is coming up.

This is probably happening since it occurs twice in agencies.yml and is used as a join condition. One of these feeds should probably be removed from agencies.yml.

> For OCTA: Airtable uses the OCTA domain - which is preferable to the swiftly one. I would advocate for changing it in agencies.yml to do this also unless there is a reason we don't have it that way there?

It seems that the number of RT validation errors differs between their two RT feeds, so maybe they are distinct data sources. Not sure what the analysts are using.

> I just updated SolTrans

Is the trip update URL in airtable correct?
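On the Bay Area Ferries point above: if a URL appears more than once on one side of a join, every matching row on the other side is repeated, which could explain the row appearing on the dashboard even though the URLs agree. A minimal sketch for spotting such duplicates before joining (the function name and sample URLs are hypothetical):

```python
from collections import Counter


def duplicated_urls(urls):
    """Return URLs that occur more than once, sorted for stable output."""
    counts = Counter(urls)
    return sorted(url for url, n in counts.items() if n > 1)


# Hypothetical sample: the ferry URL is listed twice.
sample = [
    "https://example.com/ferries/gtfs.zip",
    "https://example.com/bus/gtfs.zip",
    "https://example.com/ferries/gtfs.zip",
]

dupes = duplicated_urls(sample)
```

Running a check like this against the agencies.yml URLs would show which entries to deduplicate or exclude from the join key.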

evansiroky commented 2 years ago

Elk Grove added to agencies.yml via https://github.com/cal-itp/data-infra/pull/1224.

evansiroky commented 2 years ago

There are still over 20 URLs in airtable that aren't in agencies.yml. @e-lo can you take a look at these URLs? If they should not be ingested in the pipeline, it would be great to have some notes about why. Perhaps there should be some kind of flag about whether certain feeds should not be ingested in the pipeline. That might be useful for when we get around to #775.

e-lo commented 2 years ago

> > I'm meh on changing unicode to ASCII and would prefer to do the reverse - do we have to do that for some reason in our pipeline? When we or an agency advertises a feed, we would do so with an underscore _ not a %20

> I also like the _ better, but am more interested in just getting this done quickly, so I'm going to update airtable to have the annoying encoded characters.

@evansiroky is there a reason the pipeline won't accept the _ right now?

e-lo commented 2 years ago

> Is the trip update URL in airtable correct?

https://soltrans.connexionz.net/rtt/public/utility/gtfsrealtime.aspx/tripupdate2 Strangely, it is.

https://soltrans.connexionz.net/rtt/public/utility/gtfsrealtime.aspx/tripupdate Also downloads data - but it is different. The one with 2 is the one posted on their website. Wondering if it is GTFS Realtime v1?

evansiroky commented 2 years ago

https://soltrans.connexionz.net/rtt/public/utility/gtfsrealtime.aspx/tripupdate2 seems to have more information in it compared to the other one. I went ahead and made #1231 to update.

evansiroky commented 2 years ago

This issue will be ongoing until #775 is resolved. The Agencies.yml vs airtable comparison notes doc will be used to track ongoing issues.