MobilityData / mobility-database-catalogs

The Catalogs of Sources of the Mobility Database.
Apache License 2.0

[QUESTION] Why are the feeds from a `direct_download` and `latest` not the same? #296

Closed wklumpen closed 1 year ago

wklumpen commented 1 year ago

Apologies if this is asked and answered but a quick search didn't turn anything up.

I've noticed that the feeds that are archived in latest often do not match the datasets that come from the direct_download (e.g. fewer calendar_date rows, etc.).

An example: Arlington Transit (mdb_id = 485) direct download has calendar dates that extend to 20240131 while the latest URL has dates only to 20230902

Is this simply because the set hasn't been updated on a recent pass?

Some further info/documentation on the differences between the two would be ideal, as I'm struggling to understand them from the current field descriptions.

emmambd commented 1 year ago

@wklumpen Thanks for flagging this issue! There shouldn't be any difference between the datasets from the two URLs. If there is, it likely indicates a bug in our GitHub Actions. Right now a cron job runs each day to check the direct_download URL and updates the latest URL if the dataset at the direct_download URL has changed. Our team will take a look to see why this is happening.
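As a rough illustration of the daily check described above (not the project's actual workflow code), the "has the dataset changed?" decision could be made by comparing content hashes. The function names here are hypothetical, and fetching the bytes from either URL is left out.

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a downloaded archive as a hex string."""
    return hashlib.sha256(data).hexdigest()


def should_refresh_latest(direct_download_bytes: bytes,
                          latest_bytes: Optional[bytes]) -> bool:
    """Decide whether the stored 'latest' copy needs replacing.

    Hypothetical helper: True when nothing is stored yet, or when the
    freshly downloaded archive differs byte-for-byte from the stored one.
    """
    if latest_bytes is None:
        return True
    return sha256_hex(direct_download_bytes) != sha256_hex(latest_bytes)
```

A byte-level comparison will also report a change when a producer re-zips identical files, so a real job might prefer to compare the extracted contents instead.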

Example of failed action

wklumpen commented 1 year ago

Of note: I've come across a few broken direct_download links, e.g. 498 Frederick County: https://maps.frederickcountymd.gov/google/google_transit.zip

I'd like to trust the latest URL as it's much more stable, but at the moment it's not, and I imagine with a broken link the stable URL wouldn't be updating anyway.

Should I raise an issue (that would then be linked presumably to a PR) for each broken URL? I don't want to come in and stomp on whatever workflow you have going for this.

emmambd commented 1 year ago

@wklumpen I think the broken direct_download URLs are a separate issue (stale data) from the out-of-date latest URLs (broken pipeline). You're right that you should be able to trust the latest URLs, so we'll prioritize looking at this issue ASAP.

For the direct_download links, you can open a separate issue listing all the broken URLs you've found, with one linked PR containing the working replacements. (One PR for each broken URL will likely take more of your time than it's worth!)

Thanks for checking in and asking about the best approach for this — it's very helpful that you've flagged this!

wklumpen commented 1 year ago

Sounds good. There will probably be more to come as I go through basically every agency in a number of US urban areas :)

emmambd commented 1 year ago

@wklumpen We always welcome the help in our data updating/cleaning efforts! Really appreciate it 🚀

emmambd commented 1 year ago

@wklumpen PR #299 has fixed this error. Arlington Transit's latest URL now pulls the most recent dataset. Let us know if you encounter this issue again - for now, it's closed.

wklumpen commented 1 year ago

Thanks! Maybe I'll write a little validation script for the feeds I'm interested in. Stale feeds will cause a problem but that's a separate issue.

emmambd commented 1 year ago

> Maybe I'll write a little validation script for the feeds I'm interested in

You mean to check if the datasets at latest and direct_download match? For the actual repo, we'd probably need some kind of issue/GitHub alert that fires when the Store latest datasets cron job fails 2+ days in a row, so we can troubleshoot it. Maybe @fredericsimard has thoughts about how this could be implemented. I don't expect this issue to recur in the near future though, thanks to #299.
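To make the "fails 2+ days in a row" condition concrete, here is a minimal sketch. It assumes the daily job outcomes are already available as a list of booleans ordered oldest to newest; how they are obtained (for example from the workflow run history) is not covered, and the function names are hypothetical.

```python
from typing import Sequence


def trailing_failures(outcomes: Sequence[bool]) -> int:
    """Count consecutive failures at the end of the run history.

    `outcomes` holds one entry per daily run, oldest first;
    True means the run passed, False means it failed.
    """
    count = 0
    for passed in reversed(outcomes):
        if passed:
            break
        count += 1
    return count


def should_alert(outcomes: Sequence[bool], threshold: int = 2) -> bool:
    """Return True once the job has failed `threshold` or more days in a row."""
    return trailing_failures(outcomes) >= threshold
```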

> Stale feeds will cause a problem but that's a separate issue.

Stale as in there's another URL somewhere with more up-to-date data?

wklumpen commented 1 year ago

> You mean to check if the datasets at latest and direct_download match?

Yes - I was going to do this for the feeds I'm using but internal validation on the MDB end would be even better
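A user-side check along these lines could compare the per-file row counts of the two archives, since the original symptom was a table with fewer rows. This is only a sketch under simple assumptions: both archives are already downloaded as bytes, the GTFS files sit at the top level of the zip, and no field contains an embedded newline (so line counts approximate row counts). The function names are hypothetical.

```python
import io
import zipfile
from typing import Dict, Optional, Tuple


def row_counts(feed_zip: bytes) -> Dict[str, int]:
    """Map each .txt file in a GTFS archive to its number of data rows."""
    counts: Dict[str, int] = {}
    with zipfile.ZipFile(io.BytesIO(feed_zip)) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            # utf-8-sig strips a leading byte-order mark if one is present.
            lines = archive.read(name).decode("utf-8-sig").splitlines()
            non_blank = [line for line in lines if line.strip()]
            counts[name] = max(len(non_blank) - 1, 0)  # exclude the header row
    return counts


def differing_tables(
    direct_zip: bytes, latest_zip: bytes
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return files whose row counts differ, as (direct_download, latest).

    A count of None means the file is missing from that archive.
    """
    direct, latest = row_counts(direct_zip), row_counts(latest_zip)
    return {
        name: (direct.get(name), latest.get(name))
        for name in sorted(set(direct) | set(latest))
        if direct.get(name) != latest.get(name)
    }
```

An empty result means the row counts agree; it does not prove the contents are identical.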

> Stale as in there's another URL somewhere with more up-to-date data?

Yes, correct. I wonder if there's a possibility to detect if "active" feeds aren't actually being updated (e.g. the latest feed no longer covers the current date)
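One way to sketch that coverage check is to read the last service date straight from the feed's text files. The sketch below relies on the standard GTFS fields `end_date` in calendar.txt and `date` / `exception_type` in calendar_dates.txt (where `exception_type` 1 means service added). It assumes the archive is already in memory and its files sit at the top level of the zip. The function names are hypothetical.

```python
import csv
import io
import zipfile
from datetime import date, datetime
from typing import Dict, List, Optional


def _read_rows(archive: zipfile.ZipFile, name: str) -> List[Dict[str, str]]:
    """Read one GTFS table as a list of dicts; empty list if the file is absent."""
    if name not in archive.namelist():
        return []
    with archive.open(name) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
        return list(csv.DictReader(text))


def _parse_gtfs_date(value: str) -> date:
    """Parse a GTFS date in YYYYMMDD form."""
    return datetime.strptime(value.strip(), "%Y%m%d").date()


def last_service_date(feed_zip: bytes) -> Optional[date]:
    """Return the latest date on which the feed defines service, if any."""
    with zipfile.ZipFile(io.BytesIO(feed_zip)) as archive:
        dates = [_parse_gtfs_date(row["end_date"])
                 for row in _read_rows(archive, "calendar.txt")]
        # Only exception_type 1 (service added) extends coverage.
        dates += [_parse_gtfs_date(row["date"])
                  for row in _read_rows(archive, "calendar_dates.txt")
                  if row["exception_type"].strip() == "1"]
    return max(dates) if dates else None


def covers(feed_zip: bytes, day: date) -> bool:
    """Return True if the feed's service calendar extends to `day` or later."""
    last = last_service_date(feed_zip)
    return last is not None and last >= day
```

Calling `covers(feed_bytes, date.today())` on each "active" feed would flag those whose calendars have already expired.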

emmambd commented 1 year ago

@wklumpen re: detecting date range from actual GTFS calendars, we plan on actually opening the feeds and sharing the dynamic data from datasets for V2 of the API we're developing right now (the logic will be from the GTFS Validator). But that won't be ready for another 3-6 months, so if you want to do validation based on the text files themselves, I'd suggest going ahead and doing it yourself.

However, re: internal validation, if you are open to just relying on our cronjob pass/fail to verify if latest and direct_download match, that could be a contribution to the Mobility Database.