Closed wklumpen closed 1 year ago
@wklumpen Thanks for flagging this issue! There shouldn't be any difference between the datasets from the two URLs. If there is, this likely indicates a bug in our GitHub Actions workflow. Right now there's a cron job that runs each day to check the `direct_download` URL and update the `latest` URL if the dataset at the `direct_download` URL has changed. Our team will take a look to see why this is happening.
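For readers unfamiliar with this kind of job, here is a minimal sketch of the "update only if changed" decision described above. It assumes the change check is a digest comparison of the raw zip bytes; the thread does not say how the actual workflow detects changes, and the names `sha256_hex` and `needs_update` are hypothetical.

```python
import hashlib
from typing import Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a dataset's raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def needs_update(source_bytes: bytes, stored_digest: Optional[str]) -> bool:
    """True when the freshly downloaded dataset differs from the stored copy.

    Assumption: the stored copy is tracked by its SHA-256 digest, and
    stored_digest is None when nothing has been stored yet.
    """
    return stored_digest is None or sha256_hex(source_bytes) != stored_digest
```

A byte-level digest flags any change, including a re-zipped archive with identical contents, so it errs on the side of refreshing the stored copy.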
Of note: I've come across a few broken `direct_download` links, e.g. 498 Frederick County: https://maps.frederickcountymd.gov/google/google_transit.zip
I'd like to rely on the `latest` URL since it's much more stable, but at the moment I can't, and I imagine that with a broken link the `latest` URL wouldn't be updating anyway.
Should I raise an issue (that would then be linked presumably to a PR) for each broken URL? I don't want to come in and stomp on whatever workflow you have going for this.
@wklumpen I think the broken `direct_download` URLs (stale data) are a separate issue from the out-of-date `latest` URLs (broken pipeline). You're right that you should be able to trust the `latest` URLs, so we'll prioritize looking at this issue ASAP.
For the `direct_download` links, you can open a separate issue listing all the broken URLs you've found, with one linked PR containing the working replacements. (One PR per broken URL would likely take more of your time than it's worth!)
Thanks for checking in and asking about the best approach for this — it's very helpful that you've flagged this!
Sounds good. There will probably be more to come as I go through basically every agency in a number of US urban areas :)
@wklumpen We always welcome the help in our data updating/cleaning efforts! Really appreciate it 🚀
@wklumpen PR #299 has fixed this error. Arlington Transit's `latest` URL now pulls the most recent dataset. Let us know if you encounter this issue again. For now, it's closed.
Thanks! Maybe I'll write a little validation script for the feeds I'm interested in. Stale feeds will cause a problem but that's a separate issue.
> Maybe I'll write a little validation script for the feeds I'm interested in
You mean to check if the datasets at `latest` and `direct_download` match? For the actual repo, we'd probably need some kind of issue/GitHub alert that fires when the Store latest datasets cron job fails 2+ days in a row, so we can troubleshoot it. Maybe @fredericsimard has thoughts about how this could be implemented. I don't expect this issue to recur in the near future though, thanks to #299.
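One possible shape for the "fails 2+ days in a row" check, as a sketch only. It assumes the job runs once per day and that its recent results are already available as a newest-first list of strings such as `"success"` and `"failure"`; fetching those results and opening the issue are not shown, and the function names are hypothetical.

```python
from typing import Sequence


def consecutive_failures(conclusions: Sequence[str]) -> int:
    """Count failed runs at the start of a newest-first list of run results."""
    count = 0
    for conclusion in conclusions:
        if conclusion != "failure":
            break  # the streak ends at the first run that did not fail
        count += 1
    return count


def should_alert(conclusions: Sequence[str], threshold: int = 2) -> bool:
    """True when the most recent `threshold` runs all failed."""
    return consecutive_failures(conclusions) >= threshold
```

With one run per day, `should_alert(["failure", "failure", "success"])` is true, while a single failed run after a success stays below the threshold.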
> Stale feeds will cause a problem but that's a separate issue.
Stale as in there's another URL somewhere with more up-to-date data?
> You mean to check if the datasets at `latest` and `direct_download` match?
Yes, I was going to do this for the feeds I'm using, but internal validation on the MDB end would be even better.
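A user-side check of this kind could look roughly like the sketch below. It assumes both zips have already been downloaded into memory (the download step is omitted) and compares the uncompressed contents file by file, since two archives with the same data can still differ in their raw bytes. The names `member_digests`, `diff_feeds` and `feeds_match` are hypothetical.

```python
import hashlib
import io
import zipfile
from typing import Dict, List


def member_digests(zip_bytes: bytes) -> Dict[str, str]:
    """Map each file inside a GTFS zip to the SHA-256 of its uncompressed contents."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: hashlib.sha256(zf.read(info)).hexdigest()
            for info in zf.infolist()
            if not info.is_dir()
        }


def diff_feeds(latest_zip: bytes, direct_zip: bytes) -> Dict[str, List[str]]:
    """List files that exist on only one side or whose contents differ."""
    latest = member_digests(latest_zip)
    direct = member_digests(direct_zip)
    shared = set(latest) & set(direct)
    return {
        "only_in_latest": sorted(set(latest) - set(direct)),
        "only_in_direct_download": sorted(set(direct) - set(latest)),
        "changed": sorted(name for name in shared if latest[name] != direct[name]),
    }


def feeds_match(latest_zip: bytes, direct_zip: bytes) -> bool:
    """True when both archives contain the same files with the same contents."""
    return not any(diff_feeds(latest_zip, direct_zip).values())
```

The `changed` list points at the specific text files to inspect when the two archives disagree.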
> Stale as in there's another URL somewhere with more up-to-date data?
Yes, correct. I wonder if it's possible to detect when "active" feeds aren't actually being updated (e.g. the latest feed no longer covers the current date).
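As a rough illustration of the "no longer covers the current date" idea, the sketch below reads the last service date from a feed's `calendar.txt` (`end_date`) and `calendar_dates.txt` (rows with `exception_type` 1, i.e. added service) and compares it with a given date. It assumes the text files sit at the root of the zip and are UTF-8 encoded; the helper names are hypothetical, and this is not the GTFS Validator's logic.

```python
import csv
import io
import zipfile
from datetime import date, datetime
from typing import Iterator, Optional


def _rows(zf: zipfile.ZipFile, name: str) -> Iterator[dict]:
    """Yield rows of one GTFS text file as dicts, or nothing if the file is absent."""
    if name not in zf.namelist():
        return
    with zf.open(name) as raw:
        # utf-8-sig drops the byte-order mark that some feeds include
        text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
        yield from csv.DictReader(text)


def _parse(value: str) -> date:
    """Parse a GTFS date in YYYYMMDD form."""
    return datetime.strptime(value.strip(), "%Y%m%d").date()


def last_service_date(zip_bytes: bytes) -> Optional[date]:
    """Latest date on which the feed defines any service, or None if unknown."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        dates = [_parse(r["end_date"]) for r in _rows(zf, "calendar.txt")]
        dates += [
            _parse(r["date"])
            for r in _rows(zf, "calendar_dates.txt")
            if r["exception_type"].strip() == "1"  # 1 = service added on that date
        ]
    return max(dates) if dates else None


def is_expired(zip_bytes: bytes, today: date) -> bool:
    """True when the feed defines no service on or after `today`."""
    last = last_service_date(zip_bytes)
    return last is None or last < today
```

Calling `is_expired(feed_bytes, date.today())` on each "active" feed would flag the ones whose calendars have run out.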
@wklumpen re: detecting date range from actual GTFS calendars, we plan on actually opening the feeds and sharing the dynamic data from datasets for V2 of the API we're developing right now (the logic will be from the GTFS Validator). But that is still 3-6 months away, so if you want to do validation based on the text files themselves, I'd suggest going ahead and doing it yourself.
However, re: internal validation, if you are open to just relying on our cron job pass/fail to verify that `latest` and `direct_download` match, that could be a contribution to the Mobility Database.
Apologies if this is asked and answered but a quick search didn't turn anything up.
I've noticed that the feeds that are archived in `latest` often do not match the datasets that come from the `direct_download` (e.g. fewer `calendar_date` rows, etc.).
An example: Arlington Transit (`mdb_id = 485`) direct download has calendar dates that extend to `20240131` while the `latest` URL has dates only to `20230902`.
Is this simply because the set hasn't been updated on a recent pass?
Some further info/documentation on the differences between the two would be ideal, as I'm struggling to understand them from the current field descriptions.
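For anyone wanting to reproduce a comparison like the Arlington Transit one, a small sketch follows. It assumes the text of `calendar_dates.txt` has already been extracted from each zip, and it reports the row count and date range so the two copies can be set side by side. The names `CalendarDatesSummary`, `summarize_calendar_dates` and `coverage_gap_days` are hypothetical.

```python
import csv
import io
from datetime import date, datetime
from typing import NamedTuple, Optional


class CalendarDatesSummary(NamedTuple):
    rows: int
    first: Optional[date]
    last: Optional[date]


def summarize_calendar_dates(csv_text: str) -> CalendarDatesSummary:
    """Row count plus earliest and latest `date` in a calendar_dates.txt file."""
    dates = [
        datetime.strptime(row["date"].strip(), "%Y%m%d").date()
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    if not dates:
        return CalendarDatesSummary(0, None, None)
    return CalendarDatesSummary(len(dates), min(dates), max(dates))


def coverage_gap_days(
    latest: CalendarDatesSummary, direct: CalendarDatesSummary
) -> Optional[int]:
    """Days by which the direct download extends past the `latest` copy."""
    if latest.last is None or direct.last is None:
        return None
    return (direct.last - latest.last).days
```

A positive gap means the `latest` copy ends earlier than the direct download, which is the symptom described in this issue.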