MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0
275 stars, 100 forks

Check that GTFS schedule URL ends in ".zip" #1278

Open owades opened 1 year ago

owades commented 1 year ago

Describe the problem

The GTFS Best Practices include the following best practice:

Datasets should be published at a public, permanent URL, including the zip file name. (e.g., www.agency.org/gtfs/gtfs.zip)...

We don't have a check that confirms that the GTFS feed URL contains the file name.

Describe the new validation rule

If the GTFS schedule URL does not end in ".zip", trigger a "missing_zip_file_name" notice.
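A minimal sketch of how the proposed rule could look, written in Python purely for illustration (it is not the validator's actual implementation). The function name `check_zip_file_name` and the shape of the returned notice are assumptions; only the notice code and the "Warning" severity come from this issue. The check looks at the URL path only, so a query string after ".zip" does not trigger the notice.

```python
from urllib.parse import urlsplit


def check_zip_file_name(feed_url: str) -> list[dict]:
    """Return a list of notices for the given GTFS schedule URL.

    Hypothetical helper: the notice structure is an assumption made
    for this sketch, not the validator's real notice format.
    """
    # Only the path component matters; query strings and fragments
    # (e.g. "?v=2") are ignored.
    path = urlsplit(feed_url).path
    if path.lower().endswith(".zip"):
        return []
    return [
        {
            "code": "missing_zip_file_name",
            "severity": "WARNING",
            "url": feed_url,
        }
    ]
```

With the sample dataset above, `check_zip_file_name("https://citycommbus.com/gtfs")` would yield one notice, while a URL such as `https://www.agency.org/gtfs/gtfs.zip` would yield none.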

Sample GTFS datasets

GTFS schedule URL without zip file name: City of Commerce, CA, USA: https://citycommbus.com/gtfs

Severity

Warning

Additional context

No response

isabelle-dr commented 1 year ago

Thank you for opening this, interesting!

One could argue that a URL like https://citycommbus.com/gtfs, which gives a gtfs.zip file when downloaded, "contains the zip file name", just without the ".zip" extension. If we want to validate this Best Practice, maybe we could go further and check whether the name of the downloaded file is included in the URL.
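The two readings of "contains the zip file name" can be made concrete with a small sketch. This is only an illustration in Python; `url_contains_file_name` and its `require_extension` flag are hypothetical names, and the comparison against the last path segment is one possible interpretation among several.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def url_contains_file_name(
    feed_url: str, downloaded_name: str, require_extension: bool = True
) -> bool:
    """Compare the last URL path segment with the downloaded file name.

    Hypothetical helper. With require_extension=True the segment must
    equal the full file name ("gtfs.zip"); with False, the name without
    its extension ("gtfs") is accepted as well.
    """
    path = unquote(urlsplit(feed_url).path)
    last_segment = PurePosixPath(path).name
    if require_extension:
        return last_segment == downloaded_name
    stem = PurePosixPath(downloaded_name).stem
    return last_segment in (downloaded_name, stem)
```

Under the strict reading, https://citycommbus.com/gtfs does not match a downloaded gtfs.zip; under the lenient reading it does.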

Is not having the ".zip" causing problems? Is it breaking something?

isabelle-dr commented 1 year ago

I labeled "help wanted" because I'd like to hear what others think of this validation rule.

owades commented 1 year ago

Thanks @isabelle-dr, I appreciate your input and I am also interested in what others think. My goal here is to capture the guideline accurately, and if I'm misunderstanding what the guideline means we can close out this request.

derhuerst commented 1 year ago

> One could argue that a URL like https://citycommbus.com/gtfs, which gives a gtfs.zip file when downloaded, "contains the zip file name", just without the ".zip" extension.

I agree that this is a possible interpretation of the spec.

Also, I assume the intention behind this is that, when people download the GTFS dataset using their browser on a platform where the file extension determines how a file is handled, the file should be named *.zip. This is the case if the server sends a Content-Disposition: attachment; filename="….zip" header, even if the URL's path doesn't end with .zip.
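As a rough illustration of the header-based check described here, the following Python sketch extracts the file name from a Content-Disposition value using the standard library's `email.message.Message`, which understands the parameter syntax that header uses. The function names `attachment_file_name` and `served_as_zip` are hypothetical, and the sketch assumes the header value has already been obtained from an HTTP response.

```python
from email.message import Message
from typing import Optional


def attachment_file_name(content_disposition: str) -> Optional[str]:
    """Extract the filename parameter from a Content-Disposition value.

    Returns None when the header carries no filename parameter.
    """
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    return msg.get_filename()


def served_as_zip(content_disposition: Optional[str]) -> bool:
    """True if the server suggests a file name ending in ".zip"."""
    if not content_disposition:
        return False
    name = attachment_file_name(content_disposition)
    return name is not None and name.lower().endswith(".zip")
```

A validator taking this route could skip the URL-based notice whenever `served_as_zip(...)` is true, which covers URLs such as https://citycommbus.com/gtfs if the server sends a suitable header.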

The cost of checking the Content-Disposition header is, of course, a lot of added complexity: the GTFS Validator would have to make an HTTP request, and people will (understandably) ask for support for Basic Auth, custom headers (e.g. auth keys), etc.
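To give a sense of that extra surface area, here is a hedged Python sketch of just the request-building part, using only the standard library. `build_header_probe` and its parameters are hypothetical; a real implementation would also need timeouts, redirect handling, and a fallback for servers that do not answer HEAD requests properly.

```python
import base64
import urllib.request
from typing import Mapping, Optional, Tuple


def build_header_probe(
    feed_url: str,
    basic_auth: Optional[Tuple[str, str]] = None,
    extra_headers: Optional[Mapping[str, str]] = None,
) -> urllib.request.Request:
    """Build a HEAD request for inspecting response headers.

    Hypothetical helper: it only constructs the request and sends
    nothing over the network.
    """
    request = urllib.request.Request(feed_url, method="HEAD")
    if basic_auth is not None:
        user, password = basic_auth
        # HTTP Basic Auth: base64 of "user:password".
        token = base64.b64encode(f"{user}:{password}".encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    for name, value in (extra_headers or {}).items():
        request.add_header(name, value)
    return request
```

The request could then be sent with `urllib.request.urlopen(request)`, and the header read via `response.headers.get("Content-Disposition")`. Every optional parameter here is one more thing users would expect the validator to expose and support.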