MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0

Investigating issues with parsing Flex feeds #1767

Closed emmambd closed 2 months ago

emmambd commented 6 months ago

What's the problem?

Out of the 4 Flex feeds that we have for testing purposes for #1721, 3 have failed to run through the validator without parsing issues.

I took a look at 51 Flex v2 feeds, including ones that don't conform to the official spec yet, for the sake of trying to better understand this problem. 50% fail to fully parse, and all but 1 of the feeds that failed have an issue with parsing stop_times.txt.

Outstanding questions

This is a critical set of questions to answer before we pursue more work on #1721

qcdyx commented 6 months ago

In the code, we hit UNPARSABLE_ROWS because of validation errors raised while processing the rows of GTFS files. For example, stop_times.txt had errors such as unknown_column and missing_required_field. For agency.txt, there were invalid_timezone and invalid_url errors.

qcdyx commented 6 months ago

Analysis here: https://docs.google.com/document/d/1XLnJm8-M4jZpizr5wdQPETMJFonTA8l3xHrVPVhIHJw/edit

qcdyx commented 5 months ago

Based on the investigation in #1770, it's missing_required_field, invalid_url, and invalid_timezone that lead to validation errors and make a GTFS file unparsable.

emmambd commented 5 months ago

Moving @qcdyx findings from #1770 here:

It's the missing_required_field 'stop_id' that leads to a validation error, which gives stop_times.txt a status of UNPARSABLE_ROWS. I added an 'UNKNOWN_COLUMN' to stop_times.txt of the browncounty-mn-us--flex-v2 dataset and ran the GTFS validator: no UNPARSABLE_ROWS for stop_times.txt.
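To make the mechanism concrete, here is a minimal Python sketch of how one row with an empty required field can flip a whole file to UNPARSABLE_ROWS. This is not the validator's actual code: `TableStatus`, `REQUIRED_FIELDS`, `load_table`, and the sample rows are all hypothetical and only mirror the behaviour described in this thread.

```python
import csv
import io
from enum import Enum


class TableStatus(Enum):
    # Hypothetical status names, modelled on the ones mentioned in this thread.
    PARSABLE_HEADERS_AND_ROWS = "PARSABLE_HEADERS_AND_ROWS"
    UNPARSABLE_ROWS = "UNPARSABLE_ROWS"


# Assumed set of required columns, for this sketch only.
REQUIRED_FIELDS = ("trip_id", "stop_id", "stop_sequence")


def load_table(csv_text, required_fields=REQUIRED_FIELDS):
    """Parse CSV text and return (status, notices)."""
    notices = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header is row 1, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        for field in required_fields:
            if not (row.get(field) or "").strip():
                notices.append(("missing_required_field", row_number, field))
    # Any row-level error marks the whole file as unparsable.
    if notices:
        return TableStatus.UNPARSABLE_ROWS, notices
    return TableStatus.PARSABLE_HEADERS_AND_ROWS, notices


# Made-up Flex-style rows: the second data row uses location_id, not stop_id.
flex_stop_times = (
    "trip_id,stop_id,location_id,stop_sequence\n"
    "t1,s1,,1\n"
    "t1,,zone_a,2\n"
)
status, notices = load_table(flex_stop_times)
print(status.name)  # UNPARSABLE_ROWS
print(notices)      # [('missing_required_field', 3, 'stop_id')]
```

In this sketch a single empty stop_id is enough to change the file status, which matches what we saw with stop_times.txt.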

We're only planning to modify the logic of missing_required_field for Flex feeds, not invalid_url or invalid_timezone. I think we'd proceed by continuing the work in #1721 and see how often these feeds fail to parse files by completing #1775 cc @davidgamez @qcdyx
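One possible shape for a Flex-aware missing_required_field check is sketched below in Python. The thread does not say how the logic will actually change, so the rule here is an assumption: stop_id counts as present when the row carries a location_id or location_group_id instead. The function name `missing_stop_id` is hypothetical.

```python
def missing_stop_id(row):
    """Return True if the row should raise missing_required_field for stop_id.

    Assumed rule (not confirmed in this thread): location_id or
    location_group_id can stand in for stop_id on a Flex row.
    """
    def has(field):
        return bool((row.get(field) or "").strip())

    return not (has("stop_id") or has("location_id") or has("location_group_id"))


# Made-up rows for illustration.
print(missing_stop_id({"trip_id": "t1", "stop_id": "s1"}))                         # False
print(missing_stop_id({"trip_id": "t1", "stop_id": "", "location_id": "zone_a"}))  # False
print(missing_stop_id({"trip_id": "t1", "stop_id": "", "location_id": ""}))        # True
```

Under this assumed rule, only rows with none of the three identifiers would still produce the error.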

davidgamez commented 5 months ago

To clarify: when an unparsable error is triggered, it only affects the single-file validators for the file in question. In this case, only the agency.txt validators are affected.
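A rough Python sketch of that scoping, assuming a simple registry of single-file validators keyed by filename. The function `run_single_file_validators`, the validator names, and the status strings are illustrative and are not taken from the validator's code base. Multi-file validators are not modelled here.

```python
def run_single_file_validators(table_statuses, validators):
    """Run each validator unless its file is unparsable.

    table_statuses: dict mapping filename to a status string.
    validators: list of (name, filename, callable) tuples.
    Returns (ran, skipped) lists of validator names.
    """
    ran, skipped = [], []
    for name, filename, check in validators:
        if table_statuses.get(filename) == "UNPARSABLE_ROWS":
            # Only validators tied to the unparsable file are skipped.
            skipped.append(name)
            continue
        check()
        ran.append(name)
    return ran, skipped


# Hypothetical statuses and validators for illustration.
statuses = {
    "agency.txt": "UNPARSABLE_ROWS",
    "stop_times.txt": "PARSABLE_HEADERS_AND_ROWS",
}
validators = [
    ("agency_check", "agency.txt", lambda: None),
    ("stop_times_check", "stop_times.txt", lambda: None),
]
ran, skipped = run_single_file_validators(statuses, validators)
print(ran)      # ['stop_times_check']
print(skipped)  # ['agency_check']
```

The point of the sketch is that an unparsable agency.txt leaves the stop_times.txt validators untouched.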

emmambd commented 2 months ago

Based on the findings from #1749, this no longer appears to be an issue now that missing_required_field has been modified. cc @jcpitre