MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0

Maximize the number of notices that can be checked when there is a parsing problem. #1484

Open isabelle-dr opened 1 year ago

isabelle-dr commented 1 year ago

Currently, when there is a single parsing problem in the data, none of the multi-file validators run.

This creates issues such as https://github.com/MobilityData/gtfs-validator/issues/1096 or https://github.com/MobilityData/gtfs-validator/issues/1167.

We want to refine this logic so that only the validators that depend on the data being properly formatted are skipped. For example, route_color_contrast is dependent on the color being properly formatted (or on invalid_color not being triggered). If a color is not properly formatted, we only want the validator that triggers route_color_contrast to be skipped, as opposed to all the multi-file validators.
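A minimal sketch of the desired granularity, assuming a hypothetical per-row check (the class, record, and method names here are illustrative, not the actual gtfs-validator API, and the luminance comparison is a simplified stand-in for a real contrast formula): a malformed color yields a null field, so only this one check is suppressed for that row while every other multi-file validator still runs.

```java
// Hypothetical sketch, not the gtfs-validator API: a per-row color-contrast
// check that skips rows whose color fields failed to parse. A malformed
// route_color then suppresses only this notice (invalid_color was already
// reported by the parser) instead of disabling every multi-file validator.
final class RouteColorContrastCheck {

  /** Illustrative parsed row; in the real code this would be a GTFS table entity. */
  record RouteRow(Integer routeColor, Integer routeTextColor) {}

  /** Simplified luminance of a 0xRRGGBB color (stand-in for a real contrast formula). */
  private static double luminance(int rgb) {
    int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
    return 0.2126 * r + 0.7152 * g + 0.0722 * b;
  }

  /** Emits a notice only when both colors parsed and their contrast is too low. */
  static boolean shouldReportLowContrast(RouteRow row) {
    if (row.routeColor() == null || row.routeTextColor() == null) {
      // The parser already flagged invalid_color (or the field is absent),
      // so piling a contrast notice on top would be spurious.
      return false;
    }
    return Math.abs(luminance(row.routeColor()) - luminance(row.routeTextColor())) < 128.0;
  }

  public static void main(String[] args) {
    // Malformed color -> null -> the check is skipped for this row only.
    System.out.println(shouldReportLowContrast(new RouteRow(null, 0xFFFFFF)));      // false
    System.out.println(shouldReportLowContrast(new RouteRow(0xFFFFFF, 0xFFFFFF)));  // true
  }
}
```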

isabelle-dr commented 1 year ago

Related to https://github.com/MobilityData/gtfs-validator/issues/1485

bdferris-v2 commented 1 year ago

Hey, this issue has recently become more urgent for me. I'd be interested in contributing some cycles to finding a solution here, especially one that allows a maximal number of validators to run.

bdferris-v2 commented 1 year ago

@davidgamez and I had discussed in the past the idea of conditionally running more multi-file validators if their underlying data dependencies don't have parse errors. I've got an initial implementation of that approach in PR #1496.

Here, if a FileValidator has an injected dependency on a GTFS table that has parse errors, then the validator still wouldn't run, because the underlying table might be missing data that would cause spurious additional errors (e.g. foreign key reference validation). However, if all the injected dependencies are ok, then we can still run the validator.
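A rough sketch of that dependency check, under stated assumptions (this is not the code in PR #1496; the `MultiFileValidator` interface, `tableDependencies()` method, and runner are hypothetical): a validator runs only when none of the tables it is injected with had parse errors, and skipped validators are recorded with the tables that caused the skip.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the approach described above: skip a multi-file
// validator only when one of its injected table dependencies had parse errors,
// instead of skipping all multi-file validators whenever any file failed to parse.
final class ConditionalValidatorRunner {

  /** Stand-in for a FileValidator-style multi-file validator. */
  interface MultiFileValidator {
    /** Names of the GTFS tables this validator reads, e.g. "routes.txt", "trips.txt". */
    Set<String> tableDependencies();

    void validate();
  }

  /**
   * Runs each validator whose dependencies are all clean; records the others as
   * skipped so the report can explain why their notices are missing.
   */
  static void runAll(List<MultiFileValidator> validators, Set<String> tablesWithParseErrors,
                     Map<String, Set<String>> skippedValidators) {
    for (MultiFileValidator validator : validators) {
      Set<String> brokenDeps = new HashSet<>(validator.tableDependencies());
      brokenDeps.retainAll(tablesWithParseErrors);
      if (brokenDeps.isEmpty()) {
        validator.validate();
      } else {
        // A table this validator depends on may be missing rows or fields, which
        // would cause spurious notices (e.g. false foreign-key violations).
        skippedValidators.put(validator.getClass().getSimpleName(), brokenDeps);
      }
    }
  }
}
```

Recording which validators were skipped, and which broken tables caused the skip, keeps the report transparent about why some notices are absent from the output.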

I think this approach strikes a reasonable balance: it runs more validators without requiring potentially significant engineering to run all of them (e.g. making each validator resilient to invalid underlying data).

Thoughts?