MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0
288 stars 101 forks

Optimisation: Do not run validators on columns that are not present #1839

Closed jcpitre closed 1 month ago

jcpitre commented 2 months ago

Describe the problem

In #1749 we reached a point where one of the datasets was so big that we ran into out-of-memory issues and the run time increased significantly. We could check whether a column exists in a file and skip the validators related to that column when it does not. In particular, for the foreign key validator, if the column that carries the annotation does not exist, the validator should not run. This could be especially useful for stop_times, which usually has the largest number of records and, with the addition of flex, now has 5 fields with the ForeignKey annotation.

Note: #1747 tackles the same problem, but is much broader in scope.

Proposed solution

See above. The exact mechanism is TBD.
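As a rough illustration of the idea described above, here is a minimal Java sketch that keeps only the foreign-key checks whose child column appears in the file header. Everything in it is hypothetical: `ForeignKeyColumnFilter`, `ForeignKey`, and `applicable` are names made up for this example and are not part of the gtfs-validator code base, and the foreign keys listed are an illustrative subset rather than the actual set of annotated stop_times fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from the gtfs-validator API.
final class ForeignKeyColumnFilter {

  /** One foreign-key relation: a child column that refers to a parent file and column. */
  static final class ForeignKey {
    final String childColumn;
    final String parentFile;
    final String parentColumn;

    ForeignKey(String childColumn, String parentFile, String parentColumn) {
      this.childColumn = childColumn;
      this.parentFile = parentFile;
      this.parentColumn = parentColumn;
    }
  }

  /** Returns only the foreign keys whose child column is present in the header. */
  static List<ForeignKey> applicable(Set<String> headerColumns, List<ForeignKey> declared) {
    List<ForeignKey> result = new ArrayList<>();
    for (ForeignKey fk : declared) {
      if (headerColumns.contains(fk.childColumn)) {
        result.add(fk);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Header of a stop_times.txt that does not use any flex column.
    Set<String> header =
        Set.of("trip_id", "arrival_time", "departure_time", "stop_id", "stop_sequence");

    // Illustrative subset of foreign keys that could be declared on stop_times.
    List<ForeignKey> declared = List.of(
        new ForeignKey("trip_id", "trips.txt", "trip_id"),
        new ForeignKey("stop_id", "stops.txt", "stop_id"),
        new ForeignKey("pickup_booking_rule_id", "booking_rules.txt", "booking_rule_id"),
        new ForeignKey("location_group_id", "location_groups.txt", "location_group_id"));

    // Only the checks for columns that exist are kept; the others never run.
    for (ForeignKey fk : applicable(header, declared)) {
      System.out.println("would validate " + fk.childColumn + " against " + fk.parentFile);
    }
  }
}
```

The check costs one header lookup per declared foreign key, so any saving comes from never iterating the rows of a large file for a column that is not there.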

Alternatives you've considered

No response

Additional context

No response

qcdyx commented 2 months ago

Tasks:

jcpitre commented 1 month ago

See here for a sheet with all the validators we use.

davidgamez commented 1 month ago

Thanks, @jcpitre; this is a very detailed proposal. I support the idea of implementing a shouldCallValidate method. As this method depends on information that the validator possesses after its creation and not on a specific row, I suggest calling it once, before adding the validator to the executable validator list, instead of before validating each row. This reduces the number of calls to one per file instead of one per row. In addition, we could report the validators skipped due to missing columns, similar to the report we have for unparsable rows.
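The suggestion above could look roughly like the following Java sketch, in which the shouldCallValidate decision is made once per file while the list of executable validators is built, and the skipped validators are collected for reporting. Only the name `shouldCallValidate` comes from this thread; its signature here, and the `FileValidator`, `ColumnValidator`, `Selection`, `select`, and `run` names, are assumptions made for the example and do not reflect the actual gtfs-validator classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these types are illustrative, not the gtfs-validator API.
final class ValidatorSelection {

  /** Minimal stand-in for a validator that is applied to every row of one file. */
  interface FileValidator {
    String name();

    /** Decided once per file, from information known when the validator is created. */
    boolean shouldCallValidate(Set<String> headerColumns);

    void validate(List<String> row);
  }

  /** Validator bound to a single column; not worth running when that column is absent. */
  static final class ColumnValidator implements FileValidator {
    private final String column;
    int rowsValidated = 0;

    ColumnValidator(String column) {
      this.column = column;
    }

    @Override
    public String name() {
      return "ColumnValidator(" + column + ")";
    }

    @Override
    public boolean shouldCallValidate(Set<String> headerColumns) {
      return headerColumns.contains(column);
    }

    @Override
    public void validate(List<String> row) {
      rowsValidated++; // A real validator would inspect the row here.
    }
  }

  /** Validators that will run, plus the names of those skipped (for a report). */
  static final class Selection {
    final List<FileValidator> executable = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();
  }

  /** Calls shouldCallValidate once per validator, before any row is read. */
  static Selection select(Set<String> headerColumns, List<FileValidator> candidates) {
    Selection selection = new Selection();
    for (FileValidator validator : candidates) {
      if (validator.shouldCallValidate(headerColumns)) {
        selection.executable.add(validator);
      } else {
        selection.skipped.add(validator.name());
      }
    }
    return selection;
  }

  /** The per-row loop only touches validators that passed the per-file check. */
  static void run(Selection selection, List<List<String>> rows) {
    for (List<String> row : rows) {
      for (FileValidator validator : selection.executable) {
        validator.validate(row);
      }
    }
  }
}
```

With this shape, a file with N rows and V candidate validators costs V calls to shouldCallValidate rather than N × V, and the `skipped` list provides the raw material for a "skipped validators" entry in the report.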