Closed jcpitre closed 1 month ago
Tasks:
See here for a sheet with all the validators we use.
SingleEntityValidators:
boolean shouldCallValidate(GtfsEntityContainer, NoticeContainer)
@GtfsValidator(skipIfTheseColumnsAbsent = {"is_producer", "is_operator", "is_authority"})
FileValidators:
boolean shouldCallValidate(NoticeContainer)
Thanks, @jcpitre; this is a very detailed proposal. I support the idea of implementing a shouldCallValidate method. Because this method depends on information the validator has once it is created, and not on a specific row, I suggest calling it once before adding the validator to the executable validator list, rather than before validating each row. This reduces the number of calls to one per file instead of one per row. In addition, we could report the validators skipped due to missing columns, similar to the report we have for unparsable rows.
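To make the suggestion concrete, here is a minimal sketch of deciding once per file which validators go on the executable list, while collecting the skipped ones for a report. Every name in it (ValidatorSelectionSketch, ColumnBoundValidator, Selection, select) is hypothetical and not the validator's real API. It also assumes that a validator is skipped only when all of its listed columns are absent, which is one possible reading of skipIfTheseColumnsAbsent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; these types are illustrative stand-ins only. */
final class ValidatorSelectionSketch {

  /** Stand-in for a validator that is only useful when certain columns exist. */
  static final class ColumnBoundValidator {
    final String name;
    final Set<String> columns;

    ColumnBoundValidator(String name, String... columns) {
      this.name = name;
      this.columns = new HashSet<>(Arrays.asList(columns));
    }

    /** Assumption: run if at least one listed column is present in the file. */
    boolean shouldCallValidate(Set<String> presentColumns) {
      for (String column : columns) {
        if (presentColumns.contains(column)) {
          return true;
        }
      }
      return false;
    }
  }

  /** Outcome of the one-time selection: validators to run and names of skipped ones. */
  static final class Selection {
    final List<ColumnBoundValidator> toRun = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();
  }

  /** Calls shouldCallValidate once per validator, before any row is processed. */
  static Selection select(List<ColumnBoundValidator> candidates, Set<String> presentColumns) {
    Selection selection = new Selection();
    for (ColumnBoundValidator validator : candidates) {
      if (validator.shouldCallValidate(presentColumns)) {
        selection.toRun.add(validator);
      } else {
        // Kept so a "skipped validators" section can be added to the report.
        selection.skipped.add(validator.name);
      }
    }
    return selection;
  }
}
```

With this shape, the per-row loop only ever sees the validators in toRun, so the cost of the check does not grow with the number of rows.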
Describe the problem
In #1749 we reached a point where one of the datasets was so big that we ran into out-of-memory issues and the run time increased significantly. We could check whether a column exists in a file and skip the validators related to that column when it does not. In particular, for the foreign key validator, if the column that has the annotation does not exist, don't run the validator. This could be especially useful for stop_times, which usually has the largest number of records and, with the addition of flex, now has 5 fields with the ForeignKey annotation.
Note: #1747 tackles the same problem, but is much broader in scope.
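As a rough illustration of the column check described in the problem statement, the sketch below reads a file's header line and lists which foreign-key columns are absent, so that their checks could be skipped. The class and method names are hypothetical, the header parsing assumes plain comma-separated names with no quoting, and the column names in the usage note are examples only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not part of the validator's code base. */
final class HeaderColumnCheckSketch {

  /** Splits a CSV header line into trimmed column names (assumes no quoted commas). */
  static Set<String> presentColumns(String headerLine) {
    Set<String> columns = new LinkedHashSet<>();
    for (String name : headerLine.split(",")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        columns.add(trimmed);
      }
    }
    return columns;
  }

  /** Returns the foreign-key columns that are missing from the file, in input order. */
  static List<String> absentForeignKeyColumns(Set<String> present, List<String> foreignKeyColumns) {
    List<String> absent = new ArrayList<>();
    for (String column : foreignKeyColumns) {
      if (!present.contains(column)) {
        absent.add(column);
      }
    }
    return absent;
  }
}
```

For a header such as "trip_id,stop_id,stop_sequence" and candidate columns trip_id, stop_id and location_id, only location_id would be reported as absent, so only that foreign-key check would be skipped.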
Proposed solution
See above. The exact mechanism is TBD.
Alternatives you've considered
No response
Additional context
No response