Optimisation: Do not run validators on columns that are not present

jcpitre commented 2 months ago

Describe the problem

In #1749 one of the datasets was so big that we ran into out-of-memory issues and the run time increased significantly. We could check whether a column exists in a file and, if it does not, skip the validators related to that column. In particular, for the foreign key validator: if the column that carries the annotation does not exist, don't run the validator. This could be especially useful for stop_times, which usually has the largest number of records and, with the addition of flex, now has 5 fields with the ForeignKey annotation.

Note: #1747 tackles the same problem, but is much broader in scope.

Proposed solution

See above. The exact mechanism is TBD.

Alternatives you've considered

No response

Additional context

No response

qcdyx commented 2 months ago

Tasks:

[ ] add the logic to the post processor
[ ] add a condition to the foreign key validator - if column doesn't exist, don't run
[ ] add tests
[ ] review acceptance tests
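
A minimal sketch of the second task, assuming a much simplified model of the foreign key check: rows are plain maps and the parent keys are a set. The class `ForeignKeyCheckSketch` and the method `findViolations` are hypothetical names for illustration, not the validator's actual API. The point is only that the header check happens once, before any row is scanned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified foreign key check; not the real gtfs-validator class. */
final class ForeignKeyCheckSketch {

  /**
   * Returns one message per child row whose non-empty value is not a known parent key.
   * Returns immediately, without scanning any row, when the child column is absent
   * from the file header.
   */
  static List<String> findViolations(
      Set<String> childHeader,
      List<Map<String, String>> childRows,
      String childColumn,
      Set<String> parentKeys) {
    List<String> violations = new ArrayList<>();
    if (!childHeader.contains(childColumn)) {
      // Column not present in the file: nothing can reference the parent table.
      return violations;
    }
    for (int i = 0; i < childRows.size(); i++) {
      String value = childRows.get(i).get(childColumn);
      if (value != null && !value.isEmpty() && !parentKeys.contains(value)) {
        violations.add(
            "row " + (i + 1) + ": " + childColumn + "=" + value + " not found in parent table");
      }
    }
    return violations;
  }
}
```

For a stop_times file whose header has no `location_group_id` column (used here only as an example of a flex field), the call returns an empty list without touching the rows.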

jcpitre commented 1 month ago

See here for a sheet with all the validators we use.

SingleEntityValidators:
- For SingleEntityValidators, the validator is called for every entity in the table being processed.
- Not calling the validator could lead to significant time savings for files that typically have a lot of records, e.g. stop_times.txt, shapes.txt, etc.
- If a validator uses a field that is required, there is no point in adding code to skip validation, since a missing required column means that the other validators are not called for that file.
- For custom validators (not generated)
  - e.g.: AttributionWithoutRoleValidator:
  - My favoured solution is to add a method to SingleEntityValidator named
    - boolean shouldCallValidate(GtfsEntityContainer, NoticeContainer)
  - that would be called in a preliminary phase and would determine if the validate method should be called for each entity.
  - Each child of SingleEntityValidator could override the method if necessary.
  - Another possibility:
    - We could add a parameter to the GtfsValidator annotation that would list the columns for which it's OK not to call the validator if the column is not present.
    - e.g. for AttributionWithoutRoleValidator, the annotation could look like this:
      - @GtfsValidator(skipIfTheseColumnsAbsent = {"is_producer", "is_operator", "is_authority"})
    - But this is more obscure and not as versatile as the first solution (with the added method)
- For generated validators:
  - Validators are generated when fields in the schema are annotated with:
    - MixedCase
    - EndRange
    - CurrencyAmount
    - LatLon
  - Since each annotation has an associated validator generator (e.g. MixedCaseValidatorGenerator.java), we could modify these to add the proper shouldCallValidate() method to the generated validator class.
  - Alternatively, we could use the skipIfTheseColumnsAbsent parameter to the GtfsValidator annotation, but this is not ideal for the same reasons as the custom validators.
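
To make the first option concrete, here is a rough sketch of the shouldCallValidate idea for SingleEntityValidators. Everything in it is a simplified stand-in: `TableSketch` replaces GtfsEntityContainer, a list of strings replaces NoticeContainer, and `NonNegativeColumnSketch` is a hypothetical rule on an optional column, not an existing validator. The real signatures would follow whatever is finally agreed on.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for a loaded table: its header columns plus its rows. */
final class TableSketch {
  final Set<String> columns;
  final List<Map<String, String>> rows;

  TableSketch(Set<String> columns, List<Map<String, String>> rows) {
    this.columns = columns;
    this.rows = rows;
  }

  boolean hasColumn(String name) {
    return columns.contains(name);
  }
}

/** Simplified stand-in for SingleEntityValidator with the proposed hook. */
abstract class SingleEntityValidatorSketch {
  /** Default: always validate. Children override this when a column may be absent. */
  boolean shouldCallValidate(TableSketch table, List<String> notices) {
    return true;
  }

  abstract void validate(Map<String, String> row, List<String> notices);
}

/** Hypothetical rule: values in an optional column must not be negative. */
final class NonNegativeColumnSketch extends SingleEntityValidatorSketch {
  private final String column;

  NonNegativeColumnSketch(String column) {
    this.column = column;
  }

  @Override
  boolean shouldCallValidate(TableSketch table, List<String> notices) {
    // Without the column there is nothing to check in any row.
    return table.hasColumn(column);
  }

  @Override
  void validate(Map<String, String> row, List<String> notices) {
    String value = row.get(column);
    if (value != null && !value.isEmpty() && Double.parseDouble(value) < 0) {
      notices.add(column + " is negative: " + value);
    }
  }
}

final class TableRunnerSketch {
  /** Preliminary phase first, then the per-row loop. Returns the number of validate() calls. */
  static int run(SingleEntityValidatorSketch validator, TableSketch table, List<String> notices) {
    if (!validator.shouldCallValidate(table, notices)) {
      return 0;
    }
    int calls = 0;
    for (Map<String, String> row : table.rows) {
      validator.validate(row, notices);
      calls++;
    }
    return calls;
  }
}
```

A generated validator could emit the same kind of override from its generator, while the default implementation keeps every existing validator unchanged.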
FileValidators
- We could have a method similar to the one proposed for SingleEntityValidator, something like:
  - boolean shouldCallValidate(NoticeContainer)
- but frankly this is not as critical since the first lines of the validate method could do the same task (i.e. just return if the actual validation does not need to be done)

davidgamez commented 1 month ago

Thanks, @jcpitre; this is a very detailed proposal. I support the idea of implementing shouldCallValidate. Since this method depends on information that the validator possesses after its creation and not on a specific row, I suggest calling it once, before adding the validator to the executable validator list, rather than before validating each row. This reduces the number of calls to one per file instead of one per row. In addition, we could report the validators skipped due to missing columns, similar to the report we have for unparsable rows.
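
A possible shape for this, as a sketch only: the check runs once per file while the list of validators to execute is being built, and the names of the skipped validators are kept so they can be reported. `CandidateValidatorSketch` and `ValidatorSelectionSketch` are hypothetical names, and the file header is modelled as a plain set of column names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical pairing of a validator name with its per-file applicability check. */
final class CandidateValidatorSketch {
  final String name;
  final Predicate<Set<String>> shouldCallValidate;

  CandidateValidatorSketch(String name, Predicate<Set<String>> shouldCallValidate) {
    this.name = name;
    this.shouldCallValidate = shouldCallValidate;
  }
}

/** Result of the one-time selection: validators to execute and validators skipped. */
final class ValidatorSelectionSketch {
  final List<String> toRun = new ArrayList<>();
  final List<String> skipped = new ArrayList<>();

  /** Evaluates each candidate exactly once against the file header. */
  static ValidatorSelectionSketch select(
      List<CandidateValidatorSketch> candidates, Set<String> header) {
    ValidatorSelectionSketch selection = new ValidatorSelectionSketch();
    for (CandidateValidatorSketch candidate : candidates) {
      if (candidate.shouldCallValidate.test(header)) {
        selection.toRun.add(candidate.name);
      } else {
        // Kept so a "skipped validators" section can be added to the report.
        selection.skipped.add(candidate.name);
      }
    }
    return selection;
  }
}
```

The per-row loop then iterates only over `toRun`, so the cost of the check does not grow with the number of records.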

MobilityData / gtfs-validator