MobilityData / gtfs-realtime-validator

Java-based tool that validates General Transit Feed Specification (GTFS)-realtime feeds

Clarify batch processing docs and behavior in context of multiple files, cross message rules (E047) #100

Open isabelle-dr opened 2 years ago

isabelle-dr commented 2 years ago

Issue by evansiroky Feb 10, 2022 Originally opened as https://github.com/CUTR-at-USF/gtfs-realtime-validator/issues/411


Summary:

While reading the code, I realized that some validation rules look back at previous messages. This implies that the files being validated are expected to come from a single stream of one RT file type.

Steps to reproduce:

If multiple RT file types are validated in the same batch, their differing header timestamps could cause timestamp validation rules to be triggered, or not triggered, incorrectly.

Expected behavior:

The batch file documentation should recommend validating only one RT file type at a time.

Observed behavior:

The batch file documentation does not recommend validating only one RT file type at a time.

Platform:

https://github.com/cal-itp/gtfs-rt-validator-api/

isabelle-dr commented 2 years ago

Comment by barbeau Feb 11, 2022


@evansiroky Thanks for pointing this out. I agree that the time dependency of some rules could be called out more explicitly.

IIRC the basic assumption we made was that all files in a directory being validated would come from the same feed stream. So if you're mixing feeds (e.g., different .pb file sources) in the same directory it could cause issues.
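
To illustrate why mixing streams matters for rules that look back at the previous message, here is a minimal sketch. It does not use the validator's code: `decreasingTimestampIndices` is a hypothetical stand-in for a rule that compares each header timestamp with the one before it, and the timestamps are made-up values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the validator's real classes or rules.
final class MixedStreamSketch {

    // Returns the indices of messages whose header timestamp is lower than
    // the timestamp of the message processed immediately before them.
    static List<Integer> decreasingTimestampIndices(List<Long> headerTimestamps) {
        List<Integer> flagged = new ArrayList<>();
        for (int i = 1; i < headerTimestamps.size(); i++) {
            if (headerTimestamps.get(i) < headerTimestamps.get(i - 1)) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Two streams that are each well-behaved on their own (made-up values).
        List<Long> tripUpdates = List.of(100L, 130L, 160L);
        List<Long> vehiclePositions = List.of(95L, 125L, 155L);

        // The same files processed as one sequence, as if they shared a directory.
        List<Long> interleaved = List.of(100L, 95L, 130L, 125L, 160L, 155L);

        System.out.println("TripUpdates only:      " + decreasingTimestampIndices(tripUpdates));
        System.out.println("VehiclePositions only: " + decreasingTimestampIndices(vehiclePositions));
        System.out.println("Mixed directory:       " + decreasingTimestampIndices(interleaved));
    }
}
```

Each stream alone produces no flags, while the interleaved sequence is flagged at every second message, even though neither feed did anything wrong.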

Also, note that there is a -sort parameter that controls if the file name or date is used as the "current" time for these rules: https://github.com/CUTR-at-USF/gtfs-realtime-validator/tree/master/gtfs-realtime-validator-lib#command-line-config-parameters.

Also FYI, the validator should support mixed feeds, where you have multiple entity types in the same PB file (e.g., VehiclePosition and TripUpdates).

evansiroky commented 2 years ago

Looking at this further, I am wondering how this plays into the calculation of E047 errors. I'm confused about how this is all supposed to work in a batch context now.

barbeau commented 2 years ago

@evansiroky I agree that the logic of the cross-message validation needs to be reviewed in the context of the batch processor.

From what I recall, when you're using the webapp and you enter multiple URLs, all the entity objects (TUs, VPs) from both URLs get aggregated into a single list that's validated for all the rules, including the cross-message validation like E047.

In batch processing, only one RT file is read at a time into that same entity list, so unless you have mixed entity types in the same file (i.e., VPs and TUs at the same endpoint URL) you'll never see E047.
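
The difference between the two modes described above can be sketched roughly as follows. This is a simplified model, not the validator's implementation: `Entity`, `pairingMismatches`, `validatePerFile`, and `validateAggregated` are hypothetical names, and the check assumes that a cross-message rule like E047 compares trip/vehicle ID pairs between TripUpdates and VehiclePositions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; not the validator's real classes.
final class CrossMessageSketch {

    enum Kind { TRIP_UPDATE, VEHICLE_POSITION }

    // One entity, reduced to the fields a trip/vehicle pairing check would need.
    record Entity(Kind kind, String tripId, String vehicleId) {}

    // Returns trip IDs whose vehicle ID differs between a TripUpdate and a
    // VehiclePosition found in the SAME entity list (assumed stand-in for E047).
    static List<String> pairingMismatches(List<Entity> entities) {
        Map<String, String> vehicleByTrip = new HashMap<>();
        for (Entity e : entities) {
            if (e.kind() == Kind.TRIP_UPDATE) {
                vehicleByTrip.put(e.tripId(), e.vehicleId());
            }
        }
        List<String> mismatches = new ArrayList<>();
        for (Entity e : entities) {
            if (e.kind() == Kind.VEHICLE_POSITION) {
                String expected = vehicleByTrip.get(e.tripId());
                if (expected != null && !expected.equals(e.vehicleId())) {
                    mismatches.add(e.tripId());
                }
            }
        }
        return mismatches;
    }

    // Batch-style: every file is its own entity list and is checked in isolation.
    static List<String> validatePerFile(List<List<Entity>> files) {
        List<String> all = new ArrayList<>();
        for (List<Entity> file : files) {
            all.addAll(pairingMismatches(file));
        }
        return all;
    }

    // Webapp-style: entities from all sources are merged before the check runs.
    static List<String> validateAggregated(List<List<Entity>> files) {
        List<Entity> merged = new ArrayList<>();
        files.forEach(merged::addAll);
        return pairingMismatches(merged);
    }

    public static void main(String[] args) {
        List<Entity> tripUpdates =
                List.of(new Entity(Kind.TRIP_UPDATE, "trip-1", "bus-A"));
        List<Entity> vehiclePositions =
                List.of(new Entity(Kind.VEHICLE_POSITION, "trip-1", "bus-B"));
        List<List<Entity>> files = List.of(tripUpdates, vehiclePositions);

        System.out.println("per file:   " + validatePerFile(files));
        System.out.println("aggregated: " + validateAggregated(files));
    }
}
```

In this sketch the per-file path never reports the mismatch because no single list contains both entity types, while the aggregated path reports `trip-1`.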

We should verify what the current behavior is and document it appropriately, and possibly allow dynamic merging of multiple files (e.g., with same timestamp in file name? That probably isn't realistic...) so that files from multiple feed endpoints (URL for TUs, URL for VPs) could be verified together for rules like E047 in batch mode. I updated the issue title to reflect this.
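
As a rough sketch of what merging by timestamp in the file name might involve, the example below groups file names by a shared timestamp. Everything here is an assumption for illustration: the `<feedType>_<epochSeconds>.pb` naming convention, the class name, and `groupByTimestamp` are hypothetical and are not part of the validator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the file naming convention is an assumption.
final class TimestampGroupingSketch {

    // Assumed convention: <feedType>_<epochSeconds>.pb
    private static final Pattern NAME = Pattern.compile("^(.+)_(\\d{1,18})\\.pb$");

    // Groups file names by the timestamp embedded in the name, ordered by time.
    // Names that do not follow the assumed convention are skipped.
    static Map<Long, List<String>> groupByTimestamp(List<String> fileNames) {
        Map<Long, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = NAME.matcher(name);
            if (!m.matches()) {
                continue;
            }
            long timestamp = Long.parseLong(m.group(2));
            groups.computeIfAbsent(timestamp, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
                "trip_updates_1644500000.pb",
                "vehicle_positions_1644500000.pb",
                "trip_updates_1644500030.pb",
                "notes.txt");

        // Each group could then be merged into one entity list before validation.
        groupByTimestamp(names).forEach((timestamp, group) ->
                System.out.println(timestamp + " -> " + group));
    }
}
```

Exact-match grouping only works if the archiver writes identical timestamps for files fetched together, which is the realism concern raised above; a tolerance window would be one possible alternative.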

@evansiroky @e-lo It would be useful to better understand what your file system looks like for archiving files from multiple feed endpoints (TUs, VPs) for a single provider and how you think the batch processor command-line options or other configuration could be set up in your use case to run E047 and similar cross-entity rules.