MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0
288 stars 101 forks source link

False positives for StopTimeTimepointWithoutTimesNotice #954

Closed mcplanner-zz closed 2 years ago

mcplanner-zz commented 3 years ago

Bug report

Describe the bug The StopTimeTimepointWithoutTimesNotice error is displayed when the stop_times.timepoint field is blank. This is unexpected. I would expect this to be interpreted as having no timepoints.

How we reproduce the bug Steps to reproduce the behaviour: This happened with this field: http://iportal.sacrt.com/GTFS/Unitrans/google_transit.zip

Expected behaviour If the timepoint field isn't being used, I expect an error about it's being missing.

Observed behaviour I was alerted that every timepoint was lacking an arrival and departure time.

Screenshots:

Environment versions Java 11, v2.0.0 validator, https://github.com/cal-itp/gtfs-validator

lionel-nj commented 3 years ago

Thanks @mcplanner for opening.

The StopTimeTimepointWithoutTimesNotice error is displayed when the stop_times.timepoint field is blank. This is unexpected. I would expect this to be interpreted as having no timepoints.

The GTFS specification says that if stop_times.timepoint = 1 or empty then the times are considered exact. From my understanding a stop time with no value for timepoint and no value for stop_times.arrival_time or stop_times.departure_time should issue a StopTimeTimepointWithoutTimesNotice.

See stop_times.txt reference.

If the timepoint field isn't being used, I expect an error about it's being missing.

I am not sure if a notice should be generated here, the specification handles the case where no value is attributed to stop_times.timepoint: default value is 1.

Please feel free to close this issue, if that answers your interrogations.

barbeau commented 3 years ago

Here's example data from the above GTFS file:

trip_id arrival_time departure_time stop_id stop_sequence stop_headsign pickup_type drop_off_type shape_dist_traveled
784678 07:25:00 07:25:00 22256 1
784678 22239 2
784678 07:28:00 07:28:00 22361 3
784678 22225 4

tl;dr - This data is fine and the validator shouldn't be logging any errors. The reason is that this data conforms to the original GTFS spec prior to the timepoint field existing (timepoint is optional). The original GTFS spec said that you should only provide times for records that were timepoints. So in the above data stop_sequence 1 and 3 are timepoints, and 2 and 4 are not.

IMHO the timepoint definition really needs to be clarified in the spec as it currently doesn't communicate this concept clearly - see https://github.com/google/transit/issues/61.

@MobilityData/transit-specs

barbeau commented 3 years ago

If the timepoint field isn't being used, I expect an error about it's being missing.

The GTFS Best Practices (http://gtfs.org/best-practices/#stop_timestxt) say:

The timepoint field should be provided. It specifies which stop_times the operator will attempt to strictly adhere to (timepoint=1), and that other stop times are estimates (timepoint=0).

Sticking to the canonical validator definition, timepoint field missing couldn't be an error, but could certainly be a warning.

lionel-nj commented 3 years ago

Thanks for the clarification @barbeau! Following your suggestion, I agree that the spec is not clear on this particular point. I'll keep this issue open so that we can fix this bug.

Does the following seems reasonable?

  1. Identify if stop_times.txt comes from a legacy dataset: does the header contain column timepoint or not? This means that we'll have to modify the core module to include the boolean field hasTimepointColumn (or similar) boolean to GtfsStopTimeTableContainer.
  2. Remove defaut value 1 from GtfsStopTimeSchema for field timepoint: https://github.com/MobilityData/gtfs-validator/blob/464ff36316162e71d7f154e64e00d87b21f91dd4/main/src/main/java/org/mobilitydata/gtfsvalidator/table/GtfsStopTimeSchema.java#L77
  3. Issue a notice (warning) if a row from stop_times.txt has value timepoint=1 and no value for stop_times.arrival_time and/or stop_times.departure_time.
barbeau commented 3 years ago

Here are the cases to cover.

For all stop_times.txt files:

If timepoint column does not exist:

If timepoint column exists:

There is also the case where a stop_times.txt file does not have the timepoint column, but all records have arrival and departure times populated. Technically this was against the original GTFS spec, but it became a common practice, and therefore we shouldn't consider it an error because it will invalidate a lot of old datasets. In this case we'll emit a warning anyway that timepoint should be added.

Does that make sense?

lionel-nj commented 3 years ago

Here are the cases to cover.

For all stop_times.txt files:

  • First and last record for each trip must have arrival and departure time, else ERROR

✅ Covered in MissingTripEdgeValidator.

  • If arrival time or departure time is populated for a record, then both arrival time AND departure time must be populated for that record, else ERROR

✅ Covered in StopTimeArrivalAndDepartureTimeValidator.

If timepoint column does not exist:

  • Emit WARNING that timepoint column should be added

If timepoint column exists:

  • All records must have 0 or 1 value in timepoint column, else ERROR
  • If timepoint==1, arrival and departure time must both be populated, else ERROR

There is also the case where a stop_times.txt file does not have the timepoint column, but all records have arrival and departure times populated. Technically this was against the original GTFS spec, but it became a common practice, and therefore we shouldn't consider it an error because it will invalidate a lot of old datasets. In this case we'll emit a warning anyway that timepoint should be added.

TimepointTimeValidator should be modified to this extent.

Does that make sense?

Yes, @isabelle-dr should the specification be clarified first or could these changes be implemented before?

isabelle-dr commented 3 years ago

Hey all, thanks for flagging and looking into this.

The case of legacy data with no timepoint column (prior to the timepoint field existing) was flagged when we initially introduced that rule.

I agree that the root problem is that the specification contains two contradictory statements about the meaning of an "empty" value in stop_times.timepoint

  1. in stop_times.timepoint:

1 or empty - Times are considered exact.

  1. in stop_times.departure_time and stop_times.arrival_time:

Conditionally Required: Required for timepoint=1. Optional otherwise.

The former could mean thatstop_times.timepoint == 1 is equivalent than stop_times.timepoint == "". The latter could mean that the times are required when stop_times.timepoint == 1 and are not required when stop_times.timepoint == "". If we consider that an empty timepoint column is equivalent to empty values in all stops_times.timepoint fields, all legacy datasets become invalid, which is how it was coded in the validator and what @mcplanner-zz is flagging here. She also points out how counter-intuitive it is that having no values for timepoint is considered a timepoint.

This is extremely well described in google/transit/issues/61. The "empty" word in the first statement seems to actually refer to an empty timepoint column which maintains the backwards compatibility.

How should the validator work?

If timepoint column does not exist:

Emit WARNING that timepoint column should be added

Absolutely, and this will be a new notice based on the Best Practices, that we should add at some point.

If timepoint column exists:

All records must have 0 or 1 value in timepoint column, else ERROR If timepoint==1, arrival and departure time must both be populated, else ERROR

I would only implement the second statement proposed here (at least for now) because the first is stricter than the specification: it would make the following data invalid, and there is nothing in spec currently that would justify that. stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2
3
4 00:10:00 00:10:00 1

@lionel-nj I'd say we're good to make the changes in the validator to maintain the backwards compatibility (second statement proposed by Sean, and your initial idea), but not upgrading this rule to error, this will be done in the PR we already have.

isabelle-dr commented 3 years ago

Here are some examples of what would change, @barbeau does this seem right?

Case 1: legacy dataset

Current behaviour: The notice is emitted for stops 2 & 3 Expected behaviour: No notice stop_sequence arrival_time departure_time
1 00:00:00 00:00:00
2    
3    
4 00:10:00 00:10:00

Case 2: dataset with timepoint column, filled

Current behaviour: No notice Expected behaviour: No notice stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00 0
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

Case 3: invalid data

Current behaviour: The StopTimeTimepointWithoutTimesNotice is emitted for stop 2 Expected behaviour: The StopTimeTimepointWithoutTimesNotice is emitted for stop 2 stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 1
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

Case 4 (edge case): dataset with timepoint columns, only 1's, all times populated

Current behaviour: No notice Expected behaviour: No notice stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00
3 00:08:00 00:08:00
4 00:10:00 00:10:00 1

Case 5 (edge case): dataset with timepoint column, one 1's and some empty times

Current behaviour: The StopTimeTimepointWithoutTimesNotice is emitted for stops 2 and 3 Expected behaviour: No notice stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2
3
4 00:10:00 00:10:00 1
barbeau commented 3 years ago

@isabelle-dr Cases 1, 2, and 3 LGTM 👍

I would only implement the second statement proposed here (at least for now) because the first is stricter than the specification: it would make the following data invalid, and there is nothing in spec currently that would justify that.

So Cases 4 and 5 fall into this category. I agree that the current language in the spec is fuzzy around this, although the original intent when introducing the timepoint column was to specify all or no values - obviously in hindsight it would have been better if it said that explicitly.

Do we know if there are real datasets in the wild that are partially populating timepoint like Case 4 and 5? If so, then I agree that the spec probably needs to be clarified before classifying this as an error. If not, IMHO it could go either way, and IMHO it should be an error.

isabelle-dr commented 3 years ago

@barbeau here are some interesting edge cases with partial implementations, with some real data samples.

stop_sequence arrival_time departure_time timepoint
0 13:15:00 13:15:00
1 13:30:00 13:30:00 0
2 13:40:00 13:40:00 0
3 13:50:00 13:50:00
4 14:00:00 14:00:00 0

They seem to have assumed that since 1 or empty - Times are considered exact. (stop_times.timepoint description), the empty values mean the times are exact. This data doesn't contain any 1's in the timepoint column.

stop_sequence arrival_time departure_time timepoint
1 8:30:00 8:30:00 1
2 8:31:01 8:31:01 1
3
4
5
6
7
8
9
10 8:45:00 8:45:00 1
11 8:55:00 8:55:00 1

Here, they seem to have assumed that since times are Conditionally Required: Required for timepoint=1. Optional otherwise.(stop_times.departure_time and stop_times.arrival_time description), the empty values mean the times aren't exact. This data doesn't contain any 0's in the timepoint column.

trip_id stop_sequence arrival_time departure_time timepoint
1167963915753602:T1218:39:00 1 18:39:00 18:39:00 1
1167963915753602:T1218:39:00 2 18:42:27 18:42:27 0
1167963915753602:T1218:39:00 3 18:43:21 18:43:21 0
1167963915753602:T1218:39:00 4 18:44:36 18:44:36 0
1167963915753602:T1218:39:00 5 18:45:38 18:45:38 0
1167963915753602:T1218:39:00 6 18:47:34 18:47:34 0
1167963915753602:T1218:39:00 7 18:48:00 18:48:00 1
1167963915753602:T1218:39:00 8 18:50:37 18:50:37 0
1167963915753602:T1218:39:00 9 18:52:00 18:52:00 1
kcc_toCleElum_weekday_T01 1 5:30:00 5:30:00
kcc_toCleElum_weekday_T01 2
kcc_toCleElum_weekday_T01 3
kcc_toCleElum_weekday_T01 4
kcc_toCleElum_weekday_T01 5
kcc_toCleElum_weekday_T01 6
kcc_toCleElum_weekday_T01 7 6:10:00 6:15:00
kcc_toCleElum_weekday_T01 8
kcc_toCleElum_weekday_T01 9 6:20:00 6:20:00

This part of the data is at the end of the dataset. The timepoint values have been added for most of the lines, but after this point, all timepoint values are blank. Does this mean all times are exact? Or approximate? I believe neither, the values just haven't been added.

There are also datasets that added the timepoint column but all values are blank, and it seems obvious that it doesn't mean all values are timepoints, see the Via Rail Canada dataset below:

stop_sequence arrival_time departure_time timepoint
1 3:30:00 3:30:00
2
3 4:07:00 4:07:00
4
5
6
7 5:01:00 5:01:00
8 5:12:00 5:12:00
9 5:20:00 5:20:00
10
barbeau commented 3 years ago

😱

Yikes - so there is real-world data all over the place. It would be nice to get this clarified in the spec ASAP so we can implement the original intent in the validator and get these datasets cleaned up. There seemed to be clear agreement on how the data was intended to be represented in the original proposal IIRC, the wording just apparently wasn't clear enough.

lionel-nj commented 3 years ago

As agreed with @isabelle-dr and @barbeau, we will implement the following solutions:

  1. Emit a notice when timepoint = 1 AND times are missing.

This rule will consider legacy datasets as valid, and the validator will behave as follows:

sample data 1 (no timepoint column)

stop_sequence arrival_time departure_time
1 00:00:00 00:00:00
2    
3    
4 00:10:00 00:10:00

No notice

sample data 2 (timepoint column + no empty values)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00 0
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

No notice

sample data 3 (timepoint with no times)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 1
3 00:08:00 00:08:00 0
4 00:10:00 00:10:00 1

A notice will be generated for the 2nd row.

*sample data 4 (approximate, with times)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2 00:02:00 00:02:00
3 00:08:00 00:08:00
4 00:10:00 00:10:00 1

No notice

sample data 5 (approximate, with no times)

stop_sequence arrival_time departure_time timepoint
1 00:00:00 00:00:00 1
2
3
4 00:10:00 00:10:00 1

No notice

⚠️ Note that the notice will be considered as a WARNING, the severity will be changed in a separate PR to better assess the potential breaking changes.

  1. Emit a notice (WARNING) when stop_times.timepoint value is missing for a single record.
  2. Emit a notice (WARNING) when column stop_times.timepoint is missing (based on best practices)
lionel-nj commented 3 years ago

@mcplanner-zz could you provide an URL to download the dataset you mentionned in the issue description please? I tried download the dataset linked to the URL you provided but could not: the operation timed out with error code ERR_SOCKET_NOT_CONNECTED.