MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0

New Validation Rule: Number of Trips per Day(s) #1137

Open dancesWithCycles opened 2 years ago

dancesWithCycles commented 2 years ago

Hi folks, Thank you so much for providing and maintaining this repository. Chapeau!

A transport authority is looking for a validator, or better still an extension of an existing validator, that can cover one of their use cases. Which validator would best suit their purpose? The use case is explained below and is also related to issue 1117.

What problem in GTFS datasets does this new rule address? Please describe.

A passenger information system (PIS) is based on GTFS for static transit data. The PIS regularly fails to provide most of the trips on days close to the end_date in calendar.txt. The reason is a large number of exceptions in calendar_dates.txt close to that end_date. As a consequence, the authority is asked to create another GTFS file, triggered not by the end_date in calendar.txt but by the number of trips per day falling below a certain threshold.

For this transport authority, a GTFS file in which a certain number of days have fewer trips than the threshold indicates low data quality. That observation would trigger the creation of another GTFS file.

Describe the new validation rule

A GTFS file is invalid when one day, or a configurable number of days, between start_date and end_date of calendar.txt has fewer trips from trips.txt than a configurable threshold.
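To make the proposed rule concrete, here is a minimal Python sketch of the check. It is not code from gtfs-validator, and every name in it (`active_services`, `days_below_threshold`, `feed_is_flagged`, `min_trips`, `min_low_days`) is hypothetical. It assumes the three files have already been parsed into simple in-memory structures, and that the feed defines its services in calendar.txt (feeds that rely on calendar_dates.txt alone are not handled).

```python
from collections import Counter
from datetime import date, timedelta

# Assumed in-memory stand-ins for the parsed GTFS files:
#   calendar:       {service_id: (start_date, end_date, weekdays)}
#                   weekdays is 7 flags, Monday first
#   calendar_dates: [(service_id, date, exception_type)]
#                   exception_type 1 = service added, 2 = service removed
#   trips:          {trip_id: service_id}


def active_services(day, calendar, calendar_dates):
    """Return the set of service_ids that run on `day`."""
    active = {
        sid
        for sid, (start, end, weekdays) in calendar.items()
        if start <= day <= end and weekdays[day.weekday()]
    }
    for sid, exception_day, exception_type in calendar_dates:
        if exception_day != day:
            continue
        if exception_type == 1:
            active.add(sid)
        elif exception_type == 2:
            active.discard(sid)
    return active


def days_below_threshold(calendar, calendar_dates, trips, min_trips):
    """List (day, trip_count) for every day with fewer than `min_trips` trips."""
    if not calendar:
        return []
    trips_per_service = Counter(trips.values())
    first = min(start for start, _, _ in calendar.values())
    last = max(end for _, end, _ in calendar.values())
    low_days = []
    day = first
    while day <= last:
        services = active_services(day, calendar, calendar_dates)
        count = sum(trips_per_service[sid] for sid in services)
        if count < min_trips:
            low_days.append((day, count))
        day += timedelta(days=1)
    return low_days


def feed_is_flagged(low_days, min_low_days=1):
    """Flag the feed once at least `min_low_days` days are below the threshold."""
    return len(low_days) >= min_low_days
```

Both `min_trips` and `min_low_days` correspond to the two configurable values mentioned in the rule description; whether a flagged feed should be reported as an error, warning or info is left open here.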

Error vs warning

I am an expert neither in the GTFS spec nor in the best practices. Whether the result of this rule should be an error, info or warning might depend on perspective and is a topic for discussion. It would probably be an error from the perspective of the mentioned transit authority.

I appreciate any hint in the right direction!

Cheers!

derhuerst commented 2 years ago

Another possible metric would be the ratio of trips running vs those "cancelled" by calendar_dates.txt.

barbeau commented 2 years ago

> Another possible metric would be the ratio of trips running vs those "cancelled" by calendar_dates.txt.

👍 If calculated per service day, this would also help catch the common mistake where an agency cancels service for a holiday but forgets or incorrectly configures the replacement service, resulting in no scheduled service (as far as the GTFS dataset shows) on the holiday.
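As an illustration of the per-service-day ratio, here is a rough Python sketch. It is not taken from gtfs-validator; the function name `cancellation_report` and the input shapes are assumptions made for this example. It compares the trips regularly scheduled by calendar.txt on one day with those removed (exception_type 2) and those added as replacement (exception_type 1) by calendar_dates.txt.

```python
from collections import Counter
from datetime import date

# Assumed in-memory stand-ins for the parsed GTFS files:
#   calendar:       {service_id: (start_date, end_date, weekdays)}, Monday first
#   calendar_dates: [(service_id, date, exception_type)]
#   trips:          {trip_id: service_id}


def cancellation_report(calendar, calendar_dates, trips, day):
    """Summarise scheduled, cancelled and replacement trips for one day."""
    trips_per_service = Counter(trips.values())
    regular = {
        sid
        for sid, (start, end, weekdays) in calendar.items()
        if start <= day <= end and weekdays[day.weekday()]
    }
    # Only removals of services that would otherwise run count as cancellations.
    removed = {s for s, d, t in calendar_dates if d == day and t == 2} & regular
    # Only additions of services that would not otherwise run count as replacement.
    added = {s for s, d, t in calendar_dates if d == day and t == 1} - regular

    scheduled = sum(trips_per_service[s] for s in regular)
    cancelled = sum(trips_per_service[s] for s in removed)
    replacement = sum(trips_per_service[s] for s in added)
    return {
        "scheduled": scheduled,
        "cancelled": cancelled,
        "replacement": replacement,
        "running": scheduled - cancelled + replacement,
        "cancelled_ratio": cancelled / scheduled if scheduled else 0.0,
    }
```

A day with `cancelled_ratio == 1.0` and `running == 0` would match the holiday mistake described above: the regular service was removed, but no replacement service was added.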

dancesWithCycles commented 2 years ago

> Another possible metric would be the ratio of trips running vs those "cancelled" by calendar_dates.txt.

Personally, I have in mind a calculation per service day, following the chain trip_id -> service_id -> service interval (start_date to end_date in calendar.txt). I would like to know not just a single day but the number of days on which the ratio of trips with normal service vs. removed service indicates a mistake or low quality in the GTFS data.

Any thoughts on whether this validator is suited to be extended with such a rule?

Cheers!

derhuerst commented 2 years ago

> Personally, I have in mind a calculation per service day, following the chain trip_id -> service_id -> service interval (start_date to end_date in calendar.txt). I would like to know not just a single day but the number of days on which the ratio of trips with normal service vs. removed service indicates a mistake or low quality in the GTFS data.

I don't quite understand what exactly you're describing here. The amount/ratio of running (as in non-"cancelled") trips over the whole (start_date, end_date) period of the service, for each service/trip combination?

dancesWithCycles commented 2 years ago

> > Personally, I have in mind a calculation per service day, following the chain trip_id -> service_id -> service interval (start_date to end_date in calendar.txt). I would like to know not just a single day but the number of days on which the ratio of trips with normal service vs. removed service indicates a mistake or low quality in the GTFS data.
>
> I don't quite understand what exactly you're describing here. The amount/ratio of running (as in non-"cancelled") trips over the whole (start_date, end_date) period of the service, for each service/trip combination?

@derhuerst You are right, the more time I spend on this matter, the more use cases come up.

  1. You could state a maximum 'allowed' number of cancelled days, i.e. days within the service period on which trips are not offered. This might be an interesting investigation.
  2. You could be interested in the number of cancelled days that cause trips to effectively stop running before the end of the service period stated in calendar.txt.

The transit authority I have in mind might be more interested in the latter case. When you know the service period of the current GTFS data ends at the end of next month, you might think you have plenty of time to get hold of a new GTFS data feed.

However, when you learn that the current GTFS data does not provide schedule information for certain trips starting tomorrow (due to exceptions in calendar_dates.txt), you might panic about how to get the missing data today, because the feed no longer corresponds to the official, public schedule. This situation arises from a poorly created GTFS data feed.

Does this example make the use case clearer?
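For the second case, a small Python sketch may help show what "running out of service before end_date" could mean in practice. This is only an illustration under assumptions: the function name `trailing_cancellations` and its inputs are hypothetical, it looks at a single service_id, and it ignores service added via exception_type 1.

```python
from datetime import date, timedelta


def trailing_cancellations(start, end, weekdays, removed_dates):
    """Walk backwards from end_date and find the last day the service runs.

    start, end:    service interval from calendar.txt
    weekdays:      7 flags, Monday first, from calendar.txt
    removed_dates: set of dates with exception_type 2 for this service

    Returns (last_running_day, cancelled_after), where cancelled_after is
    the number of regularly scheduled days removed after the last running
    day. last_running_day is None if the service never runs.
    """
    cancelled_after = 0
    day = end
    while day >= start:
        if weekdays[day.weekday()]:
            if day not in removed_dates:
                return day, cancelled_after
            cancelled_after += 1
        day -= timedelta(days=1)
    return None, cancelled_after
```

A rule could then compare `cancelled_after`, or the gap between `last_running_day` and end_date, against a configurable limit and report services whose effective end lies well before the end_date declared in calendar.txt.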

isabelle-dr commented 1 year ago

Labeling as a "community rule" because we don't have an explicit mention in the spec or best practices. This validator contains a few rules that aren't clearly mentioned in the spec or best practice, because the community sees them as highly valuable (fast travel, for example). We are in favor of modifying the specification first before adding this type of check in the validator, in order to keep both aligned.