MobilityData / gbfs-validator

The canonical GBFS validator. Maintained by the GBFS community, facilitated by MobilityData.
https://gbfs-validator.mobilitydata.org/
Apache License 2.0
18 stars 12 forks source link

Additional errors #106

Closed tdelmas closed 11 months ago

tdelmas commented 1 year ago

Add non-jsonSchema validations

Checked errors and warnings are listed in README.md

Display a condensed summary of errors and warnings grouped by types and paths

Fix #92

Fix #93 by displaying a warning

Exemple: https://deploy-preview-106--gbfs-validator.netlify.app/validator?url=https%3A%2F%2Fdubai.publicbikesystem.net%2Fcustomer%2Fgbfs%2Fv2%2Fgbfs.json

image

netlify[bot] commented 1 year ago

Deploy Preview for gbfs-validator ready!

Name Link
Latest commit eb20162eb6377a2b77bf9dccfdbecdc52b6dc87b
Latest deploy log https://app.netlify.com/sites/gbfs-validator/deploys/64f835537bdeb80007c92200
Deploy Preview https://deploy-preview-106--gbfs-validator.netlify.app/
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

richfab commented 1 year ago

Happy to see the wording nonSchema in the code 👍 Due to limited resources, we will only be able to take a closer look in a few weeks. Please let us know if we are blocking you in any way. Thank you for your patience 😌

richfab commented 1 year ago

Hi @PierrickP and @tdelmas, Thank you very much for this very valuable contribution 🙇

I wonder if we could make the interface more consistent for users when there are both errors and warnings. What do you think about these UI suggestions? (the "After" screenshots are collages, not implemented)

cc @josee-sabourin @isabelle-dr

State Before After (suggestion)
summary collapsed image image
summary expanded image image
isabelle-dr commented 1 year ago

For what it's worth, I am generally in favor of having the errors (schema and non-schema) in one place, because from the user perspective, it's a spec violation, regardless of if generated by the JSON Schema or the custom code.

hbruch commented 1 year ago

Great extension. I just found this PR searching for a check on duplicate_vehicle_type_id, which I encounter currently with this feed, and which could be a nice addition to the nonSchema tests.

PierrickP commented 1 year ago

@richfab @isabelle-dr i regrouped errors together and update the design

@hbruch I've added a new rule (duplicate_vehicle_type_id) to check duplicated vehicle_type_id on vehicle_types.json

tdelmas commented 1 year ago

Thank you all for your detailed review.

I've pushed a commit that apply most suggestions of your reviews, and marked them as "resolved" for readability.

Comments not marked as resolved are not fixed, but as they require more tests, I strongly feel they are more suited in a follow-up pull request.

richfab commented 1 year ago

Thank you @tdelmas for taking most suggestions into consideration. This is amazing!

I will review your changes as soon as possible.

tdelmas commented 1 year ago

@testower thank you for your review.

I'll try to answer your concerns.

None of the new rules are implemented using partials / schema patching, although after skimming through them, I think a lot of them could well have been (?) - is this a strategic decision?

Yes. Logic errors are much more difficult to write using schemas, and even where they are, it leads to some problems and bugs. For example, if two rules needs to patch the same part of a schema, the results will be unpredictable. Or if the original schema changes. Also, most of them are cross-files, which least to reading a file A, then creating the schema for B and vice versa.

What about the rules that are already implemented that way, are they going to be left like this or ported to the imperative style?

I feel like some should, in the future.

The benefit of using the "old" convention is that the output automatically conforms to the regular schema validation error outputs, whereas in the imperative style, they have to be constructed manually. Especially since this project has no type-checking, this is prone to errors.

I agree, this could be improved in a future PR.

There is a breaking change in the response format of the validator api - which prompts me to raise the question of standardizing this format for portability purposes. If validators respond the same way, frontends and validators can be used interchangeably, and validation reports can be consumed later by any "conformant" UI.

That's a good goal.

Last, the most serious issue for me, this PR introduces a number of warnings based on assumptions about a number of things that I don't think a generic / canonical GBFS validator should be concerned with. Business rules that depend on operational context should be left out of the validator.

Warning are just warnings: here to catch the attention, because it caught something that may be an error. I insist on the may.

I strongly feel they are important because they help to catch subtle mistakes, such as prices in cents instead of base units, or meters vs kilometers, because some feeds had those problems.

If this validator was a library, one could imagine a feature that allowed library users to provide their own additional rulesets specific to their operational context - but as it stands now, this is just speculative.

A separation could be done, it's an interesting idea for a future PR

testower commented 1 year ago

Let's take a step back to look at first principles. The GBFS validator is, according to the README

A General Bikeshare Feed Specification dataset validator

furthermore

The Canonical GBFS Validator is a tool to check the conformity of a GBFS feed against the official specification.

The word "canonical" is instructive if taken to mean: "according to the canon", the canon being the "the standard, rule or primary source that is accepted as authoritative"[1]. The "canon" in this context is obviously the GBFS specification.

I will interpret this to mean that the validator is intended to be isomorphic[2] to the specification it validates against. Any departure from that principle should be seriously discussed.

Let me elaborate on why I think it's a bad idea to introduce validation rules that break this isomorphism. My argument is that any validation rule that doesn't have a corollary in the specification, is a business rule only relevant for a specific operational context.

So while it may be useful to get a warning if a car is reported to have room for more than 5 passengers, or a moped has a max range of more than 100km, or a car rental costs more than 50 USD/EUR for an hour (?) - just to name a few examples - these are useful in a specific operational context (perhaps yours?), but it's hard to argue that they are universal truisms about the domain over which the GBFS specification is meant to govern. If they were, they would have been encoded in the specification already.

Now I'm not saying that such rules don't have merit, quite the contrary. But I'm saying that they belong in the specific systems that operate in the specific contexts that give rise to their usefulness. Anywhere else, they create clutter and noise.

For the sake of argument, I shall entertain the idea that these rules should be in the validator. Now, what process or rationale should determine what these rules and their threshold values should be? Is it consensus-based, or democratic? The greatest common denominator of all producers? Who shall bear the burden of maintaining these over time?

As an exercise, imagine that ever more currencies are added to the "typical cost" structure. How is it possible to keep this structure updated and relevant as prices change and currencies inflate and deflate.

I believe that the best candidates for taking that responsibility are those who operate in the context where those rules, thresholds and values are useful and relevant. Offloading that to the canonical validator is a recipe for future decay and ultimately complete irrelevance of the validator as a general-purpose tool to validate against the specification.

If I were to pretend to be positive about accepting these new rules, I would still expect the following preconditions to be met:

Even with these preconditions met, I'm not enthusiastic about them, since they still introduce a maintenance burden. I.e. someone must take responsibility for keeping them up-to-date and relevant, and someone must make sure that the total set of such non-specification rules is kept at a reasonable level. I'm particularly skeptical about the "typical cost" check. For one because it equates USD with EUR which could quickly change - but more importantly because it opens the possibility to add checks for any number of currencies, which, needless to say, would very quickly grow to unmanageable scales. Barring additional currencies from being added to avoid this problem, just amounts to discrimination, so I just don't see a way to accept it in the first place.

Related to the point I made about the output from the validator being standardized, I'm not sure how this fits in. The taxonomy is very unclear and ambiguous about what constitutes a validation error based on the specification and what is a result of an opinionated "sanity check".

All in all, I think this marks a very significant shift in the goal and purpose of the validator that we should not take on lightly. I ask the community to think very carefully about accepting these kinds of changes.

[1] https://en.wikipedia.org/wiki/Canonical [2] Isomorphic in the set-theoretic sense: The set of validation rules should be identical to the set of rules in the specification.

tdelmas commented 1 year ago

So while it may be useful to get a warning if a car is reported to have room for more than 5 passengers, or a moped has a max range of more than 100km, or a car rental costs more than 50 USD/EUR for an hour (?) - just to name a few examples - these are useful in a specific operational context (perhaps yours?)

As I said, those rules are not here to detect "high" or "low" prices, but cases when the producer made a conversion error (between meters vs kilometers, $ vs cents, etc.). But yes, it's not a possible to detect that perfectly.

@testower Is there a list of rules we can remove from this PR, so it could be acceptable for you?

JohanEntur commented 1 year ago

@tdelmas The issue is that there is a severe break in principle between a format-driven validation, and inclusion of subjective detection of possible user errors. It's not about what is acceptable for @testower personally, it's about having a whitelabel validator that works equally well for any user. That is the main point here.

Validation rules (the warnings), like these, do have a role to play but they would work better as a separate validator, perhaps as a template that each user can configure for local use. That way, each user can define their own limitations in weight, size, doors, and prices according to their own needs - but this still has nothing to do with the GBFS standard.

Our intention is not to be troublesome but to ensure that certain principles are maintained.

testower commented 1 year ago

@testower Is there a list of rules we can remove from this PR, so it could be acceptable for you?

Yes, all the warnings.

richfab commented 1 year ago

Hello everyone,

Thank you all for sharing your views on the validator with such enthusiasm and quality arguments. It’s very exciting to see that this tool is getting the attention it deserves from the community.

Thank you @testower for reminding the purpose of the canonical GBFS validator, which is to validate the conformity of a GBFS feed with the specification. MobilityData will make sure to keep this at the forefront of our decisions.

To ensure isomorphism between the specification and the validator, MobilityData recommends continuing to include in the validator only the rules that are in the specification.

The data sanity checks proposed by @tdelmas is a very good idea and could have a place in a separate validator configurable by users, as suggested by @JohanEntur.

Here is MobilityData’s proposal:

  1. The validator should return an error for these rules defined as mandatory in the specification:
Rule Proposal GBFS spec extract
last_updated_future Keep in this PR "Indicates the last time data in the feed was updated."
additional_properties Keep in this PR "Field names of extensions SHOULD be prefixed with an underscore ( _ ) character”
missing_translation Keep in this PR "Translations MUST be provided for each supported language for all translatable fields of type Array"
unknown_language Keep in this PR "Must match one of the values specified by the languages field in system_information.json.”
missing_default_pricing_plan_id Keep in this PR "REQUIRED if system_pricing_plans.json is defined.”
missing_station_information Keep in this PR "Any station that is represented in station_status.json MUST have a corresponding entry in station_information.json.”
missing_station_status Keep in this PR "Any station that is represented in station_information.json MUST have a corresponding entry in station_status.json.”
duplicate_station_id Keep in this PR "Array that contains one object per station as defined below.”
default_reserve_time Keep in this PR "REQUIRED if reservation_price_per_min or reservation_price_flat_rate are defined.”
unclosed_polygon Keep in this PR "The first and last positions are equivalent, and they MUST contain identical values (https://datatracker.ietf.org/doc/html/rfc7946#section-3.1.6)”
duplicate_bike_id Keep in this PR "Array that contains one object per vehicle”
duplicate_vehicle_id Keep in this PR "Array that contains one object per vehicle”
invalid_vehicle_type_id Keep in this PR "The vehicle_type_id of this vehicle, as described in vehicle_types.json.”
duplicate_vehicle_type_id Keep in this PR "Array that contains one object per vehicle type”
unknown_plan_id, missing_system_pricing_plans and invalid_pricing_plan_id Combine the 3 alerts as they serve the same purpose "A plan_id, as defined in system_pricing_plans.json”
  1. The validator should return a warning for these recommendations defined in the specification:
Rule Proposal GBFS spec extract
last_updated_outdated Return a warning instead of an error "in no case should it be more than 5 minutes out-of-date”
num_vehicles_available_incorrect Return a warning instead of an error "The total number of vehicles from each of these objects SHOULD add up to match the value specified in the num_vehicles_available field.”
num_docks_available_incorrect Return a warning instead of an error "The total number of docks from each of these objects SHOULD add up to match the value specified in the num_docks_available field.”
  1. The validator should not return any error or warning for something that is not currently part of the specification: duplicate_translation, duplicate_vehicle_accessory, multiple_door_counts, num_docks_available_high, num_vehicles_available_high, station_capacity_too_low, unexpected_rider_capacity, unexpected_cargo_volume_capacity, unexpected_cargo_load_capacity, unexpected_propulsion_type, unexpectedly_low_range_meters, unexpectedly_high_range_meters, unexpected_vehicle_accessory, unexpectedly_low_wheel_count, unexpectedly_high_wheel_count, low_cost, high_cost. We suggest creating a separate issue to keep a list of potential errors/warnings that would be outside of the specification where others can add so that we have a good basis when/if we start to add data sanity checks to the validator.

Please let us know what you all think. Thank you!

testower commented 1 year ago

@richfab I think that's a very good solution and it answers my concerns 🥇 if these changes are made I think we can move on to the more technical aspects of the review

richfab commented 1 year ago

Hi @tdelmas, please let me know if you have any questions or concerns about the proposal. Thank you!

tdelmas commented 1 year ago

Thank you for the feedbacks and your ping.

richfab commented 1 year ago

Thank you @tdelmas and @testower. Since consensus was reached, I am planning to work on removing the code related to data sanity rules from this PR next week. The code will not be lost as it will remain in the commits.

The discussion about data sanity rules was moved to https://github.com/MobilityData/gbfs-validator/issues/138.

Thank you!

richfab commented 1 year ago

if these changes are made I think we can move on to the more technical aspects of the review

@testower @tdelmas A technical review would be very welcome in https://github.com/MobilityData/gbfs-validator/pull/141, if you have the bandwidth.

Thank you in advance for making the GBFS validator even more useful to the community 🙌