MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0

Ability to dump validator notice schemas to json or yaml #899

Closed · machow closed this issue 3 years ago

machow commented 3 years ago

Feature request

Is your feature request related to a problem? Please describe.

Currently, the validator uses clear schemas to structure results internally. However, because it dumps results to a semi-structured JSON file, they're easy to read but difficult to load into structured data stores (like SQL databases).

Proposed solution

Add an option to the CLI to dump a JSON file of notice schemas. Maybe something like this?

{
    "invalidPhoneNumber": [
        {
            "name": "csvRowNumber",
            "type": "integer"
        },
        ...
    ],
    "suggestedExpirationDate": [
        {
            "name":  "localDate",
            "type": "record",
            "fields": [
                {"name": "year", "type": "integer"},
                ...
            ]
        }
    ],
    ...
}

Where the pattern here is...

{
    "CODE_NAME": [
        SCHEMA_ENTRY
    ],
}

Where SCHEMA_ENTRY is just copying BigQuery's schema format...

# note: "fields" is optional; it describes nested records
{"name": string, "type": string, "fields": [SCHEMA_ENTRY, ...]}

Describe alternatives you've considered

Currently, I'm just inferring the schema from a bunch of report data!
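
For context, that inference could look roughly like the sketch below; the report layout assumed here ("notices", "code", "sampleNotices") is a guess at the report format, not a documented contract:

# Rough sketch of inferring notice schemas from report data;
# the report layout assumed here is illustrative.
import json
from collections import defaultdict

PYTHON_TO_SCHEMA = {bool: "boolean", int: "integer", float: "float", str: "string"}

def infer_schema(report_path: str) -> dict:
    """Map each notice code to the union of field names/types seen."""
    schema: dict = defaultdict(dict)
    with open(report_path) as f:
        report = json.load(f)
    for notice in report.get("notices", []):            # assumed layout
        code = notice.get("code", "unknown")
        for sample in notice.get("sampleNotices", []):  # assumed layout
            for name, value in sample.items():
                schema[code][name] = PYTHON_TO_SCHEMA.get(type(value), "record")
    return dict(schema)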

lionel-nj commented 3 years ago

Hi @machow! Thanks for the suggestion. Do you see that YAML being generated separately for each validation report (and system error report), or once and for all, covering all notices?

machow commented 3 years ago

Hey--sorry for the wait. I was imagining it being generated once, covering every possible notice. That way, people could dump the schema once when setting up a database, or dump it as a reference.

lionel-nj commented 3 years ago

Hi @machow! We thought we could build such a schema using SQL (which is more generic) instead of BigQuery; would that meet your needs?

cc @barbeau

machow commented 3 years ago

@lionel-nj I am not sure what that means--can you say more about what building a schema using SQL would look like?

For example, this is what a schema using SQL looks like...

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
    ...
);

These statements tend to be database/dialect dependent (especially the types). You could instead dump the schema to JSON using the type names used by cross-dialect tools (e.g. SQLAlchemy); that would let people construct these schemas on any database.
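
To illustrate that last point, here's a sketch that uses SQLAlchemy's generic types to render one notice's (flat) schema as DDL for different dialects; the GENERIC_TYPES mapping is an assumption, not a standard:

# Sketch: render flat SCHEMA_ENTRY dicts as CREATE TABLE statements for
# any SQLAlchemy dialect; nested records are omitted for brevity.
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql, sqlite
from sqlalchemy.schema import CreateTable

GENERIC_TYPES = {"integer": Integer, "string": String}

def notice_table(code: str, entries: list) -> Table:
    """Build a Table from flat SCHEMA_ENTRY dicts."""
    columns = [Column(e["name"], GENERIC_TYPES[e["type"]]()) for e in entries]
    return Table(code, MetaData(), *columns)

table = notice_table("invalidPhoneNumber",
                     [{"name": "csvRowNumber", "type": "integer"}])
print(CreateTable(table).compile(dialect=postgresql.dialect()))
print(CreateTable(table).compile(dialect=sqlite.dialect()))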

barbeau commented 3 years ago

@machow Yes, our internal discussions were about trying to generalize a solution to be usable beyond just BigQuery. We're open to suggestions for how exactly it should be implemented to be most useful to users across many different database systems.

machow commented 3 years ago

Ah, that makes sense! Here are some more thoughts on the risks of a generic SQL-focused solution, and on what seem like the core factors here. The one thing I'd emphasize is that my proposal bootstraps off BigQuery's solution for representing these kinds of schemas, but IMO the value comes from having any reasonable JSON representation of the schema!

I think a simple JSON file will be most useful for these reasons:

Key questions for schema:

Example schema, highlighting the need to represent sub-records (this is in YAML):

  - name: currentDate
    type: STRING
  - name: feedEndDate
    type: STRING
  - name: suggestedExpirationDate
    type: RECORD
    fields:
      - name: localDate
        type: RECORD
        fields:
          - name: year
            type: INTEGER
          - name: month
            type: INTEGER
          - name: day
            type: INTEGER
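
One way a consumer could handle those sub-records (a sketch, under the assumption that dotted column names are acceptable downstream) is to flatten nested RECORD fields before loading:

# Sketch: flatten the nested RECORD schema above into flat columns,
# e.g. suggestedExpirationDate.localDate.year -> INTEGER.
schema = [
    {"name": "currentDate", "type": "STRING"},
    {"name": "feedEndDate", "type": "STRING"},
    {"name": "suggestedExpirationDate", "type": "RECORD", "fields": [
        {"name": "localDate", "type": "RECORD", "fields": [
            {"name": "year", "type": "INTEGER"},
            {"name": "month", "type": "INTEGER"},
            {"name": "day", "type": "INTEGER"},
        ]},
    ]},
]

def flatten(entries, prefix=""):
    """Turn nested RECORD fields into dotted column names."""
    for entry in entries:
        path = prefix + entry["name"]
        if entry["type"] == "RECORD":
            yield from flatten(entry["fields"], prefix=path + ".")
        else:
            yield path, entry["type"]

print(dict(flatten(schema)))
# {'currentDate': 'STRING', 'feedEndDate': 'STRING',
#  'suggestedExpirationDate.localDate.year': 'INTEGER', ...}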