MobilityData / gtfs-validator

Canonical GTFS Validator project for schedule (static) files.
https://gtfs-validator.mobilitydata.org/
Apache License 2.0

Ability to dump validator notice schemas to json or yaml #899

Closed · machow closed this issue 3 years ago

machow commented 3 years ago

Feature request

Is your feature request related to a problem? Please describe.

Currently, the validator uses clear schemas to structure results internally. However, because it dumps results to a semi-structured JSON file, they're easy to read but difficult to load into structured data stores (like SQL databases).

Proposed solution

Add an option to the CLI to dump a JSON file of notice schemas. Maybe something like this?

{
    "invalidPhoneNumber": [
        {
            "name": "csvRowNumber",
            "type": "integer"
        },
        ...
    ],
    "suggestedExpirationDate": [
        {
            "name":  "localDate",
            "type": "record",
            "fields": [
                {"name": "year", "type": "integer"},
                ...
            ]
        }
    ],
    ...
}

Where the pattern here is...

{
    "CODE_NAME": [
        SCHEMA_ENTRY
    ],
}

Where SCHEMA_ENTRY is just copying BigQuery's schema format...

# note: "fields" is optional; it describes nested records
{"name": string, "type": string, "fields": [SCHEMA_ENTRY, ...]}

Describe alternatives you've considered

Currently, I'm just inferring the schema from a bunch of report data!
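
For context, that inference could look roughly like the sketch below; the report layout assumed here ("notices", "code", "sampleNotices") is a guess at the report format, not a documented contract:

# Rough sketch of inferring notice schemas from report data;
# the report layout assumed here is illustrative.
import json
from collections import defaultdict

PYTHON_TO_SCHEMA = {bool: "boolean", int: "integer", float: "float", str: "string"}

def infer_schema(report_path: str) -> dict:
    """Map each notice code to the union of field names/types seen."""
    schema: dict = defaultdict(dict)
    with open(report_path) as f:
        report = json.load(f)
    for notice in report.get("notices", []):            # assumed layout
        code = notice.get("code", "unknown")
        for sample in notice.get("sampleNotices", []):  # assumed layout
            for name, value in sample.items():
                schema[code][name] = PYTHON_TO_SCHEMA.get(type(value), "record")
    return dict(schema)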

lionel-nj commented 3 years ago

Hi @machow! Thanks for the suggestion. Do you see that YAML being generated separately for each validation report (and system error report), or once and for all, covering all notices?

machow commented 3 years ago

Hey--sorry for the wait. I was imagining it being generated once, covering every possible notice. That way, people could dump the schema once when setting up a database, or dump it as a reference.

lionel-nj commented 3 years ago

Hi @machow! We thought we could build such a schema using SQL (which is more generic) instead of BigQuery; would that meet your needs?

cc @barbeau

machow commented 3 years ago

@lionel-nj I am not sure what that means--can you say more about what building a schema using SQL would look like?

For example, this is what a schema using SQL looks like...

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
    ...
);

These statements tend to be database/dialect dependent (especially the types). You could instead dump the schema to JSON using the type names used by cross-dialect tools (e.g. SQLAlchemy); that would let people construct these schemas on any database.
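
To illustrate that last point, here's a sketch that uses SQLAlchemy's generic types to render one notice's (flat) schema as DDL for different dialects; the GENERIC_TYPES mapping is an assumption, not a standard:

# Sketch: render flat SCHEMA_ENTRY dicts as CREATE TABLE statements for
# any SQLAlchemy dialect; nested records are omitted for brevity.
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql, sqlite
from sqlalchemy.schema import CreateTable

GENERIC_TYPES = {"integer": Integer, "string": String}

def notice_table(code: str, entries: list) -> Table:
    """Build a Table from flat SCHEMA_ENTRY dicts."""
    columns = [Column(e["name"], GENERIC_TYPES[e["type"]]()) for e in entries]
    return Table(code, MetaData(), *columns)

table = notice_table("invalidPhoneNumber",
                     [{"name": "csvRowNumber", "type": "integer"}])
print(CreateTable(table).compile(dialect=postgresql.dialect()))
print(CreateTable(table).compile(dialect=sqlite.dialect()))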

barbeau commented 3 years ago

@machow Yes, our internal discussions were about trying to generalize a solution to be usable beyond just BigQuery. We're open to suggestions for how exactly it should be implemented to be most useful to users across many different database systems.

machow commented 3 years ago

Ah, that makes sense! Here are some more thoughts on the risks of a generic SQL-focused solution, and on what seem like the core factors here. The one thing I'd emphasize is that my proposal bootstraps off BigQuery's solution for representing these kinds of schemas, but IMO the value comes from having any reasonable JSON representation of the schema!

I think a simple JSON file will be most useful for these reasons:

Key questions for schema:

Example schema, highlighting the need to represent sub-records (this is in YAML):

  - name: currentDate
    type: STRING
  - name: feedEndDate
    type: STRING
  - name: suggestedExpirationDate
    type: RECORD
    fields:
      - name: localDate
        type: RECORD
        fields:
          - name: year
            type: INTEGER
          - name: month
            type: INTEGER
          - name: day
            type: INTEGER
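
One way a consumer could handle those sub-records (a sketch, under the assumption that dotted column names are acceptable downstream) is to flatten nested RECORD fields before loading:

# Sketch: flatten the nested RECORD schema above into flat columns,
# e.g. suggestedExpirationDate.localDate.year -> INTEGER.
schema = [
    {"name": "currentDate", "type": "STRING"},
    {"name": "feedEndDate", "type": "STRING"},
    {"name": "suggestedExpirationDate", "type": "RECORD", "fields": [
        {"name": "localDate", "type": "RECORD", "fields": [
            {"name": "year", "type": "INTEGER"},
            {"name": "month", "type": "INTEGER"},
            {"name": "day", "type": "INTEGER"},
        ]},
    ]},
]

def flatten(entries, prefix=""):
    """Turn nested RECORD fields into dotted column names."""
    for entry in entries:
        path = prefix + entry["name"]
        if entry["type"] == "RECORD":
            yield from flatten(entry["fields"], prefix=path + ".")
        else:
            yield path, entry["type"]

print(dict(flatten(schema)))
# {'currentDate': 'STRING', 'feedEndDate': 'STRING',
#  'suggestedExpirationDate.localDate.year': 'INTEGER', ...}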