blackrock / ingen

InGen is a command line tool written on top of pandas and great_expectations to perform small scale data transformations and validations without writing code.
Apache License 2.0

Config file schema validation #10

Open shpiyu opened 11 months ago

shpiyu commented 11 months ago

Is your feature request related to a problem? Please describe.

The config file (aka metadata file) drives InGen. It is essential for the success of this project that this file is easy to write and understand. It is also important that the schema of the config file is versioned, so that users don't have to update their configs whenever a breaking change is introduced. Therefore, we should add versioning and schema validation to InGen.
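One way the versioning idea could work is to keep one schema per config version and pick the schema from a version field in the config. The following is only a minimal sketch: the `schema_version` key, the `SCHEMAS_BY_VERSION` registry, the placeholder schemas, and the default of "1" for configs without a version field are all assumptions made for illustration, not part of InGen today.

```python
# Hypothetical registry: one JSON Schema (as a dict) per config schema version.
# The schemas here are placeholders, not real InGen schemas.
SCHEMAS_BY_VERSION = {
    "1": {"type": "object", "required": ["sources", "interfaces"]},
    "2": {"type": "object", "required": ["schema_version", "sources", "interfaces"]},
}

# Assumption: configs written before versioning existed carry no version field
# and are treated as version "1".
DEFAULT_VERSION = "1"


def select_schema(config: dict) -> dict:
    """Return the schema matching the config's declared version."""
    version = str(config.get("schema_version", DEFAULT_VERSION))
    try:
        return SCHEMAS_BY_VERSION[version]
    except KeyError:
        supported = ", ".join(sorted(SCHEMAS_BY_VERSION))
        raise ValueError(
            f"unsupported schema_version {version!r}; supported: {supported}"
        ) from None
```

With this layout, a breaking change would add a new registry entry instead of altering an existing one, so configs pinned to an older version keep validating against the schema they were written for.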

Describe the solution you'd like

The config file is a YAML file, which can be validated using JSON Schema. A Python library named jsonschema provides a way to do so. The following snippet shows a schema that can validate any InGen config that has sources and interfaces defined. This example schema only validates the types of the required outer fields of a config file, but it shows that writing a comprehensive schema is possible. Such a schema will ensure that any errors in the config file are detected early and that the program fails fast if required. Writing the schema will also uncover some of the complexities of the current config design, which can be improved in future versions.

---
title: Interface Generator Metadata Schema
description: This is a sample schema that validates an InGen config
type: object
properties:
  interfaces:
    type: object
    patternProperties:
      ".*":
        type: object
        title: interface
        properties:
          sources:
            title: sources
            description: list of source identifiers
            type: array
          columns:
            title: columns
            description: list of output columns and any formatter applied to them
            type: array
          output:
            title: output
            description: properties of the interface output
            type: object
  sources:
    type: array
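A minimal sketch of how the schema above might be applied with the jsonschema library, assuming PyYAML is used to parse both the schema and the config. The function name `validate_config`, the choice of Draft 7, and the sample configs (including field names such as `id` and `positions_report`) are illustrative assumptions, not InGen's actual implementation or config layout.

```python
import yaml  # PyYAML
from jsonschema import Draft7Validator

# The sample schema from this issue, with titles and descriptions omitted.
SCHEMA_YAML = """
type: object
properties:
  interfaces:
    type: object
    patternProperties:
      ".*":
        type: object
        properties:
          sources:
            type: array
          columns:
            type: array
          output:
            type: object
  sources:
    type: array
"""

# Hypothetical configs, made up for illustration only.
VALID_CONFIG = """
sources:
  - id: positions
interfaces:
  positions_report:
    sources: [positions]
    columns: []
    output: {type: file}
"""

INVALID_CONFIG = """
sources: positions          # should be a list
interfaces:
  positions_report:
    columns: name           # should be a list
"""


def validate_config(config_text, schema_text=SCHEMA_YAML):
    """Return a list of 'path: message' strings, empty if the config is valid."""
    schema = yaml.safe_load(schema_text)
    config = yaml.safe_load(config_text)
    validator = Draft7Validator(schema)
    # iter_errors reports every violation instead of stopping at the first one.
    errors = sorted(
        validator.iter_errors(config),
        key=lambda e: [str(p) for p in e.absolute_path],
    )
    return [
        "/".join(str(p) for p in e.absolute_path) + ": " + e.message
        for e in errors
    ]


if __name__ == "__main__":
    for problem in validate_config(INVALID_CONFIG):
        print(problem)
```

Collecting all errors before exiting lets a user fix the whole config in one pass. A fail-fast variant could call `jsonschema.validate(config, schema)` instead, which raises `ValidationError` on the first problem found.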

Describe alternatives you've considered

No alternatives considered yet.