InGen is a command-line tool built on top of pandas and great_expectations for performing small-scale data transformations and validations without writing code.
Is your feature request related to a problem? Please describe.
The config file (aka metadata file) drives InGen. It is essential for the success of this project that this file is easy to write and understand. It is also important that the schema of the config file is versioned, so that users don't have to update their configs whenever a breaking change is introduced. Therefore, we should add versioning and schema validation to InGen.
Describe the solution you'd like
The config file is a YAML file, so it can be validated against a JSON Schema. The Python library jsonschema provides a way to do this. The following snippet shows a schema that can validate any InGen config that has sources and interfaces defined. This example schema only validates the types of the outer fields of a config file, but it shows that writing a comprehensive schema is possible. Such a schema will ensure that any errors in the config file are detected early and that the program fails fast when required. Writing the schema will also uncover some of the complexities of the current config design, which can be improved in future versions.
---
title: Interface Generator Metadata Schema
description: This is a sample schema that validates an InGen config
type: object
properties:
  interfaces:
    type: object
    patternProperties:
      ".*":
        type: object
        title: interface
        properties:
          sources:
            title: sources
            description: list of source identifiers
            type: array
          columns:
            title: columns
            description: list of output columns and any formatter applied to them
            type: array
          output:
            title: output
            description: properties of the interface output
            type: object
  sources:
    type: array
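To make the proposal concrete, here is a minimal sketch of how the sample schema could be applied with the jsonschema library. It is an illustration under stated assumptions, not the intended InGen implementation: the `validate_config` helper, the `required` list, and the example config contents (such as the `id` key and the `report` interface) are hypothetical and not part of the current design. The configs are written as Python dicts so the sketch stays self-contained; in practice they would come from parsing the YAML file.

```python
from jsonschema import Draft7Validator

# The sample schema from this issue, written as a Python dict.
# ASSUMPTION: the "required" list is added here for illustration only;
# the sample schema itself checks types and nothing else.
SCHEMA = {
    "title": "Interface Generator Metadata Schema",
    "type": "object",
    "properties": {
        "interfaces": {
            "type": "object",
            "patternProperties": {
                ".*": {
                    "type": "object",
                    "properties": {
                        "sources": {"type": "array"},
                        "columns": {"type": "array"},
                        "output": {"type": "object"},
                    },
                }
            },
        },
        "sources": {"type": "array"},
    },
    "required": ["sources", "interfaces"],
}


def validate_config(config):
    """Return a sorted list of readable error strings (empty if the config is valid).

    Hypothetical helper: collects every violation instead of stopping at the first.
    """
    validator = Draft7Validator(SCHEMA)
    errors = []
    for error in validator.iter_errors(config):
        # absolute_path holds the keys leading to the offending value.
        location = "/".join(str(part) for part in error.absolute_path) or "<root>"
        errors.append(f"{location}: {error.message}")
    return sorted(errors)


# Hypothetical configs; real ones would be loaded from the YAML config file.
good_config = {
    "sources": [{"id": "orders"}],
    "interfaces": {"report": {"sources": ["orders"], "columns": [], "output": {}}},
}
bad_config = {
    "sources": "orders",  # should be a list
    "interfaces": {"report": {"columns": "name"}},  # columns should be a list
}

print(validate_config(good_config))
for line in validate_config(bad_config):
    print(line)
```

Collecting all errors with `iter_errors` rather than raising on the first one would let InGen report every problem in a config in a single run, which fits the fail-fast goal while still being friendly to users fixing several mistakes at once.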
Describe alternatives you've considered
No alternatives considered yet.