OSeMOSYS / otoole

OSeMOSYS Tools for Energy
https://otoole.readthedocs.io
MIT License

User configuration validation #134

Closed trevorb1 closed 1 year ago

trevorb1 commented 1 year ago

DRAFT PR

I spent a little bit of time today exploring how we could validate the config file (issue #133). Originally, I was going to try to use frictionless, since that's what you implemented the datapackage parsing with. However, I'm not sure if I was just missing something, misunderstanding how to use frictionless, or if frictionless doesn't really support this kind of validation, but I couldn't get it working and explored other options. This led me to pydantic, which seems to be exactly what we are looking for!

I implemented a super rough and dirty example of how we could use pydantic to validate the config.yaml file. This includes checking for:

Not all the functionality is implemented right now, since I was just testing it out. But it seems to be pretty flexible in terms of how to validate and how to display errors to the user. Moreover, after validating we can write out the config file with any modifications we make (for example, if a user does not capitalize the name of a set, we can automatically fix it).
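
For illustration, a minimal sketch of that kind of auto-fix, using a pydantic v1-style validator (the model and field names here are hypothetical):

from pydantic import BaseModel, validator

class UserDefinedSet(BaseModel):
    # hypothetical model for illustration; real field names may differ
    name: str
    dtype: str

    @validator("name")
    def capitalize_name(cls, value):
        # quietly normalise a lower-case set name instead of rejecting it
        return value.upper()

print(UserDefinedSet(name="region", dtype="str").name)  # REGION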

QUESTIONS

Do you think using pydantic is a good method to move forward with for validation? If so, I will clean up the code, add logic to all the checks, add other checks where needed, and integrate it into otoole.

Or if I just completely missed where this type of validation can be done with frictionless, I'm totally happy to pivot and explore that option further!

Or if there is a different way to do this that you had in mind, I would be super interested in discussing that!

willu47 commented 1 year ago

Hi @trevorb1 - you have been busy!

It is always worth weighing up whether to add a new dependency to a Python library, checking to see if the dependency is under active development and what its dependencies are. It is easy to slide into "dependency hell" where it becomes very difficult to manage the combination of packages that are compatible with each other and the releases of Python.

I think this is a reasonable approach, but there might be a simpler way. For example, you can validate YAML using a JSON schema. The advantage of these approaches is that you get ValidationError messages built in and don't have to write custom methods for each datatype. However, you may be restricted in the extent to which you can validate more complex dependencies.
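
As an illustration of that route, here is a minimal sketch (not otoole code) using the jsonschema and PyYAML packages; the schema is a toy stand-in for whatever otoole would actually enforce:

import yaml
from jsonschema import validate, ValidationError

# toy schema: every top-level entry must carry a dtype and a type
schema = {
    "type": "object",
    "additionalProperties": {
        "type": "object",
        "properties": {
            "dtype": {"enum": ["str", "int", "float"]},
            "type": {"enum": ["set", "param", "result"]},
        },
        "required": ["dtype", "type"],
    },
}

# "text" is not a legal dtype, so validation fails
config = yaml.safe_load("SET:\n  dtype: text\n  type: set\n")

try:
    validate(instance=config, schema=schema)
except ValidationError as error:
    print(error.message)  # the readable message comes built in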

So there is a link, then, between the original datapackage.json files used for the Frictionless Data implementation of datapackages (which I think we should remove for v1.0) and using them to validate the YAML config.

trevorb1 commented 1 year ago

Thanks @willu47! I didn't realize we could validate YAML using a JSON schema; I'll check that out!

Side note, pydantic also has a standard ValidationError you can raise. The tutorial I followed showed how to implement custom exceptions after that, so that's what I ended up pushing, haha.
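
For context, a sketch of that pattern, borrowing the OtooleConfigFileError name that shows up later in this thread (the model itself is hypothetical):

from pydantic import BaseModel, ValidationError

class OtooleConfigFileError(ValueError):
    """Custom exception wrapping pydantic's validation errors."""

class UserDefinedSet(BaseModel):
    name: str
    dtype: str

try:
    UserDefinedSet(name="SET")  # "dtype" is missing, so pydantic raises
except ValidationError as ex:
    # collapse pydantic's error list into a single custom message
    messages = "; ".join(err["msg"] for err in ex.errors())
    raise OtooleConfigFileError(messages) from ex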

trevorb1 commented 1 year ago

Hi @willu47!

I took some time today to try to set up a JSON schema to validate the config, BUT I think I may have overcomplicated it or just misunderstood your previous comment, haha. So I just want to run what I have done so far by you to get a second opinion.

In my latest commit, I added logic to the validate_config.py file to create a JSON schema from the datapackage.json, which I can then use to validate the config.yaml file. I couldn't use the raw datapackage.json as the schema, so that's what the first data-processing steps are for. Moreover, we use the type keyword in the config.yaml to denote either a param, result, or set; however, type is a reserved keyword in a JSON schema, so there is some processing to update that as well.
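
A rough sketch of that type-keyword processing (the helper name here is hypothetical; the real logic lives in validate_config.py):

def rename_reserved_keys(entry: dict) -> dict:
    # "type" is a JSON-schema keyword, so rename the config's own
    # type field (set/param/result) before building the schema
    return {("param" if key == "type" else key): value
            for key, value in entry.items()}

print(rename_reserved_keys({"type": "set", "dtype": "str"}))
# {'param': 'set', 'dtype': 'str'}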

Before I go ahead and continue down this path of setting up a JSON schema from the datapackage, I do have a couple of questions:

Thanks so much in advance!

Below is a sample of what the schema I am extracting from the datapackage.json looks like:

{
   "type":"object",
   "properties":{
      "MODE_OF_OPERATION":{
         "type":"object",
         "properties":{
            "default":{
               "type":"integer"
            },
            "param":{
               "anyof":[
                  {
                     "pattern":{
                        "enum":[
                           "set"
                        ]
                     }
                  }
               ]
            }
         },
         "required":[
            "param"
         ]
      },
      "AnnualExogenousEmission":{
         "type":"object",
         "properties":{
            "default":{
               "type":"number"
            },
            "param":{
               "anyof":[
                  {
                     "pattern":{
                        "enum":[
                           "param",
                           "result"
                        ]
                     }
                  }
               ]
            },
            "indices":{
               "type":"array",
               "items":{
                  "enum":[
                     "REGION",
                     "EMISSION",
                     "YEAR"
                  ]
               }
            }
         },
         "required":[
            "param"
         ]
      },
willu47 commented 1 year ago

Hi @trevorb1 - I think we're going a bit off target here. The objective is to validate the user-created YAML file, specifically the structure, format and entries for sets, parameters and results. While there is a link to the datapackage.json, I don't think this should be used as the schema for the YAML file.

The user should be allowed the freedom to define new parameters, sets and results as they customise the osemosys model file.

I suggest taking a test-driven development approach to this problem. Create a set of extremely simple YAML examples of legal and illegal config files.

Here's the simplest example I can think of involving one parameter, one set and one result:

TestParam:
    indices: [SET]
    type: param
    dtype: float
    default: 0
SET:
    dtype: string
    type: set
TestResult:
    indices: [SET]
    type: result
    dtype: float
    default: 0
    calculated: True

Now, consider which of these elements is optional, which should be enforced etc. Then identify the technology that will work to provide warnings for that. Then write the tests, then the code...

For example, create a config.yaml fixture and corresponding test with an error:

TestParamVeryLongNameWhichIsLongerThan31Characters:
    indices: [SET]
    type: param
    dtype: float
    default: 0
SET:
    dtype: string
    type: set
TestResult:
    indices: [SET]
    type: result
    dtype: float
    default: 0
    calculated: True

And another:

SET:
    dtype: string
    fieldnotexpected: True
    type: set

And another:

TestParamWithNoSetDefined:
    indices: [SET]
    type: param
    dtype: float
    default: 0
TestResultWithNoSetDefined:
    indices: [SET]
    type: result
    dtype: float
    default: 0
    calculated: True

etc..
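
Such fixtures could then feed tests along these lines; the validate_config import below is a placeholder for whatever entry point ends up being written:

import pytest

# placeholder import; the module and function do not exist yet
from otoole.utils import validate_config

LEGAL = {
    "SET": {"dtype": "string", "type": "set"},
    "TestParam": {
        "indices": ["SET"], "type": "param", "dtype": "float", "default": 0
    },
}

ILLEGAL = {
    "SET": {"dtype": "string", "fieldnotexpected": True, "type": "set"},
}

def test_legal_config_passes():
    validate_config(LEGAL)  # a legal config should not raise

def test_unexpected_field_raises():
    # substitute whatever error type the validator settles on
    with pytest.raises(ValueError):
        validate_config(ILLEGAL)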

A tangent...

While we are at it, perhaps it would make sense to restructure the configuration, for example to use headers for set, parameter and result rather than including these as fields within the entities. For example:

sets:
  SET:
    dtype: string
parameters:
  TestParam:
    indices: [SET]
    dtype: float
    default: 0
results:
  TestResult:
    indices: [SET]
    dtype: float
    default: 0
    calculated: True

This would remove the issue with type being a protected word, but would also require a fundamental restructuring of otoole's datastore!

trevorb1 commented 1 year ago

THIS PR IS STILL IN PROGRESS

Thanks for the additional explanation, @willu47! I thought about this for a little bit yesterday/today and went through the process you suggested (create fixtures, then tests, then logic). However, I guess I'm still a little confused about what our goal is with the validation. Do we want to validate only the formatting, or both the formatting and logic? Below I provide additional info for each option (as I ended up exploring both), and just want to get a second opinion (again!), please!

The tl;dr: I think we want formatting + logic, and that's what my most recent commit implements using pydantic. Logic is implemented for both processing and tests. The error handling between pydantic and pytest is causing some tests to fail, and I am working on that now. I still need to resolve conflicts.

Validating formatting only

Overview

Using a JSON schema, as you originally suggested.

Advantages

Disadvantages

SET:
  name: REGION
  dtype: int
PARAMETER:
  name: AccumulatedAnnualDemand
  indices: [REGION,FUEL,YEAR]
  dtype: float
  default: 0
RESULT:
  name: AnnualEmissions
  indices: [REGION,EMISSION,YEAR]
  dtype: float
  default: 0
  calculated: False

Example

Below is a (fairly) complete JSON schema we would use for validation of the sample YAML shown above.

{
    "$schema": "https://json-schema.org/draft/2019-09/schema",
    "title": "User Defined Set",
    "description": "this schema is used to validate user defined sets",
    "type": "object",
    "properties": {
        "SET": {
            "description":"A unique set ",
            "type": "object",
            "properties":{
                "name":{
                    "type": "string",
                    "pattern": "^[A-Z_]*$"
                },
                "dtype":{
                    "oneOf": [
                        { "const": "string" },
                        { "const": "int" }
                    ]
                }
            },
            "required":["name", "dtype"],
            "additionalProperties": false
        },
        "PARAMETER":{
            "description":"A unique parameter",
            "type": "object",
            "properties":{
                "name":{
                    "type": "string",
                    "pattern": "^[A-Za-z_]*$"
                },
                "dtype":{
                    "oneOf": [
                        { "const": "int" },
                        { "const": "float" }
                    ]
                },
                "indices": {
                    "type": "array"
                },
                "default": {
                    "type": "number"
                }
            },
            "required": ["name", "dtype", "indices", "default"],
            "additionalProperties": false
        },
        "RESULT":{
            "description":"A unique result",
            "type": "object",
            "properties":{
                "name":{
                    "type": "string",
                    "pattern": "^[A-Za-z_]*$"
                },
                "dtype":{
                    "oneOf": [
                        { "const": "int" },
                        { "const": "float" }
                    ]
                },
                "indices": {
                    "type": "array"
                },
                "default": {
                    "type": "number"
                },
                "calculated": {
                    "type": "boolean"
                }
            },
            "required": ["name", "dtype", "indices", "default", "calculated"],
            "additionalProperties": false
        }
    },
    "required": ["SET"],
    "additionalProperties": false
}

Validating formatting and logic

Overview

Use pydantic to validate formatting and handle logic.

Advantages

Disadvantages

My Opinion

I think we want both formatting and logic tested, so that's what my latest commit starts to implement. Pydantic is used to handle the type validation and logic. I have also added a new test module which holds the config.yaml fixtures, and integrated the validation into the otoole logic. Some of the tests aren't working right now, as Pydantic is doing something weird when raising the error; it says it's of type pydantic.ValidationError, but pytest isn't capturing that. Moreover, I have to deal with the conflicts that have popped up.
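
For reference, a minimal sketch of a test that does capture the error, assuming the test imports pydantic's own ValidationError class (a mismatched import is one common reason pytest.raises fails to match):

import pytest
from pydantic import BaseModel, ValidationError

class UserDefinedSet(BaseModel):
    name: str
    dtype: str

def test_missing_dtype_is_caught():
    # pytest.raises matches on the exception class, so this must be
    # the same ValidationError class that pydantic actually raises
    with pytest.raises(ValidationError):
        UserDefinedSet(name="SET")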

Example of Pydantic

Input config

SET NAME:
  dtype: str
  type: set

After running otoole


1 validation error for UserDefinedSet
name
  SET NAME -> Parameter name can not have spaces (type=value_error)
trevorb1 commented 1 year ago

Overview

Hello @willu47! I think this PR is in a decent state now. As discussed in the last comment, I went the Pydantic route to handle the data validation (and the logic associated with the validation).

Logic

The major updates in this PR include:

  1. The addition of a /preprocess/validate_config.py file. This module holds the logic associated with validating the config file. More info on this below.
  2. The addition of a test_validate_config.py file. This testing module holds an array of simple config fixtures to test the validation logic (following your suggestion in a previous comment).
  3. The addition of a validate_config(user_config) function in the utils.py file. This slightly restructures the data format of the user config before passing it into the validation functions (as discussed in this comment); a sketch of the reshaping follows this list. This reformatting is only used in the validation logic.
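
A guess at what that reshaping could look like; the details are illustrative rather than copied from the PR:

def restructure_user_config(user_config: dict) -> list:
    # move each top-level key (the entity name) into the entry itself,
    # so a model with a "name" field can validate it
    return [{"name": name, **attributes}
            for name, attributes in user_config.items()]

print(restructure_user_config({"SET": {"dtype": "str", "type": "set"}}))
# [{'name': 'SET', 'dtype': 'str', 'type': 'set'}]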

Data Validation Logic

I split the validation into four classes:

  1. UserDefinedValue,
  2. UserDefinedParameter(UserDefinedValue),
  3. UserDefinedSet(UserDefinedValue), and
  4. UserDefinedResult(UserDefinedValue)

The idea is that the UserDefinedValue class handles reusable validation logic (such as checking naming conventions), while the classes that inherit from UserDefinedValue handle logic specific to their type (parameters, sets, and results).
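
A condensed sketch of that inheritance pattern (pydantic v1 style; the fields and error message shown are illustrative):

from typing import List

from pydantic import BaseModel, validator

class UserDefinedValue(BaseModel):
    # fields and checks shared by parameters, sets, and results
    name: str
    type: str
    dtype: str

    @validator("name")
    def name_has_no_spaces(cls, value):
        if " " in value:
            raise ValueError(f"{value} -> Name can not have spaces")
        return value

class UserDefinedParameter(UserDefinedValue):
    # parameter-specific fields and validation live in the subclass
    indices: List[str]
    default: float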

When an error in the config.yaml file is flagged, pydantic raises a ValidationError. The logic captures errors across different parameters; however, it only captures one error at a time for each parameter. For example, the config file below raises the accompanying error: it sees that both parameters have invalid types, but does not flag the second error in ParameterNameTwo.

# Has invalid type
ParameterNameOne:
    indices: [VALID_SET]
    type: invalid_type
    dtype: float
    default: 0

# Has invalid type and index
ParameterNameTwo:
    indices: [VALID_SET, INVALID_SET]
    type: invalid_type
    dtype: float
    default: 0
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml 
OtooleConfigFileError: 
ParameterNameOne -> Type must be 'param', 'result', or 'set'
ParameterNameTwo -> Type must be 'param', 'result', or 'set'

If we fix the invalid types and rerun the command, the second error with ParameterNameTwo is now caught.

# Valid parameter
ParameterNameOne:
    indices: [VALID_SET]
    type: param
    dtype: float
    default: 0

# Has invalid index
ParameterNameTwo:
    indices: [VALID_SET, INVALID_SET]
    type: param
    dtype: float
    default: 0
OtooleConfigFileError: 
ParameterNameTwo -> Index not in user supplied sets

Other Notes

If this is not what you imagined the validation looking like, I'm totally happy to change strategies (and just chalk this up to a learning experience)! After using Pydantic, it feels a little overkill for our use, haha, but I wanted to get this done by the end of the week and needed to put something together :) If we do switch strategies, I will just make sure we discuss the objectives of the validation before redoing this, haha.

Examples

Here are just a few examples of the functionality!

Duplicate Naming

SampleValue:
    indices: [REGION,FUEL,YEAR]
    type: param
    dtype: float
    default: 0
SAMPLEVALUE:
    dtype: int
    type: set
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml 
ValueError: SAMPLEVALUE -> defined more than once

Set not defined

Parameter:
    indices: [REGION,FUEL,YEAR, INVALID]
    type: param
    dtype: float
    default: 0
SET:
    dtype: int
    type: set
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml 
OtooleConfigFileError: 
Parameter -> Index not in user supplied sets

Invalid Name Lengths

LongParameterNameThatIsLongerThanThirtyOneChars:
    indices: [REGION,FUEL,YEAR]
    type: param
    dtype: float
    default: 0
AnotherLongParameterNameThatIsLongerThanThirtyOneChars:
    short_name: ShortNameIsAlsoLongerThanThirtyOneChars
    indices: [REGION,FUEL,YEAR]
    type: param
    dtype: float
    default: 0
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml 
OtooleConfigFileError: 
LongParameterNameThatIsLongerThanThirtyOneChars -> Name is longer than 31 characters and no 'short_name' field provided
ShortNameIsAlsoLongerThanThirtyOneChars -> Name is longer than 31 characters

Invalid Naming Conventions

Parameter Name With Spaces:
    indices: [REGION,FUEL,YEAR]
    type: param
    dtype: float
    default: 0
ParameterWithSpecialChars%$#@:
    indices: [REGION,FUEL,YEAR]
    type: param
    dtype: float
    default: 0
ParameterWithNumbers12345:
    indices: [REGION,FUEL,YEAR]
    type: param
    dtype: float
    default: 0
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml 
OtooleConfigFileError: 
Parameter Name With Spaces -> Name can not have spaces
ParameterWithSpecialChars%$#@ -> Name can not have special characters, except for underscores
ParameterWithNumbers12345 -> Name can not have digits

Invalid dtype

Parameter:
    indices: [REGION,FUEL,YEAR]
    type: param
    dtype: float
    default: not_a_float
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml 
OtooleConfigFileError: 
Parameter -> User dtype is float while default value dtype is str
willu47 commented 1 year ago

Looks good @trevorb1! Many thanks for this.