Hi @trevorb1 - you have been busy!
It is always worth weighing up whether to add a new dependency to a Python library, checking to see if the dependency is under active development and what its dependencies are. It is easy to slide into "dependency hell" where it becomes very difficult to manage the combination of packages that are compatible with each other and the releases of Python.
I think this is a reasonable approach, but there might be a simpler way. For example, you can validate YAML using a json schema. The advantage of these approaches is that you get ValidationError messages built in and don't have to write custom methods for each datatype. However, you may be restricted in how far you can validate more complex dependencies.
So there is a link then between the original datapackage.json files used for the Frictionless Data implementation of datapackages (which I think we should remove for v1.0) and using them to validate the yaml config.
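For what it's worth, a minimal sketch of what that could look like (the schema and file name here are made up for illustration; assumes pyyaml and jsonschema are installed):

import yaml
from jsonschema import ValidationError, validate

# Hypothetical schema: every top-level entry must declare a known "type"
schema = {
    "type": "object",
    "additionalProperties": {
        "type": "object",
        "required": ["type"],
        "properties": {"type": {"enum": ["param", "result", "set"]}},
    },
}

with open("config.yaml") as f:
    config = yaml.safe_load(f)  # YAML parses to plain dicts, so a json schema applies

try:
    validate(instance=config, schema=schema)
except ValidationError as error:
    print(error.message)  # built-in, human-readable error message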
Thanks @willu47! I didn't realize we could validate YAML using a json schema; I'll check that out!
Side note, pydantic also does have a standard ValidationError you can raise. The tutorial I followed showed how to implement custom exceptions after that, so that's what I ended up pushing haha.
Hi @willu47!
I took some time today to try and set up a json schema to validate the config, BUT I think I may have overcomplicated it or just misunderstood your previous comment haha. So I just want to run what I have done so far by you to get a second opinion.
In my latest commit, I add logic to the validate_config.py file to create a json schema from the datapackage.json, which I can then use to validate the config.yaml file. I couldn't use the raw datapackage.json as the schema, so that's what the first data processing steps are for. Moreover, we use the type keyword in the config.yaml to denote either a param, result, or set; however, type is a reserved keyword in a json schema, so there is some processing to update that as well.
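For illustration, the kind of renaming step this involves looks roughly like this (the replacement key name is made up; the actual processing lives in validate_config.py):

def rename_type_key(config: dict) -> dict:
    # "type" is a json schema keyword, so rename the user-facing "type"
    # field before the config is checked against the generated schema
    return {
        name: {("value_type" if key == "type" else key): value for key, value in fields.items()}
        for name, fields in config.items()
    }

assert rename_type_key({"SET": {"type": "set", "dtype": "string"}}) == {
    "SET": {"value_type": "set", "dtype": "string"}
}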
Before I go ahead and continue down this path of setting up a json schema from the datapackage, I do have a couple questions:
1. Does it make sense to create the json schema from the datapackage like this, or is there a way to pass the datapackage.json directly into the jsonschema.validate(...) function? Or if you're also unsure, no worries at all! If I am way off base though, please do just let me know!
2. Right now, the validation only checks that the config.yaml matches to the datapackage.json. However, if the user doesn't input data correctly into the datapackage.json we aren't going to flag that. Moreover, writing a generic validation schema for both the config.yaml and datapackage.json seems difficult since they use different keywords to represent similar data?
3. If we want something that works for both the config.yaml and datapackage.json which can check the user inputs, I'm not sure that is the best approach? Comes back to your point on validating more complex dependencies. Based on the (little bit) of research I did, seems like Pydantic (11.9k stars, plus updated for 3.11 already, with what seems like a very active community), Schema (2.7k stars), or Cerberus (2.9k stars) may be better positioned to do this. With that said, I do totally get your point of not wanting to add in unnecessary libraries, so I am happy to stick with the jsonschema library if you think that is the safest option, or if I'm just totally missing what I am supposed to be doing in this PR 🙃
4. Is it worth building this around the datapackage.json if we are planning to drop it in the v1.0 release? Or am I better to just start building a schema or implementing logic (via pydantic or similar) to check the config.yaml directly?
Thanks so much in advance!
Below is a sample of what the schema I am extracting from the datapackage.json looks like:
{
  "type": "object",
  "properties": {
    "MODE_OF_OPERATION": {
      "type": "object",
      "properties": {
        "default": {
          "type": "integer"
        },
        "param": {
          "anyof": [
            {
              "pattern": {
                "enum": ["set"]
              }
            }
          ]
        }
      },
      "required": ["param"]
    },
    "AnnualExogenousEmission": {
      "type": "object",
      "properties": {
        "default": {
          "type": "number"
        },
        "param": {
          "anyof": [
            {
              "pattern": {
                "enum": ["param", "result"]
              }
            }
          ]
        },
        "indices": {
          "type": "array",
          "items": {
            "enum": ["REGION", "EMISSION", "YEAR"]
          }
        }
      },
      "required": ["param"]
    },
Hi @trevorb1 - I think we're going a bit off target here. The objective is to validate the user created YAML file, specifically the structure, format and entries for sets, parameters and results. While there is a link to the datapackage.json, I don't think this should be used as the schema for the YAML file.
The user should be allowed the freedom to define new parameters, sets and results as they customise the osemosys model file.
I suggest taking a test-driven development approach to this problem. Create a set of extremely simple YAML examples of legal and illegal config files.
Here's the simplest example I can think of involving one parameter, one set and one result:
TestParam:
  indices: [SET]
  type: param
  dtype: float
  default: 0
SET:
  dtype: string
  type: set
TestResult:
  indices: [SET]
  type: result
  dtype: float
  default: 0
  calculated: True
Now, consider which of these elements is optional, which should be enforced, etc. Then identify the technology that will work to provide warnings for that. Then write the tests, then the code...
For example, create a config.yaml fixture and corresponding test with an error:
TestParamVeryLongNameWhichIsLongerThan31Characters:
  indices: [SET]
  type: param
  dtype: float
  default: 0
SET:
  dtype: string
  type: set
TestResult:
  indices: [SET]
  type: result
  dtype: float
  default: 0
  calculated: True
And another:
SET:
  dtype: string
  fieldnotexpected: True
  type: set
And another:
TestParamWithNoSetDefined:
  indices: [SET]
  type: param
  dtype: float
  default: 0
TestResultWithNoSetDefined:
  indices: [SET]
  type: result
  dtype: float
  default: 0
  calculated: True
etc.
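A sketch of what one such test might look like (pytest assumed; the import path and function name are placeholders for whatever entry point gets written):

import pytest

# Placeholder import: the real module and function are still to be written
from otoole.preprocess.validate_config import validate_config

@pytest.fixture
def config_with_long_name():
    # Mirrors the first illegal fixture above
    return {
        "TestParamVeryLongNameWhichIsLongerThan31Characters": {
            "indices": ["SET"],
            "type": "param",
            "dtype": "float",
            "default": 0,
        },
        "SET": {"dtype": "string", "type": "set"},
    }

def test_name_longer_than_31_chars_is_rejected(config_with_long_name):
    with pytest.raises(ValueError):
        validate_config(config_with_long_name)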
While we are at it, perhaps it would make sense to restructure the configuration, for example to use headers for set, parameter and result rather than including these as fields within the entities. For example:
sets:
  SET:
    dtype: string
parameters:
  TestParam:
    indices: [SET]
    dtype: float
    default: 0
results:
  TestResult:
    indices: [SET]
    dtype: float
    default: 0
    calculated: True
This would remove the issue with type being a protected word, but would also require a fundamental restructuring of otoole's datastore!
THIS PR IS STILL IN PROGRESS
Thanks for the additional explanation, @willu47! I thought about this for a little bit yesterday/today and went through the process you suggested (create fixtures, then tests, then logic). However, I guess I'm still a little confused about what our goal is with the validation. Are we wanting to only validate the formatting, or both the formatting and logic? Below I provide additional info for each option (as I ended up exploring both), and just want to get a second opinion (again!) please!
The tl;dr: I think we want formatting + logic, and that's what my most recent commit reflects using pydantic. Logic is implemented for both processing and tests. The error handling between pydantic and pytest is causing some tests to fail and I am working on that now. I still need to resolve conflicts.
Option 1: Using the json schema, as you originally suggested.
- Each set, parameter, and result would need to be identified through a name field, such as in the example below. Moreover, we would not be able to validate the whole file at once, and would have to validate each parameter one by one (as shown) to avoid duplicate keys (a sketch of this loop follows after the schema below).
- A json schema can only check the formatting, not logic such as whether the dtype matches the default value, etc. If we wanted to check these, we would need to add in separate logic to do that.
SET:
  name: REGION
  dtype: int
PARAMETER:
  name: AccumulatedAnnualDemand
  indices: [REGION, FUEL, YEAR]
  dtype: float
  default: 0
RESULT:
  name: AnnualEmissions
  indices: [REGION, EMISSION, YEAR]
  dtype: float
  default: 0
  calculated: False
Below is a (fairly) complete json schema we would use for validation of the sample yaml shown above.
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "User Defined Set",
  "description": "this schema is used to validate user defined sets",
  "type": "object",
  "properties": {
    "SET": {
      "description": "A unique set",
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "pattern": "^[A-Z_]*$"
        },
        "dtype": {
          "oneOf": [
            { "const": "string" },
            { "const": "int" }
          ]
        }
      },
      "required": ["name", "dtype"],
      "additionalProperties": false
    },
    "PARAMETER": {
      "description": "A unique parameter",
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "pattern": "^[A-Za-z_]*$"
        },
        "dtype": {
          "oneOf": [
            { "const": "int" },
            { "const": "float" }
          ]
        },
        "indices": {
          "type": "array"
        },
        "default": {
          "type": "number"
        }
      },
      "required": ["name", "dtype", "indices", "default"],
      "additionalProperties": false
    },
    "RESULT": {
      "description": "A unique result",
      "type": "object",
      "properties": {
        "name": {
          "type": "string",
          "pattern": "^[A-Za-z_]*$"
        },
        "dtype": {
          "oneOf": [
            { "const": "int" },
            { "const": "float" }
          ]
        },
        "indices": {
          "type": "array"
        },
        "default": {
          "type": "number"
        },
        "calculated": {
          "type": "boolean"
        }
      },
      "required": ["name", "dtype", "indices", "default", "calculated"],
      "additionalProperties": false
    }
  },
  "required": ["SET"],
  "additionalProperties": false
}
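For completeness, the one-by-one validation loop mentioned above might look something like this (a sketch; it assumes the schema is saved to schema.json and that each set/parameter/result group is written as a separate YAML document to avoid repeated keys):

import json

import yaml
from jsonschema import ValidationError, validate

with open("schema.json") as f:  # the schema shown above
    schema = json.load(f)

with open("config.yaml") as f:
    # One YAML document per set/parameter/result group, separated by "---",
    # so that repeated SET/PARAMETER/RESULT keys never collide
    for number, document in enumerate(yaml.safe_load_all(f), start=1):
        try:
            validate(instance=document, schema=schema)
        except ValidationError as error:
            print(f"document {number}: {error.message}")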
Option 2: Use pydantic to validate formatting and handle logic.
I think we are wanting both formatting and logic tested, so that's what my latest commit starts to implement. Pydantic is used to handle the type validation and logic. I have also added in a new test module which holds the config.yaml fixtures, and integrated the validation into the otoole logic. Some of the tests aren't working right now, as Pydantic is doing something weird when raising the error; it says it's of type pydantic.ValidationError, but pytest isn't capturing that. Moreover, I have to deal with the conflicts that have popped up.
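For reference, the basic pytest/pydantic interaction looks like this (a minimal sketch; in pydantic v1, ValidationError also subclasses ValueError, so expecting either should work, and whether that matches the failing tests depends on how the error is re-raised):

import pytest
from pydantic import BaseModel, ValidationError

class UserDefinedSet(BaseModel):
    name: str

def test_missing_name_raises():
    # pydantic raises its own ValidationError when a required field is absent;
    # pytest captures it when the raised class (or a base class) is expected
    with pytest.raises(ValidationError):
        UserDefinedSet()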
Input config
SET NAME:
  dtype: str
  type: set
After running otoole
1 validation error for UserDefinedSet
name
  SET NAME -> Parameter name can not have spaces (type=value_error)
Hello @willu47! I think this PR is in a decent state now. As discussed in the last comment I went the Pydantic route to handle the data validation (and logic associated with the validation).
The major updates in this PR include:
- A new /preprocess/validate_config.py file. This module holds the logic associated with validating the config file. More info on this below.
- A new test_validate_config.py file. This testing module holds an array of simple config fixtures to test the validation logic (following your suggestion in a previous comment).
- A validate_config(user_config) function in the utils.py file. This slightly restructures the data format of the user config before passing it into the validation functions (as discussed in this comment). This reformatting is only used in the validation logic.

I split the validation into 4 classes:
UserDefinedValue, UserDefinedParameter(UserDefinedValue), UserDefinedSet(UserDefinedValue), and UserDefinedResult(UserDefinedValue). The idea is that the UserDefinedValue class handles reusable validation logic (such as checking naming conventions), and then the classes that inherit from UserDefinedValue handle logic specific to their type (parameters, sets, and results).
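To make the structure concrete, a stripped-down sketch of that hierarchy (the class names match the PR, but the fields and validators shown here are illustrative, not the full implementation):

from typing import List

from pydantic import BaseModel, validator

class UserDefinedValue(BaseModel):
    # Reusable checks shared by all user defined values
    name: str
    type: str
    dtype: str

    @validator("name")
    def name_has_no_spaces(cls, value):
        if " " in value:
            raise ValueError("Name can not have spaces")
        return value

class UserDefinedSet(UserDefinedValue):
    pass  # set-specific logic goes here

class UserDefinedParameter(UserDefinedValue):
    indices: List[str]
    default: float
    # parameter-specific logic, e.g. checking indices against user supplied sets

class UserDefinedResult(UserDefinedValue):
    indices: List[str]
    default: float
    calculated: bool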
When an error in the config.yaml file is flagged, Pydantic will raise a ValidationError. The logic will capture errors across different parameters; however, it will only capture one error at a time for each parameter. For example, the config file below will raise the accompanying error. It sees that both parameters have invalid types, but will not flag the second error in ParameterNameTwo.
# Has invalid type
ParameterNameOne:
  indices: [VALID_SET]
  type: invalid_type
  dtype: float
  default: 0

# Has invalid type and index
ParameterNameTwo:
  indices: [VALID_SET, INVALID_SET]
  type: invalid_type
  dtype: float
  default: 0
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml
OtooleConfigFileError:
ParameterNameOne -> Type must be 'param', 'result', or 'set'
ParameterNameTwo -> Type must be 'param', 'result', or 'set'
If we fix the invalid types and rerun the command, the second error with ParameterNameTwo is now caught.
# Valid parameter
ParameterNameOne:
  indices: [VALID_SET]
  type: param
  dtype: float
  default: 0

# Has invalid index
ParameterNameTwo:
  indices: [VALID_SET, INVALID_SET]
  type: param
  dtype: float
  default: 0
OtooleConfigFileError:
ParameterNameTwo -> Index not in user supplied sets
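For reference, a self-contained sketch of the loop that produces this first-error-per-entity behaviour (the Entity model is a hypothetical stand-in for the real UserDefined* classes, and the plain ValueError stands in for OtooleConfigFileError):

from pydantic import BaseModel, ValidationError, validator

class Entity(BaseModel):
    # Hypothetical stand-in for the real UserDefined* classes
    type: str

    @validator("type")
    def type_is_known(cls, value):
        if value not in ("param", "result", "set"):
            raise ValueError("Type must be 'param', 'result', or 'set'")
        return value

user_config = {
    "ParameterNameOne": {"type": "invalid_type"},
    "ParameterNameTwo": {"type": "invalid_type"},
}

# Validate every entity, keeping only the first error per entity,
# then report everything in one message
errors = []
for name, fields in user_config.items():
    try:
        Entity(**fields)
    except ValidationError as ex:
        errors.append(f"{name} -> {ex.errors()[0]['msg']}")

if errors:
    raise ValueError("\n".join(errors))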
If this is not what you imagined the validation would look like, I'm totally happy to change up strategies (and just chalk this up to a learning experience)! After using Pydantic, it feels a little overkill for our use haha, but I wanted to get this done for the end of the week and needed to put something together :) If you are wanting to switch strategies, maybe we should just make sure we discuss the objectives of the validation before I redo this haha.
Here are just a few examples of the functionality!
SampleValue:
  indices: [REGION, FUEL, YEAR]
  type: param
  dtype: float
  default: 0
SAMPLEVALUE:
  dtype: int
  type: set
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml
ValueError: SAMPLEVALUE -> defined more than once
Parameter:
  indices: [REGION, FUEL, YEAR, INVALID]
  type: param
  dtype: float
  default: 0
SET:
  dtype: int
  type: set
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml
OtooleConfigFileError:
Parameter -> Index not in user supplied sets
LongParameterNameThatIsLongerThanThirtyOneChars:
  indices: [REGION, FUEL, YEAR]
  type: param
  dtype: float
  default: 0
AnotherLongParameterNameThatIsLongerThanThirtyOneChars:
  short_name: ShortNameIsAlsoLongerThanThirtyOneChars
  indices: [REGION, FUEL, YEAR]
  type: param
  dtype: float
  default: 0
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml
OtooleConfigFileError:
LongParameterNameThatIsLongerThanThirtyOneChars -> Name is longer than 31 characters and no 'short_name' field provided
ShortNameIsAlsoLongerThanThirtyOneChars -> Name is longer than 31 characters
Parameter Name With Spaces:
  indices: [REGION, FUEL, YEAR]
  type: param
  dtype: float
  default: 0
ParameterWithSpecialChars%$#@:
  indices: [REGION, FUEL, YEAR]
  type: param
  dtype: float
  default: 0
ParameterWithNumbers12345:
  indices: [REGION, FUEL, YEAR]
  type: param
  dtype: float
  default: 0
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml
OtooleConfigFileError:
Parameter Name With Spaces -> Name can not have spaces
ParameterWithSpecialChars%$#@ -> Name can not have special characters, except for underscores
ParameterWithNumbers12345 -> Name can not have digits
Parameter:
  indices: [REGION, FUEL, YEAR]
  type: param
  dtype: float
  default: not_a_float
(otoole) trevorb1@DESKTOP-M23U3I0:~/repositories/otoole/trevor$ otoole convert datapackage datafile datapackage.json data.txt config.yaml
OtooleConfigFileError:
Parameter -> User dtype is float while default value dtype is str
Looks good @trevorb1! Many thanks for this.
DRAFT PR
I spent a little bit of time today exploring how we could validate the config file (issue #133). Originally, I was going to try and use frictionless, since that's what you implemented the datapackage parsing with. However, I'm not sure if I was just missing something, not understanding how to use frictionless, or if frictionless doesn't really support this kind of validation; but I couldn't get it working and explored other options. This led me to pydantic, which seems to be exactly what we are looking for!
I implemented a super rough and dirty example of how we could use pydantic to validate the config.yaml file, including checks on things like naming conventions and dtypes. Not all the functionality is implemented right now, since I was just testing it out, but it seems to be pretty flexible in terms of how to validate and how to display errors to the user. Moreover, after validating, we can write out the config file with any modifications we make (such as if a user does not capitalize the name of a set, we can automatically fix this).
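To give a flavour of that auto-fixing idea, a toy sketch (hypothetical model and field names; it normalises a lowercase set name and writes the corrected config back out):

import yaml
from pydantic import BaseModel, validator

class UserDefinedSet(BaseModel):
    name: str
    dtype: str

    @validator("name")
    def capitalize_name(cls, value):
        # Auto-fix rather than reject: force set names to upper case
        return value.upper()

raw = yaml.safe_load("region:\n  dtype: string\n")
fixed = {UserDefinedSet(name=name, **fields).name: fields for name, fields in raw.items()}
print(yaml.safe_dump(fixed))  # writes "REGION:" with the original fields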
QUESTIONS
Do you think using pydantic is a good method to move forward with for validation? If so, I will clean up the code, add logic to all the checks, add other checks where needed, and integrate it into otoole.
Or if I just completely missed where this type of validation can be done with frictionless, I'm totally happy to pivot and explore that option further!
Or if there is a different way to do this that you had in mind, I would be super interested in discussing that!