Overview
Hi, I've come to datacontract-cli hoping to replace our current piecemeal data import validation and to formally define our team's expectations for the data we receive (I realise that, in theory, contracts should be owned by data producers, but that's not the reality at my org).
One of our use cases, which the docs seem to suggest is supported, doesn't really work with this CLI as far as I can tell: validating an example CSV file against the model schema.
I understand this is probably because CSV files have no proper types, so validating them against the schema is not straightforward. However, this kind of validation is a valid use case*, and from the docs it looks like something the CLI would do: it says it supports CSV and that it validates the schema as part of the `test` command.
* Something we currently do using frictionless.
Detailed Examples
I've attached a zip file containing a datacontract YAML and some example files to demonstrate my point, along with an equivalent example of the same data in JSON format, highlighting how the behaviour differs between the two — which is surprising, given the difference is not documented.
datacontract_csv_validation_github_issue.zip
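For anyone who doesn't want to open the zip, the contract is along these lines (an illustrative sketch rather than the attached file verbatim — the spec version, paths, and server names here are assumed):

```yaml
dataContractSpecification: 1.1.0
id: csv-validation-example
info:
  title: CSV validation example
  version: 0.0.1
servers:
  bad_data_type_csv:
    type: local
    path: data/bad_data_type.csv
    format: csv
  bad_data_type_json:
    type: local
    path: data/bad_data_type.json
    format: json
models:
  example_model:
    type: table
    fields:
      integer_field:
        type: integer
        required: true
      string_field:
        type: string
```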
Bad Data Type
In this example I have simply set the `integer_field` to be equal to a string value. Running the test against the `bad_data_type_json` server gives exactly the result we would expect:

However, when we run `test` for `bad_data_type_csv` we get this:

On inspection, it appears that for CSV files the CLI basically just checks that all the fields are present during a test, but does not actually check the values.
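For what it's worth, the kind of per-value check I was expecting is not hard to sketch in plain Python (a rough illustration of the idea, not frictionless's or datacontract-cli's actual implementation; the field names and types are from my example):

```python
import csv
import io

# Declared types per column, as the contract's model would define them.
SCHEMA = {"integer_field": int, "string_field": str}

def check_csv_types(csv_text, schema):
    """Return (row_number, field, value) tuples for values that fail to
    parse as the declared type. CSV values are all strings, so declaring
    'integer' really means 'parses as an integer'."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_num, row in enumerate(reader, start=1):
        for field, expected in schema.items():
            value = row.get(field)
            if expected is int:
                try:
                    int(value)
                except (TypeError, ValueError):
                    errors.append((row_num, field, value))
    return errors

bad = "integer_field,string_field\nnot_a_number,hello\n42,world\n"
print(check_csv_types(bad, SCHEMA))  # row 1's integer_field fails to parse
```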
Missing Field
Given the above result, I thought I'd check whether the validation would at least pick up missing fields in my CSV, so I set up servers for JSON/CSV files where the `integer_field` was missing completely. Again, when running against `missing_field_json` I got what I was expecting:

But when running against `missing_field_csv`, I got what appears to be an unhandled error:

(I think there is also a slight subtlety here with nulls vs actually missing keys/columns, but I'll avoid getting into that for brevity.)
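Detecting a missing column gracefully only needs a comparison of the header row against the model, so an unhandled error here seems avoidable. A minimal sketch of the check I'd hoped for (illustrative only, not the CLI's actual code):

```python
import csv
import io

def missing_columns(csv_text, expected_fields):
    """Compare the CSV header row against the fields declared in the
    model and return the declared fields absent from the file."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [f for f in expected_fields if f not in header]

data = "string_field\nhello\n"
print(missing_columns(data, ["integer_field", "string_field"]))
# integer_field is reported as missing
```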
Summary
My basic question is: is schema validation of CSV data ever likely to be in scope for this tool? If not, I can understand to some extent taking the line that CSV validation against a tech-agnostic schema can't go much further than checking that columns exist without getting messy, but that does limit its usefulness as a tool, and I think it needs making more explicit in the documentation.