frictionlessdata / datapackage-py

A Python library for working with Data Packages.
https://frictionlessdata.io
MIT License

datapackage validate: validate files, tabular schema, etc. #270

Closed: cpina closed this issue 4 years ago

cpina commented 4 years ago

Overview

Right now, if I run datapackage validate datapackage.json, it validates (I think, from a quick test) the descriptor: the JSON is correct, the required fields are present, the fields have the correct type / pass the regular expressions, etc.
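For context, this is roughly the descriptor-only check I mean, sketched with the datapackage-py Package object (the valid/errors attributes are what I'd expect from the library's usual interface; adjust if they differ):

```python
# Sketch: descriptor-only validation, roughly what `datapackage validate`
# appears to cover (JSON structure, required fields, field types).
from datapackage import Package

package = Package('datapackage.json')
if package.valid:
    print('descriptor is valid')
else:
    for error in package.errors:
        print('descriptor error:', error)
```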

We expected some more validations:

- If a resource has a local path: validate that the file is there.
- If a resource has a local path plus bytes and hash: validate that the file has the correct bytes/hash.
- If a resource has a remote URL: download it, and validate bytes and hash if possible.
- If a resource is tabular data: try to "read" it to validate the columns, missing values and other tabular verifications.

All of this is easy to do with the library (we did it ourselves; see the sketch below). But we expected validate to do it, or at least to offer flags for it. Some users of Frictionless Data might not be keen on implementing the checks themselves and might just want to use the Python CLI to validate.
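A rough sketch of those extra per-resource checks, using the standard library plus datapackage-py. The Resource attributes used here (descriptor, local, tabular, read) are assumptions from the library's documented interface, and the hash handling assumes an md5 digest:

```python
# Sketch of the extra checks: local file presence, byte count, md5 hash,
# and a tabular read. Not the library's built-in validation.
import hashlib
import os

from datapackage import Package

package = Package('datapackage.json')
for resource in package.resources:
    descriptor = resource.descriptor
    path = descriptor.get('path')

    if resource.local and path:
        if not os.path.exists(path):
            print(f'{resource.name}: missing file {path}')
            continue
        if 'bytes' in descriptor and os.path.getsize(path) != descriptor['bytes']:
            print(f'{resource.name}: byte count mismatch')
        if 'hash' in descriptor:
            expected = descriptor['hash'].replace('md5:', '')
            with open(path, 'rb') as stream:
                actual = hashlib.md5(stream.read()).hexdigest()
            if actual != expected:
                print(f'{resource.name}: md5 mismatch')

    if resource.tabular:
        try:
            resource.read()  # casting against the schema surfaces column/type problems
        except Exception as error:
            print(f'{resource.name}: tabular read failed: {error}')
```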


Please preserve this line to notify @roll (lead of this repository)

roll commented 4 years ago

Hi @cpina,

Please try goodtables CLI

cpina commented 4 years ago

> Hi @cpina,
>
> Please try goodtables CLI

Thanks very much @roll - after a quick look, the goodtables CLI does almost all of what we wanted!

Some comments:

- Our resources use md5 because it seems to be the preferred hash here: https://specs.frictionlessdata.io/data-resource/#metadata-properties. But goodtables says Warning: Resource "ace_tm_concentrations" does not use the SHA256 hash. The check will be skipped. Should goodtables check the md5? (Do you want me to open an issue? Or should sha1 be the "favourite" one in the data-resource documentation?)
- When two columns have the same name: tableschema validate says that the schema is valid, and read() doesn't complain. But goodtables says [-,20] [duplicate-header] Header in column 20 is duplicated to header in column(s) 17. I'm happy with goodtables complaining about duplicated names, but should tableschema validate do the same? (See the sketch below.)
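To illustrate the second point, a small sketch reproducing what I see with tableschema-py (the Schema valid/errors attributes are assumed from its usual interface, and the field names are made up):

```python
# Sketch: a schema with two fields sharing the same name currently
# passes tableschema-py validation (field names are hypothetical).
from tableschema import Schema

descriptor = {
    'fields': [
        {'name': 'temperature', 'type': 'number'},
        {'name': 'temperature', 'type': 'number'},  # duplicate name
    ]
}

schema = Schema(descriptor)
print(schema.valid)   # True: duplicates are allowed by the Table Schema spec
print(schema.errors)  # []
```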

Thanks again for making me look at goodtables.

roll commented 4 years ago

Hi @cpina,

I'm currently working on a new version of goodtables which will support md5 and other hash algorithms.

Regarding duplicated field names: they are OK per the specs - https://specs.frictionlessdata.io/table-schema/ - which is why datapackage doesn't complain. On the other hand, goodtables also tries to enforce best practices, e.g. not having such field names. This check can be skipped with: goodtables data/invalid.csv --skip-checks duplicate-header
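The same thing should work from Python; a sketch assuming goodtables-py's validate() accepts a skip_checks option mirroring the CLI flag (data/invalid.csv is just a placeholder path):

```python
# Sketch: skipping the duplicate-header check via the goodtables Python API.
from goodtables import validate

report = validate('data/invalid.csv', skip_checks=['duplicate-header'])
print(report['valid'])
for table in report['tables']:
    for error in table['errors']:
        print(error['code'], error['message'])
```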

cpina commented 4 years ago

Thanks very much! :-) Feel free to close this issue (or should we wait for the md5 support?)

roll commented 4 years ago

I'll merge it into https://github.com/frictionlessdata/goodtables-py/issues/341.