frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Validate against external profile #618

Closed peterdesmet closed 3 years ago

peterdesmet commented 3 years ago

Overview

I want to validate my package against an external profile that is an extension of tabular-data-package. If I read the specs (https://specs.frictionlessdata.io/profiles/#profile-property) correctly:

Custom profiles MUST have a profile property, where the value is a unique identifier for that profile. This unique identifier MUST be a string and can be in one of two forms. It can be an id from the official Data Package Schema Registry, or, a fully-qualified URL that points directly to a JSON Schema that can be used to validate the profile.

This can be done with:

"profile": "https://my-profile.json",

However, it seems only very specific values for profile are checked, and otherwise it falls back to tabular-data-package? If I validate the same data package with:

  1. "profile": "tabular-data-package" => valid
  2. "profile": "fiscal-data-package" => several validation errors, including:

The data package has an error: "'fiscal-data-package' is not one of ['tabular-data-package']" at "profile" in metadata and at "allOf/0/properties/profile/enum" in profile

  3. "profile": "https://specs.frictionlessdata.io/schemas/fiscal-data-package.json" => valid

Is this expected behaviour? Would it be possible to:

  1. Validate against the linked schema, if that is a valid JSON Schema?
  2. Raise an error if the value for profile is not a valid JSON Schema or a string value listed on https://specs.frictionlessdata.io/schemas/registry.json?
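The check requested in point 2 could look roughly like the sketch below. It is not how frictionless works; it only illustrates the rule quoted from the specs. `classify_profile` is a hypothetical helper, and `REGISTRY_IDS` is an assumed, incomplete subset of the ids listed in the registry.

```python
from urllib.parse import urlparse

# Assumed subset of ids from the Data Package Schema Registry (illustrative only).
REGISTRY_IDS = {
    "data-package",
    "tabular-data-package",
    "fiscal-data-package",
}


def classify_profile(value):
    """Return "registry" or "url"; raise ValueError for any other value."""
    if not isinstance(value, str):
        raise ValueError(f"profile must be a string, got {type(value).__name__}")
    if value in REGISTRY_IDS:
        return "registry"
    parsed = urlparse(value)
    # A fully-qualified URL needs both a scheme and a host.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    raise ValueError(
        f"profile {value!r} is neither a registry id nor a fully-qualified URL"
    )
```

A value classified as "url" would still need to be fetched and checked as a JSON Schema before validating the package against it; that step is left out here.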


roll commented 3 years ago

Thanks @peterdesmet,

I'll investigate

peterdesmet commented 3 years ago

Hi @roll any news on this issue? The fact that we can't validate our camera trap DP profile as an extension of tabular-data-package is currently blocking its release.

roll commented 3 years ago

@peterdesmet Thanks for the heads-up, I'll prioritize this issue

roll commented 3 years ago

Hi @peterdesmet,

Please try frictionless@4.12 - https://github.com/frictionlessdata/frictionless-py/blob/main/tests/test_package.py#L924-L988

It now supports external profiles. Note that the profile registry is going to be deprecated, so this only works for direct profile links, local or remote.

peterdesmet commented 3 years ago

Great! My first step was testing it out on an existing data package with the typical "profile": "tabular-data-package":

frictionless validate https://raw.githubusercontent.com/tdwg/dwc-for-biologging/master/derived/camtrap-dp/data/raw/datapackage.json
# -----
# valid: deployments.csv
# -----
# -----
# valid: multimedia.csv
# -----
# -----
# valid: observations.csv
# -----

But in frictionless@4.12.1 I get:

frictionless validate https://raw.githubusercontent.com/tdwg/dwc-for-biologging/master/derived/camtrap-dp/data/raw/datapackage.json
# -------
# invalid: https://raw.githubusercontent.com/tdwg/dwc-for-biologging/master/derived/camtrap-dp/data/raw/datapackage.json
# -------
=====  ====================================================================================================================
code   message                                                                                                             
=====  ====================================================================================================================
error  cannot extract metadata "tabular-data-package" because "[Errno 2] No such file or directory: 'tabular-data-package'"
=====  ====================================================================================================================

peterdesmet commented 3 years ago

It does work with "profile": "https://specs.frictionlessdata.io/schemas/data-package.json", but many existing data packages just have a string (e.g. tabular-data-package) identifying the profile (one from the registry), which should likely be kept for backwards compatibility.

peterdesmet commented 3 years ago

Other than that, "profile": "https://raw.githubusercontent.com/tdwg/camtrap-dp/0.1.3/camtrap-dp-profile.json" works splendidly! 🎉 This is absolutely fantastic!

It returns errors for camtrap-dp-profile AND for data-package, which it is built on:

frictionless validate test/datapackage.json 
# -------
# invalid: test/datapackage.json
# -------
=============  ==================================================================================================================================================================================================================================================
code           message                                                                                                                                                                                                                                           
=============  ==================================================================================================================================================================================================================================================
package-error  The data package has an error: "'contribustor' is not one of ['publisher', 'author', 'maintainer', 'wrangler', 'contributor']" at "contributors/0/role" in metadata and at "allOf/0/properties/contributors/items/properties/role/enum" in profile
package-error  The data package has an error: "'hello' is not of type 'boolean'" at "multimedia_access/public" in metadata and at "allOf/1/properties/multimedia_access/properties/public/type" in profile                                                       
package-error  The data package has an error: "'url' is a required property" at "organizations/0" in metadata and at "allOf/1/properties/organizations/items/required" in profile                                                                                
package-error  The data package has an error: "'d' is not of type 'integer'" at "taxonomic/0/count" in metadata and at "allOf/1/properties/taxonomic/items/properties/count/type" in profile                                                                     
=============  ==================================================================================================================================================================================================================================================
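Each message above names a location in the metadata and a location in the profile, and the profile location starts with an `allOf` branch index. In this output, `allOf/0` appears to correspond to the base data-package rules and `allOf/1` to the camtrap-dp additions, though that mapping is inferred from the messages rather than confirmed. A small sketch for pulling both pieces out of a message, assuming the message format stays exactly as printed here (`locate` is a hypothetical helper, not part of frictionless):

```python
import re

# Matches the two quoted locations at the end of a package-error message.
LOCATION = re.compile(r'at "([^"]*)" in metadata and at "([^"]*)" in profile')


def locate(message):
    """Return (metadata_path, allOf_branch) or None if the message does not match."""
    match = LOCATION.search(message)
    if match is None:
        return None
    metadata_path, profile_path = match.groups()
    parts = profile_path.split("/")
    # The branch is the integer right after a leading "allOf", if present.
    if len(parts) > 1 and parts[0] == "allOf" and parts[1].isdigit():
        return metadata_path, int(parts[1])
    return metadata_path, None
```

Grouping errors by branch index makes it easier to see which ones come from the extension and which from the profile it builds on.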

roll commented 3 years ago

Thanks @peterdesmet,

I'll fix it tomorrow morning. The problem is that Frictionless is not shipped with the Tabular Data Package profile, as it uses a more sophisticated validation approach (kind of object-based, using Schema/Resource/etc. profiles separately). But now I see that I need to include it.

In general, I think we need to think of slightly reworking the concept of the profile on the spec level as it leads to some variation problems like in https://github.com/frictionlessdata/specs/issues/743. Currently, it lacks composability in my opinion.

roll commented 3 years ago

@peterdesmet I'm releasing frictionless@4.12.2 with a fix.

Generally speaking, I would recommend using tabular-data-package and having tabular-data-resource on tabular resources is enough to validate it

peterdesmet commented 3 years ago

@roll do you mean: not having tabular-data-resource at package level, but just indicating your resources as tabular-data-resource? That makes sense to me, since I assume there isn't much more validation happening at package level for tabular data resources (that is different from data-resource)?

roll commented 3 years ago

@peterdesmet Yes. E.g. for the data-package profile, internally it just drops the resources JSON Schema rules from the package profile and uses them for every resource individually
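
To make the idea concrete, here is a minimal sketch of such a split. It is not the frictionless implementation; `split_profile` is a hypothetical helper, and it assumes the resource rules live under `properties.resources.items` in the package-level JSON Schema. The `profile` dict is a toy schema, not the real data-package profile.

```python
import copy


def split_profile(package_profile):
    """Return (package_rules, resource_rules) from a package-level JSON Schema.

    The input is left untouched; the resources rules are removed from the copy.
    """
    package_rules = copy.deepcopy(package_profile)
    resources = package_rules.get("properties", {}).pop("resources", {})
    resource_rules = resources.get("items", {})
    return package_rules, resource_rules


# Toy package profile, for illustration only.
profile = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "resources": {
            "type": "array",
            "items": {"type": "object", "required": ["name"]},
        },
    },
}

package_rules, resource_rules = split_profile(profile)
```

With the rules separated like this, each resource can be checked on its own against `resource_rules`, while `package_rules` covers only the package-level properties.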