frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Future possibility for delimiter-separated list for arrays (instead of JSON array)? #736

Closed thbar closed 7 months ago

thbar commented 3 years ago

I'm a colleague of @geoffreyaldebert, working on a French national CSV schema for "bikes counting" at the moment.

My understanding is that https://github.com/frictionlessdata/specs/issues/712 and https://github.com/frictionlessdata/frictionless-py/issues/627 introduced a way to restrict the allowed values in an array, which is neat.

In our case (based on input from future data producers and reusers), we would like to avoid JSON arrays for those values and use delimiter-separated values instead, which are easier for less technical users to write and decode.

The rationale is that we are creating a CSV schema precisely to avoid JSON, which some users find confusing at their current level of technical skill, in order to drive adoption.

Our current solution (WIP, the schema is not published yet) is to use a regex pattern:

https://github.com/etalab/schema-comptage-velo/blob/15096e6145b4926530a6fc5126db8cd25e35c803/schema.json#L175-L184

This trick has commonly been used for this case before (e.g. https://schema.data.gouv.fr/etalab/schema-inclusion-numerique/latest/documentation.html#propriété-public_cible).
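
To make the regex trick concrete, here is a minimal sketch of how such a pattern can be built and checked. The allowed values, the helper names (`build_list_pattern`, `is_valid_cell`) and the use of Python's `re.fullmatch` are assumptions for illustration; they are not taken from the linked schema.

```python
import re

# Hypothetical allowed values, for illustration only (not from the schema).
ALLOWED = ["pedestrians", "bikes", "scooters"]


def build_list_pattern(allowed, separator=","):
    """Build a regex matching one or more allowed values joined by `separator`."""
    # Escape each value so that regex metacharacters are treated literally.
    item = "(?:" + "|".join(re.escape(value) for value in allowed) + ")"
    return re.compile(f"{item}(?:{re.escape(separator)}{item})*")


PATTERN = build_list_pattern(ALLOWED)


def is_valid_cell(cell):
    """Return True if the whole cell is a separated list of allowed values."""
    return PATTERN.fullmatch(cell) is not None
```

With this sketch, `is_valid_cell("bikes,scooters")` is accepted while `is_valid_cell("bikes,cars")` and an empty cell are rejected. The pattern grows with the number of allowed values, which is part of why a dedicated column type would be cleaner.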

So my question is: is there room to consider future evolutions to add a "CSV-array" column type, with restrictions on actual values to be in an allowed range?

Thanks!

roll commented 3 years ago

We discussed it on Discord, and I think that type: array with format: separator, to allow something like:

id,array
1,"A,B,C"

might make sense for the specs
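
As a rough sketch of what reading such a file involves, using only the Python standard library: the CSV parser first unquotes the cell, then the cell is split on the separator. The function name `read_rows` and the choice to treat an empty cell as an empty list are assumptions, not part of the specs or of frictionless-py.

```python
import csv
import io

SAMPLE = 'id,array\n1,"A,B,C"\n'


def read_rows(text, separator=","):
    """Parse CSV text and split the `array` column on `separator`."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        cell = row["array"]
        # Assumption: an empty cell means an empty list, not [""].
        items = cell.split(separator) if cell else []
        rows.append({"id": int(row["id"]), "array": items})
    return rows
```

Here `read_rows(SAMPLE)` yields a single row whose `array` value is the list `["A", "B", "C"]`.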

thbar commented 3 years ago

FWIW, I have had feedback from users who would appreciate a non-comma separator (e.g. |), which is a "lower tech" way to achieve this and requires less escaping. I am not sure I want to encourage that, though. Ideally, a plain , as the separator would be consistent with the regular case.
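
The escaping difference can be seen with Python's standard `csv` writer, shown here as a small sketch (the helper name `to_csv_line` is made up for illustration). With the default minimal quoting, a cell containing the column delimiter has to be quoted, while a pipe-separated cell does not.

```python
import csv
import io


def to_csv_line(row):
    """Serialize one row with the default comma delimiter and minimal quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(row)
    return buffer.getvalue()


comma_line = to_csv_line([1, "A,B,C"])  # the cell contains the delimiter, so it is quoted
pipe_line = to_csv_line([1, "A|B|C"])   # no delimiter in the cell, so no quotes are needed
```

In this sketch `comma_line` is `1,"A,B,C"` and `pipe_line` is `1,A|B|C`, each followed by a newline.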

Thanks for considering this, it would be great to have and would let us clean a few schemas!

AyrtonB commented 3 years ago

I'm currently working on a PR to integrate this. I've made the relevant changes in array.py but now need to integrate it elsewhere and add tests.

Currently I'm getting the error below, which goes away if I remove the format entry for the field:

FrictionlessException: [field-error] Field is not valid: "{'name': 'sett_bmu_id', 'type': 'array', 'format': ', ', 'array_item': {'type': 'string'}, 'description': 'The Balancing Mechanism Unit identifier used for settlement purposes by Elexon', 'title': 'Settlement BMU ID'} is not valid under any of the given schemas" at "" in metadata and at "anyOf" in profile

Where should I be looking to add this to the schema?

roll commented 3 years ago

@AyrtonB It must be a JSONSchema rule in frictionless/assets/profiles/schema/general.json. We need to update the format definition there for array types

AyrtonB commented 3 years ago

That makes sense, I'll do that

AyrtonB commented 3 years ago

I'll continue discussion around specifics of this implementation in the PR linked above

jze commented 1 year ago

Is it already possible to specify arrays without the square brackets? I would say it is the normal case for CSV files. You have a value like 594866,594868,608288, and each number references a primary key in another CSV file.
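
For illustration, checking such a cell by hand could look like the sketch below. The function name `missing_references` and the set of primary keys are hypothetical, and this is plain Python rather than anything provided by the specs or the framework.

```python
def missing_references(cell, primary_keys, separator=","):
    """Return the ids in `cell` that do not appear in `primary_keys`."""
    # Split the cell on the separator and skip empty fragments.
    ids = [int(part) for part in cell.split(separator) if part.strip()]
    return [identifier for identifier in ids if identifier not in primary_keys]


# Hypothetical primary keys read from the other CSV file.
known_keys = {594866, 594868}
unresolved = missing_references("594866,594868,608288", known_keys)
```

In this example `unresolved` contains only `608288`, the one id with no matching primary key.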

roll commented 1 year ago

Hi, I've created a feature request for the framework to pilot the feature: