frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Virtual columns in Table Schema - Constants #529

Open pwalsh opened 6 years ago

pwalsh commented 6 years ago

This text is almost a copy/paste from work by @akariv - Adam, please edit directly, or comment below, as needed.

Context

Data sources don’t always contain all the data necessary to use them properly and simply.

For example, some yearly budget files in specific countries won’t have a column with the fiscal year or the country code. Publishers of such files assume that anyone downloading the file would know which year and country the dataset belongs to. However, when adding such datasets into global repositories of fiscal data, the omission of this data from the actual rows becomes evident.

This problem is resolved by introducing the constant property for table schema fields. If this property exists and contains a value, then this field will be ignored when reading the data from the data source, and it will be added afterward with the correct data type, etc.

Example:

{
   "resources": [
     {
        ...
        "schema": {
           ...
           "fields": [
             ...
             {
                 "name": "country-code",
                 "title": "Country Code",
                 "type": "string",
                 "constant": "UK"
             }
           ]
        }
     }
   ]
}
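To make the intended reading behaviour concrete, here is a minimal Python sketch. It is an illustration only, not the API of any existing library: the helper name `read_with_constants` and the sample `amount` column are hypothetical. It assumes a field that declares `constant` is never read from the source and is filled with that value on every row.

```python
import csv
import io


def read_with_constants(csv_text, fields):
    """Yield one dict per row, in schema field order.

    Hypothetical helper: fields that declare `constant` are ignored in the
    source data and filled in with the declared value instead.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {
            field["name"]: (
                field["constant"] if "constant" in field else row.get(field["name"])
            )
            for field in fields
        }


# Schema fields as in the example above, plus a made-up physical column.
fields = [
    {"name": "amount", "type": "string"},
    {"name": "country-code", "type": "string", "constant": "UK"},
]

# The source file has no country-code column at all.
rows = list(read_with_constants("amount\n100\n250\n", fields))
```

Each returned row carries `country-code` even though the file never contained it. Values are left as raw strings here; casting by `type` is a separate step.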

Implementation

Fiscal Data Package already has a constant property at the level of the model (an abstraction layer above one or many resources). With changes we are implementing on Fiscal Data Package, it makes more sense to move constant to the level of Table Schema fields, essentially introducing the idea of virtual columns. We are doing this for Fiscal Data Package in any event, but we think it is useful for Table Schema in general, and we would like to see this implementation of constants, represented as virtual columns, land in Table Schema v1.1.

The constant value can either be a logical value (when possible) or a physical value, in which case type and format rules will apply as usual.
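One possible reading of the logical/physical distinction, sketched in Python. The rule used here is an assumption and is not defined by the proposal: a string constant on a non-string field is treated as a physical value and cast, and anything else is taken as already logical. The `CASTERS` table and `resolve_constant` are hypothetical names that cover only a few types with their default formats.

```python
from datetime import datetime

# Hypothetical casters for a handful of Table Schema types (default formats only).
CASTERS = {
    "string": str,
    "integer": int,
    "number": float,
    "date": lambda text: datetime.strptime(text, "%Y-%m-%d").date(),
}


def resolve_constant(field):
    """Return the logical value of a field's `constant`.

    Assumption: a str constant on a non-string field is a physical value
    and goes through the usual type rules; other values are used as-is.
    """
    value = field["constant"]
    field_type = field.get("type", "string")
    if isinstance(value, str) and field_type != "string":
        return CASTERS[field_type](value)
    return value


# Physical constant: the string "2015" is cast to the integer 2015.
year_physical = resolve_constant({"name": "year", "type": "integer", "constant": "2015"})
# Logical constant: already an integer, so it is returned unchanged.
year_logical = resolve_constant({"name": "year", "type": "integer", "constant": 2015})
# Physical date constant in the default ISO format.
fiscal_start = resolve_constant({"name": "start", "type": "date", "constant": "2015-04-01"})
```

Under this reading both spellings of the year resolve to the same logical value, so a publisher could write whichever form is more convenient.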

rufuspollock commented 6 years ago

@pwalsh 👍 here. My only thought is the relation of this to the logical vs physical model. This is really part of the logical model, not the physical model. If we did have a "distinct" model object like in FDP then it might make sense for it to go on that rather than tableschema.

akariv commented 6 years ago

I added a short explanation re physical and logical.

rufuspollock commented 6 years ago

@pwalsh 👍 on this. Think it is valuable - and people can just ignore it which gives graceful degradation for non-supporting systems.

vitorbaptista commented 6 years ago

What should the behaviour be when exporting a datapackage that has a virtual column to XLS, for example? Should these columns be added to the exported file? Similarly, when iterating over a datapackage's resource with a virtual column, should the library add these columns to the returned rows? Are these questions something we want to define in the spec at all?

pwalsh commented 6 years ago

@vitorbaptista I don't think we want to have that in the spec, but perhaps as implementation recommendations.

What do you think?

Stephen-Gates commented 6 years ago

To add a real world example, I have a large collection of tide gauge data with constants stored in a readme. The constant feature would help me avoid adding 16 constants to datasets containing around 180,000 rows.

I agree with Vitor's questions. I think implementation guidance is needed and also publisher guidance e.g. when should you add columns to the data vs add constants to the schema.

akariv commented 6 years ago

We ended up taking a different approach in the fiscal data package (this is the working draft https://hackmd.io/BwNgpgrCDGDsBMBaAhtALARkWsPEE5posR8RxgAzffWfDIA=?view), in which we don't modify the table schema itself but rather use another property in the descriptor for that.