frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org

Encapsulated extensibility #103

Closed: Stiivi closed this issue 4 years ago

Stiivi commented 10 years ago

The data package spec says:

"A Data Package author MAY add any number of additional fields beyond those listed in the specification here. "

And the temporal entry example follows.
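For reference, the kind of descriptor that example describes looks roughly like this (the shape below is illustrative, not a verbatim copy of the spec's example):

    {
      "name": "example-package",
      "temporal": {
        "start": "2000-01-01",
        "end": "2010-12-31"
      }
    }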

I'm a bit worried that this much freedom of extensibility might prove counterproductive later. Why?

  1. The namespace is limited
  2. Many package authors – many meanings
  3. Many tool creators – many varying expectations

Namespace: the authors of the specification are the namespace owners. They have the power over what goes in and what stays out, because they are the standard's creators. If the standard is to stay "lean", the namespace should contain only those entries that are really necessary.

Authors' keys and tools' expectations: since there are many package authors, each might put their own metadata under whatever keys they like, regardless of whether the same key exists with a different meaning in other packages. The same can be said of tool creators (visualisation, ETL, mining, ...): they might expect certain types of values or structure under a key, yet receive invalid values because some author decided to use that key for something else.

The temporal entry is not a very appropriate example of customisation. That piece of metadata might be genuinely useful for ETL tools, but it is not part of the standard! The value can be anything. Or the value can be exactly as stated in the example, but under a different key, for example time_range.
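To make the collision concrete, here are three hypothetical descriptors: one author uses temporal as an object, another as a plain string, and a third puts the same object under time_range. A tool expecting one shape will choke on the others.

    { "name": "package-a", "temporal": { "start": "2000-01-01", "end": "2010-12-31" } }

    { "name": "package-b", "temporal": "the 2000s" }

    { "name": "package-c", "time_range": { "start": "2000-01-01", "end": "2010-12-31" } }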

Proposal

This is especially important for metadata that is going to be consumed mostly by tools – for automation or application processing.

  1. Strictly guard the namespace and have the known keys/values part of the specification
  2. Have a separate structure for custom metadata, for example custom = { ... }.

Don't allow authors to put arbitrary keys at the top level. Discourage them from adding keys under any other known/specified object.

Recommend that authors put custom keys under that encapsulated, customisable structure. The contents of that structure are not guaranteed – neither the keys nor their values.
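A minimal sketch of the proposed shape (the vendor-style keys inside custom are hypothetical, and by design nothing inside custom is guaranteed):

    {
      "name": "example-package",
      "resources": [{ "path": "data.csv" }],
      "custom": {
        "acme-visualiser": { "colour": "#336699" },
        "acme-etl": { "time_range": ["2000-01-01", "2010-12-31"] }
      }
    }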

Package authors and tool creators can agree to use certain keys in that structure. Once you see that a key has spread well, you can promote it into the specification at the top level, with well-defined contents (for example, the most widely used shape).

Larger custom metadata can even be put in a separate .json file. As for smaller structures (mostly objects in lists), such as sources or fields, custom keys might not be forbidden, but they should at least be discouraged.

Alternative

If you would like to allow custom top-level keys, then I would suggest having a wiki page (GitHub wiki?) where package writers and tool creators document their metadata and the expected values.

Use-case

Here is a use case from another project:

In Cubes we try to minimise the number of keys in the model metadata. Every model object (cube, dimension, attribute, ...) has an info dictionary for custom keys and values. Visualisation tools sometimes require more metadata or hinting to be able to display the data properly or nicely, so a tool writer adds a soft requirement for a custom key in the info dictionary. It might be cosmetic (colour, formatting, image, ...), data related (time range) or metadata related (calendar unit of an attribute).

The cubes-viewer is a visualisation application that connects to the Cubes server. It fetches the metadata and builds its user interface from it (labels, relationships, concept hierarchies, ...). The app has special handling for time series, and it recommended that model creators add cv- prefixed keys to the info dictionary. For example, to denote that a field represents a year:

                    "info": { "cv-datefilter-field": "year" }

Cubes previously had no explicit notion of date/time. However, since the concept of time is important for data analysis, roles of dimensions and their levels were introduced. attribute.info.cv-datefilter-field is now attribute.role (and roles can relate to non-time dimensions as well). It went from a single-purpose non-standard key to a multi-purpose standard one. The attribute role is now generated by the server automatically if not specified explicitly by the model author.
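A sketch of that evolution (the attribute shapes here are simplified and not exact Cubes model syntax): before, the hint lived in the unguaranteed info area; after, it is a first-class property.

Before:

    { "name": "year", "info": { "cv-datefilter-field": "year" } }

After:

    { "name": "year", "role": "year" }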

Conclusion

Why do we need metadata in JSON? To be machine processable. Moreover, to make the data itself machine processable based on that machine-processable metadata. If that were not the purpose, we would be fine with just a plain README.md.

This might seem to work against the evolution of the package format, but it does not. It is just guarded evolution that prevents a future incompatibility mess. The separate metadata is simply incubated until it proves stable and standardised enough to be included in the specification.

rufuspollock commented 10 years ago

@Stiivi this is a very sensible suggestion. I wonder if there is a link to #87 (profiles). Maybe we say that if you want to specialize for a profile, you should add a key with the profile name and your custom stuff goes under there (and if you want stuff at the top level, you have to make an actual change to the spec, requiring discussion here).
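A hedged sketch of that idea (the profile name myprofile and its contents are made up for illustration): the custom material is namespaced under the profile's key rather than scattered across the top level.

    {
      "name": "example-package",
      "myprofile": {
        "time_range": ["2000-01-01", "2010-12-31"]
      }
    }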

pwalsh commented 8 years ago

@rgrp I think by now we've accepted (implicitly) that a data package can have any additional properties the creator desires. So we either reconsider that explicitly, or close this as WONTFIX. WDYT?

rufuspollock commented 8 years ago

@pwalsh I think the point here is to think about whether there is any best-practice way to indicate extensions, and how we manage conflicts over namespaces ...

The issue is that I'm not sure of the exact way forward. I'm inclined towards wontfix, but I'd be interested in any thoughts from @Stiivi (or anyone else) on best practice here and the best way forward.

Stiivi commented 8 years ago

Best practice is to keep the standard structure strictly unpolluted and to have a designated place/property where all extensions are stored without any restrictions. Let them evolve, observe them, and potentially include them in the standard. If others fight over a name in the "extensions" space, then that is their problem, not the standard's problem; they should be aware of it and resolve it themselves. That space is a wild west, and that is OK.

Use cases for the extension metadata proposed here:

  1. data packages I produce within my organisation, for consumption by tools within that same organisation, carrying internal metadata
  2. metadata that my organisation and some third-party organisation have agreed to share

Extensions should be strippable from the data package without affecting its usefulness in the outside world. They might eventually become standard once accepted by a significant number of data package producers and consumers.
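To illustrate strippability (assuming the encapsulated custom structure proposed above; the acme-etl key and its contents are hypothetical): removing the custom property below leaves a descriptor that is just as valid and useful to outside consumers.

    {
      "name": "sales-2015",
      "resources": [{ "path": "sales.csv" }],
      "custom": {
        "acme-etl": { "refresh": "daily", "owner": "data-team" }
      }
    }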

Re profiles: I'm not sure I understand the proposal or its usefulness. The name profile is confusing to me – a profile is something I would keep on my end, separate from the data package, as a many-per-user specification of a view of the data in the referenced data package.

The standard is here for a reason – tools can rely on the guaranteed existence of properties and their semantics. Since people don't read standards, and given that the top-level dictionary can include custom keys, the line between what is standard and what is not becomes blurred. As a tool builder, I should only have to worry about my own extensions, not foreign ones. As things stand, I don't know what I should worry about and what I should just ignore.

rufuspollock commented 4 years ago

DUPLICATE. I'm closing this in favour of the newer issue #663, which covers the same core ground.