frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Table Schema: contextual metadata about the schema #384

Closed CharlesNepote closed 5 years ago

CharlesNepote commented 7 years ago

I would find it very useful to build schema documentation based on Table Schema. That way, documentation and data would stay close together, because the schema can also be used for validation with CSV Lint.

To do that, we need contextual information about the schema: its title, author, version, and so on... Here is an example to show what I would find useful:

{
    "schema": {
      "title": "Specification of the annual list of newborns' first names",
      "author": "Charles Nepote <charles.nepote@fing.org>",
      "date": "March 2017",
      "version": "0.1beta",
      "description": "The annual list of newborns' first names is a simple dataset that is very popular with the public. It consists of a list of first names with the number of occurrences of each for a given year.",
      "fields": [
        {
          "name": "CODE_INSEE",
          "title": "INSEE code",
          "description": "INSEE code of the commune where the first names are registered. Taken from the Code officiel géographique, it consists of 5 alphanumeric characters (the first two correspond to the department and may therefore contain the letters A and B, used for Corsica).",
          "type": "string",
          "examples": "06088, 2B002 (for a Corsican commune)",
          "constraints": {
            "required": true,
            "pattern": "^([013-9]\\d|2[AB1-9])\\d{3}$"
          }
        }
      ]
    }
}

I made a very quick and (very) dirty tool to show that. Unfortunately, it's only in French for the moment; I'll translate it if you're interested: http://dataliteracyconference.net/specificator/demo3.html (link will change).

Adding some contextual information won't break anything, I think. And the open data movement needs more professional yet simple tools. Thanks for your efforts!
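
To illustrate the idea of generating documentation from a schema, here is a minimal Python sketch. It is not the tool linked above. The function name `render_schema_docs`, the `EXAMPLE` descriptor, and the Markdown output layout are assumptions made for illustration; only the property names (`title`, `author`, `version`, `date`, `description`, `fields`) come from the example in this comment.

```python
# Hypothetical descriptor using the contextual properties suggested in this issue.
EXAMPLE = {
    "schema": {
        "title": "Postal codes list schema",
        "author": "Jacques Facteur",
        "version": "0.1 beta",
        "description": "Raw list of postal codes in France.",
        "fields": [
            {
                "name": "postal_code",
                "title": "Postal code",
                "type": "string",
                "constraints": {"required": True},
            }
        ],
    }
}


def render_schema_docs(descriptor: dict) -> str:
    """Render a small Markdown page from a schema descriptor (illustrative helper)."""
    schema = descriptor["schema"]
    lines = [f"# {schema['title']}", ""]
    # Contextual metadata: printed only when present.
    for key in ("author", "version", "date"):
        if key in schema:
            lines.append(f"- {key}: {schema[key]}")
    if "description" in schema:
        lines += ["", schema["description"]]
    lines += ["", "## Fields", ""]
    for field in schema.get("fields", []):
        required = field.get("constraints", {}).get("required", False)
        label = field.get("title", field["name"])
        lines.append(
            f"- `{field['name']}` ({field.get('type', 'string')}"
            f"{', required' if required else ''}): {label}"
        )
    return "\n".join(lines)


print(render_schema_docs(EXAMPLE))
```

Because the same descriptor drives both the rendered page and validation, the documentation cannot drift away from the schema it describes.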

rufuspollock commented 7 years ago

@CharlesNepote very useful and interesting suggestion.

Much of this metadata looks like it is very similar to the generic descriptive metadata on Data Package and Data Resource so we can probably reuse somewhat.

As a first step would you like to tidy this up into a "pattern" -- our approach atm is first to get a solid pattern, do some community review and then after there's been time to see this is solid and useful to look at whether it goes into the main spec.

CharlesNepote commented 7 years ago

@rufuspollock Do you mean I should propose a new section of the pattern page (with a clear specification)?

[...] do some community review

How? Where?

rufuspollock commented 7 years ago

@CharlesNepote once you have a draft we'll put it on patterns. Review will then happen:

  1. When you submit and we merge the PR 😄
  2. Naturally over time as people try it out
  3. We will flag to e.g. the specs working group

CharlesNepote commented 7 years ago

Here is a very first draft. I wrote it the way the other patterns were written. Is it OK? My English is not as good as I would like, so there are probably mistakes in my proposal.

I tried to keep things simple but many aspects can be discussed:

===========================================

Schemas Documentation

Overview

Documentation is a fundamental aspect of data sharing. Table Schema needs contextual metadata to let people and software understand and manage schemas.

That way, people creating or reusing schemas would understand them better. Some software would be able to produce user documentation based on schemas. In the future, other software could use the metadata for different tasks, such as verifying schema integrity and provenance, crawling and cataloguing schemas, versioning schemas, etc.

Implementations

There are no known implementations at present.

Specification

To allow schema documentation, implementations MUST start with the schema object. This object contains the already defined fields array, as in a classic Tabular Data Package. It also contains some other properties useful for schema documentation. These other properties come from widely adopted practices for describing a resource.

Each schema MUST have a title, which represents the human-readable title of the schema. The title is the only required property.

All other properties SHOULD be implemented:

A user might build a schema.json as follows:

{
  "schema": {
    "title": "Postal codes list schema",
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
    "version": "0.1 beta",
    "date": "2017/01/31",
    "description": "Postal codes list schema defines the raw list of postal codes in France.",
    "homepage": "http://example.com/postal-code-list-schema.html",
    "url": "http://example.com/2017/03/13/schema.json",
    "previous": "http://example.com/2017/02/21/schema.json",
    "fields": [
      {
        "name": "postal_code",
        "title": "postal code",
        "type": "string"
      }
    ]
  }
}
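
The MUST/SHOULD rules of this draft could be checked along the following lines. This is only a sketch: the helper name `check_schema_metadata` is hypothetical, and the `RECOMMENDED` list is an assumption taken from the property names in the example above, since the draft's own list of optional properties is not reproduced here.

```python
# Optional properties, assumed from the example schema.json in this draft.
RECOMMENDED = (
    "author", "publisher", "contributor", "version", "date",
    "description", "homepage", "url", "previous",
)


def check_schema_metadata(descriptor: dict):
    """Return (errors, warnings) for the contextual metadata of a schema."""
    schema = descriptor.get("schema", {})
    errors, warnings = [], []
    # title is the only required property (MUST).
    title = schema.get("title")
    if not isinstance(title, str) or not title.strip():
        errors.append("title is required (MUST)")
    # Every other property is recommended (SHOULD), so it only yields a warning.
    for name in RECOMMENDED:
        if name not in schema:
            warnings.append(f"{name} is recommended (SHOULD)")
    return errors, warnings


errors, warnings = check_schema_metadata(
    {"schema": {"title": "Postal codes list schema", "fields": []}}
)
print(errors)    # no errors: the title is present
print(warnings)  # one warning per missing recommended property
```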

joelgombin commented 7 years ago

Thanks @CharlesNepote for this great suggestion. A few remarks:

CharlesNepote commented 7 years ago

I'm not sure whether the contributor field should be an array or a single string.

If it's an array, what is the use case? I think this field is just there to thank contributors, not to enable new uses. Keep it simple.

In the date field, what do you mean by "in a free format"? Do you mean there's no standard format to be followed? In that case I would be worried this data cannot be used.

I agree. A standard format will allow many interesting use cases: schema cataloguing by date, RSS feeds, etc. I'll specify the ISO 8601 format.
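
As a small illustration of why a standard date format matters, here is a Python sketch that parses ISO 8601 calendar dates and sorts schemas by date. The helper name `parse_schema_date` and the sample list are hypothetical; `date.fromisoformat` is the standard-library parser and rejects free-form values such as "2017/01/31" or "mars 2017".

```python
from datetime import date


def parse_schema_date(value: str) -> date:
    """Parse an ISO 8601 calendar date (YYYY-MM-DD); raises ValueError otherwise."""
    return date.fromisoformat(value)


# Hypothetical catalogue entries: with a standard format, sorting by date is trivial.
schemas = [
    {"title": "B", "date": "2017-03-13"},
    {"title": "A", "date": "2017-02-21"},
]
ordered = sorted(schemas, key=lambda s: parse_schema_date(s["date"]))
print([s["title"] for s in ordered])

try:
    parse_schema_date("2017/01/31")
except ValueError as exc:
    print(f"rejected: {exc}")
```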

The metadata for a given version of the schema lets you know if there are previous versions, but not if there are more up-to-date versions...

The homepage field should be used for this purpose. It should be the web page where the schema's environment is described: history, changelog, versions, etc. I'll specify that in the documentation.

Finally, some of the metadata you can find on schema pages (e.g. http://specs.frictionlessdata.io/table-schema/) is not in your documentation

Yes, it's intentional. This documentation would take place:

Here is the updated spec. @rufuspollock do I have to make a pull request?

=============================================

Schemas Documentation

Overview

Documentation is a fundamental aspect of data sharing. Table Schema needs contextual metadata to let people and software understand and manage schemas.

That way, people creating or reusing schemas would understand them better. Some software would be able to produce user documentation based on schemas. In the future, other software could use the metadata for different tasks, such as verifying schema integrity and provenance, crawling and cataloguing schemas, versioning schemas, etc.

Implementations

There are no known implementations at present.

Specification

To allow schema documentation, implementations MUST start with the schema object. This object contains the already defined fields array, as in a classic Tabular Data Package. It also contains some other properties useful for schema documentation. These other properties come from widely adopted practices for describing a resource.

Each schema MUST have a title, which represents the human-readable title of the schema. The title is the only required property.

All other properties SHOULD be implemented:

A user might build a schema.json as follows:

{
  "schema": {
    "title": "Postal codes list schema",
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
    "version": "0.1 beta",
    "date": "2017-01-31",
    "description": "Postal codes list schema defines the raw list of postal codes in France.",
    "homepage": "http://example.com/postal-code-list-schema.html",
    "url": "http://example.com/2017/03/13/schema.json",
    "previous": "http://example.com/2017/02/21/schema.json",
    "fields": [
      {
        "name": "postal_code",
        "title": "postal code",
        "type": "string"
      }
    ]
  }
}

pwalsh commented 7 years ago

The family of specs already has fields that deal with title, author/contributor/publisher (contributors), date (created), description. We could reuse these on the table schema spec.

CharlesNepote commented 7 years ago

The family of specs already has fields that deal with title, author/contributor/publisher (contributors), date (created), description. We could reuse these on the table schema spec.

Thanks @pwalsh, I appreciate your help.

I did a global search of each spec, and here are my observations.

pwalsh commented 7 years ago

Hi @CharlesNepote if you could add the problems / bugs to this issue it would be great https://github.com/frictionlessdata/specs/issues/385

CharlesNepote commented 7 years ago

@pwalsh @rufuspollock

I've updated my spec below, making an effort to take existing properties into account. My proposal creates 3 new properties:

I'm still uncomfortable with 2 properties of frictionlessdata:

Are homepage and contributors definitively settled in the Frictionless Data specs?

========================

Schemas Documentation

Overview

Documentation is a fundamental aspect of data sharing. Table Schema needs contextual metadata to let people and software understand and manage schemas.

That way, people creating or reusing schemas would understand them better. Some software would be able to produce user documentation based on schemas. In the future, other software could use the metadata for different tasks, such as verifying schema integrity and provenance, crawling and cataloguing schemas, versioning schemas, etc.

Implementations

There are no known implementations at present.

Specification

To allow schema documentation, implementations MUST start with the schema object. This object contains the already defined fields array, as in a classic Tabular Data Package. It also contains some other properties useful for schema documentation. These other properties come from widely adopted practices for describing a resource.

Each schema MUST have a title, which represents the human-readable title of the schema. The title is the only required property.

All other properties SHOULD be implemented:

A user might build a schema.json as follows:

{
  "schema": {
    "title": "Postal codes list schema",
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
    "version": "0.1 beta",
    "created": "2017-01-31",
    "description": "Postal codes list schema defines the raw list of postal codes in France.",
    "homepage": "http://example.com/postal-code-list-schema.html",
    "uri": "http://example.com/2017/03/13/schema.json",
    "previous": "http://example.com/2017/02/21/schema.json",
    "fields": [
      {
        "name": "postal_code",
        "title": "postal code",
        "type": "string"
      }
    ]
  }
}

cbenz commented 6 years ago

Just to let you know, we generated GitBook documentation for some schemas we are using, based on the enhancements proposed by @CharlesNepote, slightly adapted.

For example:

I intend to propose a "pattern" as suggested above in the coming weeks, to reopen the discussion about adding metadata to table schemas on a concrete basis.

rufuspollock commented 5 years ago

@cbenz great - that would be amazing!

johanricher commented 5 years ago

@rufuspollock @pwalsh @frictionlessdata Our team at @jailbreak-paris has increasingly used Table Schema for the past year and a half and we're seeing more and more adoption around us (including at @Etalab). We think it has a bright future. So we've rolled up our sleeves and (finally!) got to work on this. Our draft proposal and (most importantly) questions are here: https://github.com/frictionlessdata/specs/pull/627

rufuspollock commented 5 years ago

@johanricher and @cbenz this is awesome, thank you. I will comment on the PR. I also note you could reuse @CharlesNepote's proposal for the text (which I think was great - we just did not get it as an actual PR against the patterns page https://github.com/frictionlessdata/specs/blob/master/specs/patterns.md)

johanricher commented 5 years ago

Hey @rufuspollock, thanks for the feedback! We followed your instructions by starting this new PR against the patterns page: #630. As with #627, we built upon @CharlesNepote's idea but also tried to remain closer to the other Frictionless Data specs by using properties from common.yml (e.g. author and publisher as proposed by Charles are redundant with the contributors property).
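
To make the redundancy concrete, here is a hedged Python sketch of how the author, publisher and contributor strings from Charles's draft could be folded into a single contributors array. The helper name `to_contributors` is hypothetical, and the entry shape (objects with `title` and `role`) is an assumption based on the contributors property of the Data Package spec; the pattern in the PR may differ.

```python
def to_contributors(schema: dict) -> list:
    """Fold author/publisher/contributor strings into one contributors list."""
    contributors = []
    if "author" in schema:
        contributors.append({"title": schema["author"], "role": "author"})
    if "publisher" in schema:
        contributors.append({"title": schema["publisher"], "role": "publisher"})
    # The draft stores several contributor names in one comma-separated string.
    for name in schema.get("contributor", "").split(","):
        if name.strip():
            contributors.append({"title": name.strip(), "role": "contributor"})
    return contributors


draft = {
    "title": "Postal codes list schema",
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
}
for entry in to_contributors(draft):
    print(entry)
```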