Table Schema: contextual metadata about the schema

CharlesNepote commented 7 years ago

I would find it very useful to build schema documentation based on Table Schema. That way documentation and data would stay close together, because the schema can also be used for validation with CSV Lint.

To do that, we need contextual information about the schema: its title, author, version, and so on... Here is an example to show what I would find useful:

{
    "schema": {
      "title": "Specification of the annual list of newborns' first names",
      "author": "Charles Nepote <charles.nepote@fing.org>",
      "date": "March 2017",
      "version": "0.1beta",
      "description": "The annual list of newborns' first names is a simple dataset that is very popular with the public. It consists of a list of first names with the number of occurrences of each for a given year.",
      "fields": [
        {
          "name": "CODE_INSEE",
          "title": "Code INSEE",
          "description": "INSEE code of the commune where the first names are registered. Taken from the Code officiel géographique, it consists of 5 alphanumeric characters (the first two correspond to the département and can therefore contain the letters A and B, used for Corsica).",
          "type": "string",
          "examples": "06088, 2B002 (for a Corsican commune)",
          "constraints": {
            "required": true,
            "pattern": "^([013-9]\\d|2[AB1-9])\\d{3}$"
          }
        }
      ]
    }
}
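As a rough illustration of how a validator might apply the pattern constraint shown above, here is a minimal Python sketch. The function name `matches_pattern` and the sample values are hypothetical and are not taken from CSV Lint or any Frictionless Data library.

```python
import re

# Pattern from the "constraints" object of the CODE_INSEE field.
# In JSON the backslash is doubled ("\\d"); once decoded it is a plain \d.
INSEE_PATTERN = r"^([013-9]\d|2[AB1-9])\d{3}$"


def matches_pattern(value: str, pattern: str = INSEE_PATTERN) -> bool:
    """Return True when the whole value satisfies the pattern constraint."""
    return re.fullmatch(pattern, value) is not None


print(matches_pattern("06088"))  # True: department 06 followed by three digits
print(matches_pattern("2A004"))  # True: Corsican prefix 2A
print(matches_pattern("20000"))  # False: the pattern excludes the prefix 20
print(matches_pattern("6088"))   # False: only four characters
```

A field marked "required": true would additionally reject empty values before the pattern is checked.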

I made a very quick and (very) dirty tool to show that. Unfortunately, it's only in French for the moment; I'll translate it if you're interested: http://dataliteracyconference.net/specificator/demo3.html (link will change).

Adding some contextual information won't break anything, I think. And the open data movement needs more professional yet simple tools. Thanks for your efforts!

rufuspollock commented 7 years ago

@CharlesNepote very useful and interesting suggestion.

Much of this metadata looks like it is very similar to the generic descriptive metadata on Data Package and Data Resource so we can probably reuse somewhat.

As a first step would you like to tidy this up into a "pattern" -- our approach atm is first to get a solid pattern, do some community review and then after there's been time to see this is solid and useful to look at whether it goes into the main spec.

CharlesNepote commented 7 years ago

@rufuspollock Do you mean I should propose a new section of the pattern page (with a clear specification)?

[...] do some community review

How? Where?

rufuspollock commented 7 years ago

@CharlesNepote once you have a draft we'll put it on patterns. Review will then happen:

When you submit and we merge the PR 😄
Naturally over time as people try it out
We will flag to e.g. the specs working group

CharlesNepote commented 7 years ago

Here is a very first draft. I wrote it the way the other patterns were written. Is it OK? My English is not as good as I would like, so there are probably mistakes in my proposal.

I tried to keep things simple but many aspects can be discussed:

I used author as in the other frictionless specs while Dublin Core says creator
I didn't use name, not sure it is relevant
I'm hesitant about adding "subject" (it could be very interesting for building a catalog of schemas)
url could be identifier as in Dublin Core
previous should be named previous_version

===========================================

Schemas Documentation

Overview

Documentation is a fundamental aspect of data sharing. Table Schema needs contextual metadata to let people and software understand and manage schemas.

That way, people creating or reusing schemas would have a better understanding of them. Some software would be able to produce user documentation based on schemas. In the future, other software would be able to use the metadata for different tasks, such as verifying schema integrity and provenance, crawling and cataloguing schemas, versioning schemas, etc.

Implementations

There are no known implementations at present.

Specification

To allow schema documentation, implementations MUST start with the schema object. This object contains the already defined fields array, as in a classic Tabular Data Package. It also contains some other properties useful for schema documentation. These other properties come from widely adopted practices for describing a resource.

Each schema MUST have a title, which represents the human-readable title of the schema. The title is the only required property.

All other properties SHOULD be implemented:

author is the author of the schema; one or more people and/or one or more organisations
publisher is the publisher of the schema; one or more organisations or one or more people
contributor is a list of contributors to this schema
version is a version number of the schema, so that people can be sure they are talking about the same thing
date is the release date of the schema in a free format
description is where authors can describe the schema in a few sentences; description is highly recommended; it can be a place to use keywords with #hashtags; Markdown format is encouraged
homepage is the home on the web that is related to this schema (not the schema itself); it is a well-formed URL
url is the web address where the schema can be retrieved
previous is the URL of the previous version of the current schema (allowing tools to produce schema diffs)

A user might build a schema.json as follows:

{
  "schema": {
    "title": "Postal codes list schema",
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
    "version": "0.1 beta",
    "date": "2017/01/31",
    "description": "Postal codes list schema defines the raw list of postal codes in France.",
    "homepage": "http://example.com/postal-code-list-schema.html",
    "url": "http://example.com/2017/03/13/schema.json",
    "previous": "http://example.com/2017/02/21/schema.json",
    "fields": [
      {
        "name": "postal_code",
        "title": "postal code",
        "type": "string"
      }
    ]
  }
}
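To make the MUST/SHOULD rules of this draft concrete, here is a minimal Python sketch of a checker. It assumes the property names of this first draft (`date`, `url`); the function name `check_schema_metadata` and the message wording are hypothetical, not part of any existing tool.

```python
# Properties the draft says SHOULD be present, using this draft's names.
RECOMMENDED = ("author", "publisher", "contributor", "version", "date",
               "description", "homepage", "url", "previous")


def check_schema_metadata(descriptor: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for the contextual metadata of a descriptor."""
    errors: list[str] = []
    warnings: list[str] = []
    schema = descriptor.get("schema")
    if not isinstance(schema, dict):
        return ["missing top-level 'schema' object"], warnings
    # title is the only required (MUST) property.
    title = schema.get("title")
    if not isinstance(title, str) or not title.strip():
        errors.append("'title' is required and must be a non-empty string")
    # Every other property is recommended (SHOULD), so only warn.
    for name in RECOMMENDED:
        if name not in schema:
            warnings.append(f"recommended property '{name}' is missing")
    return errors, warnings


errors, warnings = check_schema_metadata(
    {"schema": {"title": "Postal codes list schema", "fields": []}}
)
print(errors)         # []
print(len(warnings))  # 9
```

Keeping errors and warnings separate mirrors the difference between MUST and SHOULD: a descriptor with only a title is still valid under the draft.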

joelgombin commented 7 years ago

Thanks @CharlesNepote for this great suggestion. A few remarks:

I'm not sure whether the contributor field should be an array or a single string
in the date field, what do you mean "in a free format"? Do you mean there's no standard format to be followed? In that case I would be worried this data can not be used.
the metadata for a given version of the schema lets you know if there is a previous version, but not if there are more up-to-date versions...
finally, some of the metadata you can find on schema pages (e.g. http://specs.frictionlessdata.io/table-schema/) is not in your documentation, for example the language or the changelog. Is that intentional? Ideally, shouldn't the web pages presenting the schemas be able to be generated by the schema documentation? Or maybe I'm somehow missing the point?

CharlesNepote commented 7 years ago

I'm not sure whether the contributor field should be an array or a single string.

If it's an array, what is the use case? I think this field is just there to thank contributors, not to enable new usages. Keep it simple.

in the date field, what do you mean "in a free format"? Do you mean there's no standard format to be followed? In that case I would be worried this data can not be used.

I agree. A standard format will allow many interesting use cases: schema cataloguing by date, RSS feeds, etc. I'll specify the ISO 8601 format.
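To illustrate why a standard date format matters for cataloguing, here is a minimal Python sketch. The helper names `parse_schema_date` and `sort_by_date` are hypothetical, and the sketch assumes the metadata sits directly in each schema dictionary under a `date` key.

```python
from datetime import date


def parse_schema_date(value: str) -> date:
    """Parse an ISO 8601 calendar date such as "2017-01-31".

    Free-form values such as "2017/01/31" or "March 2017" raise ValueError.
    """
    return date.fromisoformat(value)


def sort_by_date(schemas: list[dict]) -> list[dict]:
    """Order schema metadata dictionaries by their release date."""
    return sorted(schemas, key=lambda s: parse_schema_date(s["date"]))


catalog = [
    {"title": "Schema B", "date": "2017-03-13"},
    {"title": "Schema A", "date": "2017-01-31"},
]
print([s["title"] for s in sort_by_date(catalog)])  # ['Schema A', 'Schema B']
```

With a free-format date, a tool has no reliable way to do this ordering.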

the metadata for a given version of the schema lets you know if there is a previous version, but not if there are more up-to-date versions...

The homepage field should be used for this purpose. It should be the web page where the schema environment is described: history, changelog, versions, etc. I'll specify that in the documentation.

finally, some of the metadata you can find on schema pages (e.g. http://specs.frictionlessdata.io/table-schema/) is not in your documentation

Yes, it's intentional. This documentation would go:

either on the patterns page
or merged into the Table Schema specification

Here is the updated spec. @rufuspollock do I have to make a pull request?

=============================================

Schemas Documentation

Overview

Documentation is a fundamental aspect of data sharing. Table Schema needs contextual metadata to let people and software understand and manage schemas.

That way, people creating or reusing schemas would have a better understanding of them. Some software would be able to produce user documentation based on schemas. In the future, other software would be able to use the metadata for different tasks, such as verifying schema integrity and provenance, crawling and cataloguing schemas, versioning schemas, etc.

Implementations

There are no known implementations at present.

Specification

To allow schema documentation, implementations MUST start with the schema object. This object contains the already defined fields array, as in a classic Tabular Data Package. It also contains some other properties useful for schema documentation. These other properties come from widely adopted practices for describing a resource.

Each schema MUST have a title, which represents the human-readable title of the schema. The title is the only required property.

All other properties SHOULD be implemented:

author is the author of the schema; one or more people and/or one or more organisations
publisher is the publisher of the schema; one or more organisations or one or more people
contributor is a string which lists the contributors to the schema
version is a version number of the schema, so that people can be sure they are talking about the same thing
date is the release date of the schema (ISO 8601 format; YYYY-MM-DD should be enough)
description is where author(s) can describe the schema in a few sentences; description is highly recommended; it can be a place to use keywords with #hashtags; Markdown format is encouraged
homepage is the home on the web that is related to this schema (not the schema itself); it should be where the schema environment is described: history, changelog, versions, etc.; this field is a well-formed URL
url is the web address where the schema can be retrieved
previous is the URL of the previous version of the current schema (allowing tools to produce schema diffs)

A user might build a schema.json as follows:

{
  "schema": {
    "title": "Postal codes list schema",
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
    "version": "0.1 beta",
    "date": "2017-01-31",
    "description": "Postal codes list schema defines the raw list of postal codes in France.",
    "homepage": "http://example.com/postal-code-list-schema.html",
    "url": "http://example.com/2017/03/13/schema.json",
    "previous": "http://example.com/2017/02/21/schema.json",
    "fields": [
      {
        "name": "postal_code",
        "title": "postal code",
        "type": "string"
      }
    ]
  }
}

pwalsh commented 7 years ago

The family of specs already has fields that deal with title, author/contributor/publisher (contributors), date (created), description. We could reuse these on the table schema spec.

CharlesNepote commented 7 years ago

The family of specs already has fields that deal with title, author/contributor/publisher (contributors), date (created), description. We could reuse these on the table schema spec.

Thanks @pwalsh, I appreciate your help.

I did a global search on each spec, and here are my observations.

my title and description properties share the same semantics and format as in the frictionlessdata specs
my date might be the same as created and I could switch to it: I understand that RFC 3339 derives from ISO 8601, but does it allow "1985-04-12" instead of "1985-04-12T23:20:50.52Z"?
my url property might be the same as the uri property in the frictionless data specs and I could switch to it: I know a URL is a URI, but I would like to be sure there are no consequences
author is, strangely, mentioned in one of the data package examples, but it is not specified (is it a bug? should I open an issue?)
I created the publisher, version and previous properties, which can be discussed. I was particularly surprised that publisher has not been used by the frictionlessdata specs.
my homepage is not as sophisticated as homepage in the data package properties; I can change it (even though I think it is more complicated to have "homepage": { "name": "My web page", "uri": "http://example.com/" })
my contributor is not as sophisticated as contributors in the data package properties: I wonder why it is so rich. (By the way, your role property is not specified at all; should I open an issue?)

pwalsh commented 7 years ago

Hi @CharlesNepote if you could add the problems / bugs to this issue it would be great https://github.com/frictionlessdata/specs/issues/385

CharlesNepote commented 7 years ago

@pwalsh @rufuspollock

I've updated my spec below, making an effort to take existing properties into account. My proposal creates 3 new properties:

publisher but I think we can abandon it if necessary
version: I think it is important to let people be sure they are speaking of the same thing
previous: the previous version of the schema, which I find important to ensure schema traceability

I'm still uncomfortable with 2 properties of frictionlessdata:

homepage is too sophisticated in my opinion; it should be only a URI; the schema structure should be as flat as possible to let people build schemas easily
contributors is worse and not correctly documented; implementations have to test what the role of the contributor is to know what to do with it; again it would be better, IMHO, to have a flat structure with author, contributor, etc. to keep it simple. So I kept author and contributor for the moment.

Are homepage and contributors definitely decided in frictionless data specs?

========================

Schemas Documentation

Overview

Documentation is a fundamental aspect of data sharing. Table Schema needs contextual metadata to let people and software understand and manage schemas.

That way, people creating or reusing schemas would have a better understanding of them. Some software would be able to produce user documentation based on schemas. In the future, other software would be able to use the metadata for different tasks, such as verifying schema integrity and provenance, crawling and cataloguing schemas, versioning schemas, etc.

Implementations

There are no known implementations at present.

Specification

To allow schema documentation, implementations MUST start with the schema object. This object contains the already defined fields array, as in a classic Tabular Data Package. It also contains some other properties useful for schema documentation. These other properties come from widely adopted practices for describing a resource.

Each schema MUST have a title, which represents the human-readable title of the schema. The title is the only required property.

All other properties SHOULD be implemented:

author is the author of the schema; one or more people and/or one or more organisations
publisher is the publisher of the schema; one or more organisations or one or more people
contributor is a string which lists the contributors to the schema
version is a version number of the schema, so that people can be sure they are talking about the same thing
created is the release date of the schema (ISO 8601 format; YYYY-MM-DD should be enough)
description is where author(s) can describe the schema in a few sentences; description is highly recommended; it can be a place to use keywords with #hashtags; Markdown format is encouraged
homepage is the home on the web that is related to this schema (not the schema itself); it should be where the schema environment is described: history, changelog, versions, etc.; this field is a well-formed URL
uri is the web address where the schema can be retrieved
previous is the URL of the previous version of the current schema (allowing tools to produce schema diffs)

A user might build a schema.json as follows:

{
  "schema": {
    "title": "Postal codes list schema",
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
    "version": "0.1 beta",
    "created": "2017-01-31",
    "description": "Postal codes list schema defines the raw list of postal codes in France.",
    "homepage": "http://example.com/postal-code-list-schema.html",
    "uri": "http://example.com/2017/03/13/schema.json",
    "previous": "http://example.com/2017/02/21/schema.json",
    "fields": [
      {
        "name": "postal_code",
        "title": "postal code",
        "type": "string"
      }
    ]
  }
}
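The previous property is meant to let tools produce schema diffs. A minimal Python sketch of such a diff, comparing fields by name, could look like this. The function name `diff_fields` is hypothetical, and fetching the descriptor behind the previous URL is deliberately left out.

```python
def diff_fields(previous: dict, current: dict) -> dict:
    """Compare the field lists of two schema descriptors by field name."""
    old = {f["name"]: f for f in previous["schema"]["fields"]}
    new = {f["name"]: f for f in current["schema"]["fields"]}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # A field counts as changed when any of its properties differ.
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


v1 = {"schema": {"fields": [{"name": "postal_code", "type": "integer"}]}}
v2 = {"schema": {"fields": [{"name": "postal_code", "type": "string"},
                            {"name": "city", "type": "string"}]}}
print(diff_fields(v1, v2))
# {'added': ['city'], 'removed': [], 'changed': ['postal_code']}
```

A real tool would load the current descriptor, follow its previous URL to obtain the older one, and then run a comparison like this.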

cbenz commented 6 years ago

Just to let you know, we generated Gitbook documentation for some of the schemas we are using, based on the enhancements proposed by @CharlesNepote, slightly adapted.

For example:

schema: https://git.opendatafrance.net/scdl/subventions/blob/master/schema.json
generated doc: https://dev.validata.fr/docs/schemas/scdl-subventions.html
generation script: https://git.opendatafrance.net/validata/validata-doc-generator/blob/master/table_schema_to_md.py

I intend to propose a "pattern" as suggested above in the coming weeks, to reopen the discussion about adding metadata to table schemas on a concrete basis.
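For readers who want a feel for this kind of generation, here is a minimal Python sketch that renders a descriptor as Markdown. It is not the linked table_schema_to_md.py script. It assumes the layout of the draft earlier in this thread (a top-level schema object with title, created and so on), and the function name `schema_to_markdown` is hypothetical.

```python
def schema_to_markdown(descriptor: dict) -> str:
    """Render a schema descriptor as a small Markdown document."""
    schema = descriptor["schema"]
    lines = [f"# {schema['title']}", ""]
    # Contextual metadata, listed only when present.
    for key in ("author", "version", "created", "homepage"):
        if key in schema:
            lines.append(f"- {key}: {schema[key]}")
    if "description" in schema:
        lines += ["", schema["description"]]
    lines += ["", "## Fields"]
    for field in schema.get("fields", []):
        # Use the human-readable title when there is one, else the name.
        heading = field.get("title", field["name"])
        lines += ["", f"### {heading} (`{field['name']}`)", ""]
        for key in ("type", "description", "examples"):
            if key in field:
                lines.append(f"- {key}: {field[key]}")
    return "\n".join(lines)


doc = schema_to_markdown({
    "schema": {
        "title": "Postal codes list schema",
        "version": "0.1 beta",
        "fields": [{"name": "postal_code", "title": "postal code", "type": "string"}],
    }
})
print(doc)
```

Because the same descriptor drives both validation and the rendered page, the documentation cannot drift away from the schema.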

rufuspollock commented 5 years ago

@cbenz great - that would be amazing!

johanricher commented 5 years ago

@rufuspollock @pwalsh @frictionlessdata Our team at @jailbreak-paris has increasingly used Table Schema for the past year and a half and we're seeing more and more adoption around us (including at @Etalab). We think it has a bright future. So we've rolled up our sleeves and (finally!) got to work on this. Our draft proposal and (most importantly) questions are here: https://github.com/frictionlessdata/specs/pull/627

rufuspollock commented 5 years ago

@johanricher and @cbenz this is awesome, thank you. I will comment on the PR. I also note you could reuse @CharlesNepote's proposal for text (which I think was great - we just did not get it as an actual PR against the patterns page https://github.com/frictionlessdata/specs/blob/master/specs/patterns.md)

johanricher commented 5 years ago

Hey @rufuspollock, thanks for the feedback! We followed your instructions by starting this new PR against the patterns page: #630. As with #627, we built upon @CharlesNepote's idea but also tried to remain closer to the other Frictionless Data specs by using properties from common.yml (e.g. author and publisher as proposed by Charles are redundant with the contributors property).
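To show what folding the flat properties into contributors could look like, here is a minimal Python sketch. The shape of each entry (an object with title and role keys) is an assumption that should be checked against common.yml, and the function name `to_contributors` is hypothetical.

```python
def to_contributors(schema: dict) -> list[dict]:
    """Fold flat author/publisher/contributor strings into one list.

    Assumption: each entry is an object with "title" and "role" keys.
    """
    contributors = []
    for role in ("author", "publisher", "contributor"):
        value = schema.get(role)
        if not value:
            continue
        # The draft stores several names in one comma-separated string.
        for name in value.split(","):
            name = name.strip()
            if name:
                contributors.append({"title": name, "role": role})
    return contributors


flat = {
    "author": "Jacques Facteur",
    "publisher": "Postal codes committee",
    "contributor": "Julie Martin, Max Dupont, Estelle Bois",
}
for entry in to_contributors(flat):
    print(entry)
```

The sketch does not separate an email address from a name, so a value like "Name <email>" would end up whole in title.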
