frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.
https://datapackage.org
The Unlicense

Promote DataCite-compatible languages support to the specs #925

Open roll opened 2 months ago

roll commented 2 months ago

Overview

Since we already have a Languages recipe, and DataCite provides a de facto standard way to support languages, we could move forward and finally promote it to the specs.

cc @augusto-herrmann

augusto-herrmann commented 2 months ago

@roll does DataCite even use Data Packages or Table Schema to begin with? I have skimmed their documentation and all of their examples are in XML. Also, their language support seems to describe the language of the resource, just like the current Table Schema pattern you linked to above, not the language of the metadata.

What I'm missing is a way to describe the metadata (resource title and description, column names and descriptions) in multiple languages, while the data itself remains in a single language.

Example

Metadata is provided in multiple languages.

in animals.datapackage.en.yaml:

resources:
  - name: animals
    path: animals.csv
    title: Animals
    schema:
      fields:
        - name: id
          type: integer
        - name: animal
          title: Animal species name
          type: string

in animals.datapackage.ru.yaml:

resources:
  - name: animals
    path: animals.csv
    title: Животные
    schema:
      fields:
        - name: id
          type: integer
        - name: animal
          title: Название вида животного (на английском языке)
          type: string

The CSV file (the data itself) has only one version, in English:

id,animal
1,cat
2,dog
3,giraffe
4,bat
5,leopard
6,lion
7,tiger
8,elephant
9,panda
10,rabbit
11,chicken
12,cow
13,horse
14,sheep
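
Because both descriptor files describe the same CSV file, they should agree on every property that is not translated. A minimal sketch of such a consistency check follows. It assumes the descriptors have already been loaded into plain Python dicts and that only `title` and `description` carry translatable text. The names `TRANSLATABLE_KEYS`, `strip_translatable` and `same_technical_metadata` are hypothetical helpers, not part of any Data Package library.

```python
import copy

# Assumption: only these keys hold language-specific text.
TRANSLATABLE_KEYS = {"title", "description"}


def strip_translatable(node):
    """Return a copy of a descriptor with translatable keys removed at every level."""
    if isinstance(node, dict):
        return {
            key: strip_translatable(value)
            for key, value in node.items()
            if key not in TRANSLATABLE_KEYS
        }
    if isinstance(node, list):
        return [strip_translatable(item) for item in node]
    return node


def same_technical_metadata(a, b):
    """True when two descriptors differ only in their translatable keys."""
    return strip_translatable(a) == strip_translatable(b)


# The two descriptors from the example, written as dicts.
EN = {"resources": [{
    "name": "animals", "path": "animals.csv", "title": "Animals",
    "schema": {"fields": [
        {"name": "id", "type": "integer"},
        {"name": "animal", "title": "Animal species name", "type": "string"},
    ]},
}]}
RU = {"resources": [{
    "name": "animals", "path": "animals.csv", "title": "Животные",
    "schema": {"fields": [
        {"name": "id", "type": "integer"},
        {"name": "animal",
         "title": "Название вида животного (на английском языке)",
         "type": "string"},
    ]},
}]}

# Simulate one file drifting out of sync: a type changed in only one language.
DRIFTED = copy.deepcopy(RU)
DRIFTED["resources"][0]["schema"]["fields"][0]["type"] = "string"
```

With these inputs, `same_technical_metadata(EN, RU)` holds, while comparing `EN` against `DRIFTED` does not, which is the kind of silent divergence a check like this could catch.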

Pattern

This undocumented pattern already works. We already use it.

The problem is that the type information (integer, string) and other non-language-specific metadata (e.g. null values, validation rules, etc.) have to be repeated in each data package metadata file. That's bad for maintenance: types and validation rules may evolve, and you have to manually track those changes across several versions of the data package metadata file and keep them in sync. It would be great if I could define that technical metadata only once and in one place.
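
One way to picture "define it once" is a single base descriptor that holds the technical metadata, plus a small per-language overlay that holds only the translated text. The sketch below is an illustration of that idea, not something the specs define today. The overlay layout (keyed by resource name, then field name) and the function name `apply_translation` are assumptions made up for this example.

```python
import copy

# Assumption: only these keys may be supplied by a translation overlay.
TRANSLATABLE_KEYS = ("title", "description")

# Technical metadata, defined once.
BASE = {"resources": [{
    "name": "animals",
    "path": "animals.csv",
    "schema": {"fields": [
        {"name": "id", "type": "integer"},
        {"name": "animal", "type": "string"},
    ]},
}]}

# Hypothetical overlays: {resource_name: {"title": ..., "fields": {field_name: {...}}}}
EN = {"animals": {
    "title": "Animals",
    "fields": {"animal": {"title": "Animal species name"}},
}}
RU = {"animals": {
    "title": "Животные",
    "fields": {"animal": {"title": "Название вида животного (на английском языке)"}},
}}


def apply_translation(base, translation):
    """Overlay translated text onto a deep copy of the base descriptor."""
    result = copy.deepcopy(base)
    for resource in result.get("resources", []):
        texts = translation.get(resource["name"], {})
        for key in TRANSLATABLE_KEYS:
            if key in texts:
                resource[key] = texts[key]
        field_texts = texts.get("fields", {})
        for field in resource.get("schema", {}).get("fields", []):
            overlay = field_texts.get(field["name"], {})
            # Only translatable keys are copied, so an overlay cannot change a type.
            for key in TRANSLATABLE_KEYS:
                if key in overlay:
                    field[key] = overlay[key]
    return result


descriptor_ru = apply_translation(BASE, RU)
descriptor_en = apply_translation(BASE, EN)
```

Here a change to a type or validation rule would be made in `BASE` only, and each language-specific descriptor would be regenerated from it.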

roll commented 2 months ago

@augusto-herrmann Thanks a lot for writing it down! Just trying to gather all the information now.