I18N and Metadata Translations for Data Package

relet commented 11 years ago

How should the standard support titles, descriptions and data fields in languages other than English?

Proposal (Nov 2016)

An internationalised field:

# i18n
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

Summary:

Each localizable string in datapackage.json could take two forms:

A simple string (for backward compatibility)
An object, mapping from ISO Locale codes (with or without the region specification, e.g. 'en', or 'es-ES') to their representations.
In this object, you could have an empty key "" which denotes the 'default' representation

Not all properties would be localizable for now. For the sake of simplicity, we limit this to only the following properties;

title (at package and resource level)
description (at package and resource level)

Default Language

You can define the default language for a data package using a lang attribute:

"lang": "en"

The default language if none is specified is English (?).

rufuspollock commented 11 years ago

I like the json-ld approach of @{lang-code}. I actually had this in the original version of simple data format (but it got removed in the quest for simplicity).

While i18n seems good I do wonder whether the occam's razor for standards should also be applied here: "how essential is this, and how many potential users will care about this feature?"

relet commented 11 years ago

I agree that it could be omitted, but that decision should then be mentioned in the standard or a FAQ:

How should I mark data that is not (described) in English language?
How should I handle data that is presented in several languages?
Can I provide these fields in several languages if I want to?

rufuspollock commented 11 years ago

I'm starting to think we could at least mention idea of using @ style stuff ...

trickvi commented 10 years ago

I actually quite like this but I would focus more on l10n than i18n especially since we're very likely to add foreign keys soon (issue #23). That would mean everybody could point to the same dataset which could include many locales (translations).

What I'm thinking is something like a new optional field for the datapackage specification: alternativeResources (since we've all of a sudden decided to go for lowerCamelCase instead of the previous _underscorekeywords even if that means we have to break backwards compatibility/consistency -- me not like but that's a different issue).

The form I'm thinking is something like:

{
    "name": "dataset-identifier",
    "...": "...",
    "resources": [
        {
            "name": "resource-identifier",
            "schema" : { "..." : "..." },
            "..." : "..."
        }
    ],
    "..." : "...",
    "alternativeResources" : {
        "resource-identifier": {
            "is-IS" : {
                "path": "/data/LC_messages/is_IS.csv",
                "format": "csv",
                "mediatype": "text/csv",
                "encoding": "<default utf8>",
                "bytes": 10000000,
                "hash": "<md5 hash of file>",
                "modified": "<iso8601 date>"
                "sources": "<source for this file>",
                "licenses": "<inherits from resource or datapackage>"
            },
            "de-DE" : { "..." : "..." },
            "..." : "..."
        }
    },
    "..." : "..."

At the moment I'm thinking the translations would be files with the exact same schema (so things are duplicated) because that makes it easier to do both translations (copy this file and translate the values you want) and implementation (want to get the Romanian version just fetch this resource instead).

I'm reluctant to calling alternativeResources something like l18n, translations or locales (even though that's what I'm using to identify the alternative resources) because I would like to be able to have other identifiers like for example "en-GB-simple" or something like that. For that, I'm thinking about datasets which I have in mind that would, for example, have COFOG classifications. This way the data package for COFOG classifications, could provide the official names for the COFOG categories, but also the simple jargonless version (which are used on WhereDoesMyMoneyGo and the translations of the classifications (the simple ones) like budzeti.ba or hvertferskatturinn.is use.

However that just opens up a new problem: How to standardise "locales/alternativeResources" identifiers? So maybe it's enough to just stick with locales as identifiers and stick to BCP 47. If people decide to create a jargonless version of a dataset then that would be a different dataset (with its own l10n). So we could just call it translations and live happily ever after.

rufuspollock commented 10 years ago

@tryggvib How often do people actually translate an entire dataset? Is it quite common?

trickvi commented 10 years ago

I think this applies to perhaps smaller datasets used with foreign keys. This could be datasets with names of all countries in the world so you can point to them instead of having them only in English, classification datasets like I mention etc. (I think this is the biggest use case).

I also think this is beneficial for datasets created in one non-English speaking country, that you want to make comparable to other datasets, for example as part of some global data initiative, so you would translate it into English and make that available. That way you can make the dataset available in two languages.

As a side note, it might be interesting to start some project to make dataset translations simpler ;)

pvgenuchten commented 10 years ago

Hi @tryggvib @rgrp, I found this thread while searching for i18n in datapackage.json. Most common usecase probably is that people will want to describe their dataset in more then a single language. However we've also found some cases where a full dataset is translated in multiple languages.

Looking at json-lds' @language attribute, seems there are three options available (http://www.w3.org/TR/json-ld/#string-internationalization)

{
  "@context": {
    ...
    "ex": "http://example.com/vocab/",
    "@language": "ja",
    "name": { "@id": "ex:name", "@language": null },
    "occupation": { "@id": "ex:occupation" },
    "occupation_en": { "@id": "ex:occupation", "@language": "en" },
    "occupation_cs": { "@id": "ex:occupation", "@language": "cs" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation": "忍者",
  "occupation_en": "Ninja",
  "occupation_cs": "Nindža",
  ...
}

or

{
  "@context":
  {
    ...
    "occupation": { "@id": "ex:occupation", "@container": "@language" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation":
  {
    "ja": "忍者",
    "en": "Ninja",
    "cs": "Nindža"
  }
  ...
}

or

{
  "@context": {
    ...
    "@language": "ja"
  },
  "name": "花澄",
  "occupation": {
    "@value": "Scientist",
    "@language": "en"
  }
}

first seems to have best backwards compat

Stiivi commented 10 years ago

To summarize my experience with translations: the translation is on multiple levels: metadata translation and data translation.

The metadata translation is simpler:

define keys which are localizable, such as labels, descriptions and comments
have way how to specify the localized values

Having the localization in the main file might be handy for the package reader, however it has a disadvantage of providing additional translations. One has to edit the file or have a tool that will combine multiple metadata specifications into one file. Much better solution is to have metadata translations as separate objects/files, for example datapackage-locale-XXXX.json or have a folder with LOCALE.json files or something like that. Much easier to move translations around. With multiple datasets with the same structure the translation is just about creating a simple copy of a file.

Data translation is slightly different. The localized data can be provided in multiple formats:

whole dataset copy per language
denormalized translation with one column per language, not all columns might be localized (for example european CPV was provided in this form for all languages)
normalized translation with a column specifying a language (many data of localized apps)

Question is: which case we would like to handle? All of them? Only certain ones?

How the translation is handled technically during data analysis process depends on the case:

The most relevant tables to be localized are the dimension tables, therefore I'm going to use them as an example.

whole dataset: JOIN table based on desired language
denormalized translation: switch columns based on language
normalized: use additional WHERE condition on the language column

As for specification requirements:

whole dataset: we just need to point to another resource with the SAME structure as the original one and assign a language to it
denormalized translation: specify which columns are localized; assign column names to their respective locales
normalized: specify which column contains the language code

As for the denormalized translation: do we want to provide "logical" column name or the original name? For example, the columns might be name_de, name_en, name_sk - do we want to provide only the name_XX to the user based on the user's language choice or rename it just to name?

In Cubes framework we are using the denormalized translation and hiding the original column names (stripping the locale column extension) – therefore the reports work regardless of language used. The reports even work when localized column was added to non-localized dataset later. But Cubes is metadata-heavy framework.

rufuspollock commented 8 years ago

@pwalsh @danfowler this is one to look at again.

pwalsh commented 8 years ago

@rgrp related, my long standing pull request, which deals with i18n in the resources themselves: https://github.com/dataprotocols/dataprotocols/pull/190

rufuspollock commented 8 years ago

@pwalsh I know - I still feel we should do metadata first then data.

akariv commented 8 years ago

I agree that starting with meta-data is a good idea.

My humble suggestion is that each localizable string in datapackage.json could take two forms:

A simple string (for backward compatibility)
An object, mapping from ISO Locale codes (with or without the region specification, e.g. 'en', or 'es-ES') to their representations. In this object, you could have an empty key "" which denotes the 'default' representation

(For the sake of simplicity, I also think that we could limit this to only apply for the title and description fields)

For example:

...
"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

pwalsh commented 8 years ago

Since we do lots of "string or object" type patterns in the Data Package specs generally, I'm partial to the suggestion made by @akariv. However, it could get complicated real quick if someone tries to apply this liberally to any string located anywhere on the datapackage.json descriptor (think: custom data structures of heavily nested objects).

One way to counter that is to limit translatable fields explicitly, but that kind of goes against the flexibility of the family of Data Package specifications in general.

I'd suggest something that follows on from the pattern I suggest for data localisation here

Where:

@ becomes a special symbol in keys, denoting a translated field
What follows @ is a language code
What precedes @ is a property name, expected to match another property of the dp.

I also think that the distinction between localisation and translation is important, and would again suggest the same concept as I suggest for data, here. Note that this is not some invention: the pattern I'm suggesting is heavily influenced by my work with translation and localisation using Django, and probably is quite consistent with other web frameworks.

Example:

{
  "name": "School of Rock",
  "description": "A school, for Rock.",
  "name@he": "בית הספר לרוק",
  "description@he": "בית ספר, לרוק"
}

akariv commented 8 years ago

@pwalsh a two comments:

The reason I suggested we use this pattern only for title and description fields, is that having multiple translations for other fields is probably pointless (and tbh user-supplied fields can use whichever scheme they want).
I really like your suggestion, but don't you think that your scheme might result in a lot of clutter? For example, imagine translating a few fields to 20+ languages? JSON doesn't have any inherent ordering of object keys, which could make things quite messy...

pwalsh commented 8 years ago

@akariv

On the first point, user-specified fields on Data Package are part of the design of the spec, and with the way the family of specs works, I do think it would be unusual to explicitly say only specific fields are translatable.

On the second point: yes, it would result in a lot of clutter. I guess we have to decide if we are optimising for human reading of the spec too. Al alternate approach would be to group everything by language which would at least be an ordered type of clutter :).

{
  "translations": {
    "he": { ..  all translated properties ...},
    ... etc ...
  }
}

akariv commented 8 years ago

(What I meant was not that only these two fields are translatable, but that only for them the spec specifies a method for translating - and other user-specified fields may use a different scheme - although in second though that may not be the best practice).

As for readability - I think that is definitely a factor (as someone said: "JSON is readable as simple text making it amenable to management and processing using simple text tools")

And your suggestion does improve things in terms of clutter, but it somehow doesn't feel right to me to separate the original value from the translation.

pwalsh commented 8 years ago

@akariv yes, it is not a simple problem to solve. Maybe we should be optimising for cases of a handful of translations - say: 2-5 languages. And, acknowledging the fact that it is likely that we might expect, say, 2-5 translatable properties on a giving package?

rufuspollock commented 7 years ago

So, I've thought quite a bit about this and I generally agree with @akariv approach:

"title": { 
    "": "Israel's 2015 budget", 
    "he-IL": "תקציב לשנת 2015", 
    "es": "Presupuestos Generales del Estado para el año 2015 de Israel" 
}
...

I've updated the main description of the issue with a relatively full spec based on this.

Welcome comments from @frictionlessdata/specs-working-group

Research

JSON-LD: http://json-ld.org/spec/latest/json-ld/#string-internationalization
- Has approach we have here plus some other more subtle options related to specifics of JSON-LD
Originally quite liked title@en. However, this does not seem common and JSON-LD no longer supports this. As @akariv points out it does not necessarily sort well and bloats the top level of the JSON. Also the @ in json property names is kind of annoying.
It is also what I see around on the web e.g. https://www.drzon.net/approaches-to-json-internationalisation-i18n/

pwalsh commented 7 years ago

@rufuspollock agreed.

In my opinion, we do need lang or languages as well as the actual handling of translations for properties. See the pattern described here

I prefer the array and the special treatment of the first element in the array, as per my pattern. Another approach, like in Django for example, is LANGUAGE_CODE for the default lang and an additional LANGUAGES array for the supported translations. But I'm not convinced of the need for two different properties.

pwalsh commented 7 years ago

@rufuspollock let's schedule this for v1.1 - there are lots of changes for v1 and they should settle before we introduce translations, esp. as the proposal here uses the dynamic type pattern we moved away from in v1.

rufuspollock commented 7 years ago

@pwalsh agreed.

ppKrauss commented 7 years ago

Hi, no news here (only later v1.1)?

If "real life example" is useful to this discussion ... My approach (while no v1.1) at datasets-br/state-codes's datapackage.json, was to add lang descriptor and lang-suffix differentiator. The lang at source level, as default for all fields.

Hum... the interpretation was "language of the descriptions (!and CSV textual contents)".

If some field or descriptor need to use other language, I use a suffix -{lang}. In the example we used title as default (en) and title-pt for Portuguese title.

frictionlessdata / datapackage