Closed — relet closed this issue 3 weeks ago.
I like the json-ld approach of @{lang-code}. I actually had this in the original version of simple data format (but it got removed in the quest for simplicity).
While i18n seems good, I do wonder whether the Occam's razor for standards should also be applied here: "how essential is this, and how many potential users will care about this feature?"
I agree that it could be omitted, but that decision should then be mentioned in the standard or a FAQ:
I'm starting to think we could at least mention idea of using @ style stuff ...
I actually quite like this, but I would focus more on l10n than i18n, especially since we're very likely to add foreign keys soon (issue #23). That would mean everybody could point to the same dataset, which could include many locales (translations).
What I'm thinking is something like a new optional field for the datapackage specification: alternativeResources
(since we've all of a sudden decided to go for lowerCamelCase instead of the previous `underscore_keywords`, even if that means we have to break backwards compatibility/consistency -- me not like, but that's a different issue).
The form I'm thinking is something like:
```json
{
  "name": "dataset-identifier",
  "...": "...",
  "resources": [
    {
      "name": "resource-identifier",
      "schema": { "...": "..." },
      "...": "..."
    }
  ],
  "...": "...",
  "alternativeResources": {
    "resource-identifier": {
      "is-IS": {
        "path": "/data/LC_messages/is_IS.csv",
        "format": "csv",
        "mediatype": "text/csv",
        "encoding": "<default utf8>",
        "bytes": 10000000,
        "hash": "<md5 hash of file>",
        "modified": "<iso8601 date>",
        "sources": "<source for this file>",
        "licenses": "<inherits from resource or datapackage>"
      },
      "de-DE": { "...": "..." },
      "...": "..."
    }
  },
  "...": "..."
}
```
At the moment I'm thinking the translations would be files with the exact same schema (so things are duplicated) because that makes it easier to do both translations (copy this file and translate the values you want) and implementation (want to get the Romanian version just fetch this resource instead).
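To make the consumption side concrete, here is a minimal sketch (my own, not part of the proposal) of how a reader might look up an alternative resource for a locale and fall back to the primary resource when no translation exists. The descriptor keys mirror the example above; the function name is hypothetical.

```python
def find_alternative(descriptor, resource_name, locale):
    """Return the proposed alternativeResources entry for `locale`,
    or None so the caller falls back to the primary resource."""
    alternatives = descriptor.get("alternativeResources", {})
    return alternatives.get(resource_name, {}).get(locale)

descriptor = {
    "name": "dataset-identifier",
    "resources": [{"name": "resource-identifier", "path": "/data/default.csv"}],
    "alternativeResources": {
        "resource-identifier": {
            "is-IS": {"path": "/data/LC_messages/is_IS.csv", "format": "csv"}
        }
    },
}

# Icelandic translation exists; Romanian does not, so use the default file.
assert find_alternative(descriptor, "resource-identifier", "is-IS")["path"] == "/data/LC_messages/is_IS.csv"
assert find_alternative(descriptor, "resource-identifier", "ro-RO") is None
```

Because the translated files share the primary resource's schema, swapping one for the other needs no further logic.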
I'm reluctant to call `alternativeResources` something like `l10n`, `translations`, or `locales` (even though that's what I'm using to identify the alternative resources), because I would like to be able to have other identifiers, for example "en-GB-simple" or something like that. Here I'm thinking about datasets which would, for example, hold COFOG classifications. This way the data package for COFOG classifications could provide the official names of the COFOG categories, but also the simple, jargonless versions (which are used on WhereDoesMyMoneyGo), and translations of those simple classifications, as budzeti.ba or hvertferskatturinn.is use.

However, that just opens up a new problem: how to standardise the "locales/alternativeResources" identifiers? So maybe it's enough to just stick with locales as identifiers and stick to BCP 47. If people decide to create a jargonless version of a dataset, then that would be a different dataset (with its own l10n). So we could just call it `translations` and live happily ever after.
@tryggvib How often do people actually translate an entire dataset? Is it quite common?
I think this applies to perhaps smaller datasets used with foreign keys. These could be datasets with the names of all countries in the world, so you can point to them instead of having them only in English, or classification datasets like the ones I mention, etc. (I think this is the biggest use case).
I also think this is beneficial for datasets created in one non-English speaking country, that you want to make comparable to other datasets, for example as part of some global data initiative, so you would translate it into English and make that available. That way you can make the dataset available in two languages.
As a side note, it might be interesting to start some project to make dataset translations simpler ;)
Hi @tryggvib @rgrp, I found this thread while searching for i18n in datapackage.json. The most common use case is probably that people will want to describe their dataset in more than a single language. However, we've also found some cases where a full dataset is translated into multiple languages.
Looking at JSON-LD's `@language` attribute, it seems there are three options available (http://www.w3.org/TR/json-ld/#string-internationalization):
```json
{
  "@context": {
    ...
    "ex": "http://example.com/vocab/",
    "@language": "ja",
    "name": { "@id": "ex:name", "@language": null },
    "occupation": { "@id": "ex:occupation" },
    "occupation_en": { "@id": "ex:occupation", "@language": "en" },
    "occupation_cs": { "@id": "ex:occupation", "@language": "cs" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation": "忍者",
  "occupation_en": "Ninja",
  "occupation_cs": "Nindža",
  ...
}
```
or
```json
{
  "@context": {
    ...
    "occupation": { "@id": "ex:occupation", "@container": "@language" }
  },
  "name": "Yagyū Muneyoshi",
  "occupation": {
    "ja": "忍者",
    "en": "Ninja",
    "cs": "Nindža"
  }
  ...
}
```
or
```json
{
  "@context": {
    ...
    "@language": "ja"
  },
  "name": "花澄",
  "occupation": {
    "@value": "Scientist",
    "@language": "en"
  }
}
```
The first seems to have the best backwards compatibility.
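The second option (the `@language` container, i.e. a language map) is the easiest to consume programmatically. A small sketch, not a JSON-LD processor, just the lookup-with-fallback a reader would need; the default language here is assumed from the example's `@context`:

```python
def localized(value, lang, default_lang="ja"):
    """Resolve a JSON-LD-style language map to a single string.
    Plain strings are returned as-is (no translations available);
    missing languages fall back to the default language."""
    if isinstance(value, dict):
        return value.get(lang, value.get(default_lang))
    return value

occupation = {"ja": "忍者", "en": "Ninja", "cs": "Nindža"}
assert localized(occupation, "cs") == "Nindža"
assert localized(occupation, "fr") == "忍者"          # falls back to default
assert localized("Yagyū Muneyoshi", "en") == "Yagyū Muneyoshi"
```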
To summarize my experience with translations: translation happens on multiple levels, metadata translation and data translation.
The metadata translation is simpler:
Having the localization in the main file might be handy for the package reader; however, it has a disadvantage when providing additional translations: one has to edit the file, or have a tool that will combine multiple metadata specifications into one file. A much better solution is to have metadata translations as separate objects/files, for example `datapackage-locale-XXXX.json`, or a folder with `LOCALE.json` files, or something like that. That makes it much easier to move translations around. With multiple datasets sharing the same structure, translation then just means creating a simple copy of a file.
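The separate-files approach reduces to a merge step at read time. A sketch under the assumption that a locale file (say, a hypothetical `datapackage-locale-de.json`) contains only the keys it overrides:

```python
def apply_locale(base, locale_overrides):
    """Shallow-merge a locale file's keys over the base descriptor.
    Keys absent from the locale file keep their base values."""
    merged = dict(base)
    merged.update(locale_overrides)
    return merged

base = {"name": "budget", "title": "Budget 2015", "format": "csv"}
de = {"title": "Haushalt 2015"}  # contents of the hypothetical locale file
merged = apply_locale(base, de)
assert merged["title"] == "Haushalt 2015"
assert merged["format"] == "csv"   # untranslated keys survive
```

A shallow merge is deliberately simple; nested structures (e.g. per-resource titles) would need a recursive merge, which is one of the details a spec would have to pin down.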
Data translation is slightly different. The localized data can be provided in multiple formats:
The question is: which cases would we like to handle? All of them? Only certain ones?
How the translation is handled technically during data analysis process depends on the case:
The most relevant tables to localize are the dimension tables, so I'm going to use them as an example.
As for specification requirements:
As for the denormalized translation: do we want to provide the "logical" column name or the original name? For example, the columns might be `name_de`, `name_en`, `name_sk`. Do we want to provide only the `name_XX` column to the user based on the user's language choice, or rename it to just `name`?
In the Cubes framework we use denormalized translation and hide the original column names (stripping the locale column extension); therefore the reports work regardless of the language used. The reports even work when a localized column was added to a non-localized dataset later. But Cubes is a metadata-heavy framework.
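The column-stripping behaviour described above can be sketched as follows. This is inspired by the description of Cubes, not its actual implementation; the known set of locale suffixes is assumed to come from metadata:

```python
def strip_locale_columns(columns, locale, locales=("de", "en", "sk")):
    """For the chosen locale: rename `name_de` -> `name`, drop the
    other languages' columns, and keep non-localized columns as-is."""
    suffix = "_" + locale
    out = []
    for col in columns:
        if col.endswith(suffix):
            out.append(col[: -len(suffix)])          # localized: strip suffix
        elif any(col.endswith("_" + loc) for loc in locales):
            continue                                  # other language: drop
        else:
            out.append(col)                           # non-localized: keep
    return out

assert strip_locale_columns(["id", "name_de", "name_en", "name_sk"], "de") == ["id", "name"]
```

Because the output schema is the same for every locale, reports written against the logical `name` column work regardless of the user's language choice, which is exactly the property described above.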
@pwalsh @danfowler this is one to look at again.
@rgrp related, my long standing pull request, which deals with i18n in the resources themselves: https://github.com/dataprotocols/dataprotocols/pull/190
@pwalsh I know - I still feel we should do metadata first then data.
I agree that starting with meta-data is a good idea.
My humble suggestion is that each localizable string in datapackage.json could take two forms:

(For the sake of simplicity, I also think we could limit this to apply only to the `title` and `description` fields.)
For example:
```json
...
"title": {
  "": "Israel's 2015 budget",
  "he-IL": "תקציב לשנת 2015",
  "es": "Presupuestos Generales del Estado para el año 2015 de Israel"
}
...
```
Since we do lots of "string or object" type patterns in the Data Package specs generally, I'm partial to the suggestion made by @akariv. However, it could get complicated real quick if someone tries to apply this liberally to any string located anywhere on the `datapackage.json` descriptor (think: custom data structures of heavily nested objects).
One way to counter that is to limit translatable fields explicitly, but that kind of goes against the flexibility of the family of Data Package specifications in general.
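For reference, resolving the "string or object" form is cheap on the consumer side. A sketch assuming the empty-string key holds the untranslated default, as in @akariv's example above:

```python
def resolve(field, lang):
    """Resolve a localizable field that is either a plain string or an
    object whose "" key is the untranslated default value."""
    if isinstance(field, str):
        return field
    return field.get(lang, field[""])

title = {
    "": "Israel's 2015 budget",
    "he-IL": "תקציב לשנת 2015",
}
assert resolve(title, "he-IL") == "תקציב לשנת 2015"
assert resolve(title, "fr") == "Israel's 2015 budget"  # falls back to default
assert resolve("Plain title", "he-IL") == "Plain title"
```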
I'd suggest something that follows on from the pattern I suggest for data localisation here, where:

- `@` becomes a special symbol in keys, denoting a translated field
- what follows the `@` is a language code
- what precedes the `@` is a property name, expected to match another property of the dp.

I also think that the distinction between localisation and translation is important, and would again suggest the same concept as I suggest for data, here. Note that this is not some invention: the pattern I'm suggesting is heavily influenced by my work with translation and localisation using Django, and is probably quite consistent with other web frameworks.
Example:
```json
{
  "name": "School of Rock",
  "description": "A school, for Rock.",
  "name@he": "בית הספר לרוק",
  "description@he": "בית ספר, לרוק"
}
```
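The key-suffix pattern resolves just as easily; a sketch (the function name is mine) of looking up `prop@lang` with fallback to the untranslated property:

```python
def resolve_at_suffix(descriptor, prop, lang):
    """Look up "prop@lang" (e.g. "name@he"), falling back to the
    untranslated property when no translation exists."""
    return descriptor.get(prop + "@" + lang, descriptor.get(prop))

dp = {
    "name": "School of Rock",
    "name@he": "בית הספר לרוק",
}
assert resolve_at_suffix(dp, "name", "he") == "בית הספר לרוק"
assert resolve_at_suffix(dp, "name", "fr") == "School of Rock"
```

One consequence of this flat layout: translated keys sit at the same level as everything else, which is the sorting/clutter concern raised later in the thread.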
@pwalsh, two comments. One reason for limiting this to the `title` and `description` fields is that having multiple translations for other fields is probably pointless (and tbh user-supplied fields can use whichever scheme they want).

@akariv
On the first point, user-specified fields on Data Package are part of the design of the spec, and with the way the family of specs works, I do think it would be unusual to explicitly say only specific fields are translatable.
On the second point: yes, it would result in a lot of clutter. I guess we have to decide if we are optimising for human reading of the spec too. An alternative approach would be to group everything by language, which would at least be an ordered type of clutter :).
```json
{
  "translations": {
    "he": { ... all translated properties ... },
    ... etc ...
  }
}
```
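Applying such a grouped `translations` block is again an overlay operation; a sketch under the assumption that each language object holds complete replacement values for top-level properties:

```python
def translate(descriptor, lang):
    """Overlay the `translations[lang]` properties on the descriptor,
    then drop the translations block from the result."""
    translated = dict(descriptor)
    translated.update(descriptor.get("translations", {}).get(lang, {}))
    translated.pop("translations", None)
    return translated

dp = {"title": "Budget", "translations": {"he": {"title": "תקציב"}}}
assert translate(dp, "he")["title"] == "תקציב"
assert translate(dp, "fr")["title"] == "Budget"  # unknown language: originals kept
```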
(What I meant was not that only these two fields are translatable, but that only for them does the spec specify a method of translating, and other user-specified fields may use a different scheme; although on second thought, that may not be the best practice.)
As for readability - I think that is definitely a factor (as someone said: "JSON is readable as simple text making it amenable to management and processing using simple text tools")
And your suggestion does improve things in terms of clutter, but it somehow doesn't feel right to me to separate the original value from the translation.
@akariv yes, it is not a simple problem to solve. Maybe we should be optimising for cases with a handful of translations, say 2-5 languages, while acknowledging that we might likewise expect, say, 2-5 translatable properties on a given package?
So, I've thought quite a bit about this and I generally agree with @akariv's approach:
```json
"title": {
  "": "Israel's 2015 budget",
  "he-IL": "תקציב לשנת 2015",
  "es": "Presupuestos Generales del Estado para el año 2015 de Israel"
}
...
```
I've updated the main description of the issue with a relatively full spec based on this.
Comments welcome from @frictionlessdata/specs-working-group.
An alternative would be suffixed keys like `title@en`. However, this does not seem common, and JSON-LD no longer supports it. As @akariv points out, it does not necessarily sort well and it bloats the top level of the JSON. Also, the `@` in JSON property names is kind of annoying.

@rufuspollock agreed.
In my opinion, we do need `lang` or `languages`, as well as the actual handling of translations for properties. See the pattern described here.

I prefer the array and the special treatment of the first element in the array, as per my pattern. Another approach, like in Django for example, is a `LANGUAGE_CODE` for the default lang and an additional `LANGUAGES` array for the supported translations. But I'm not convinced of the need for two different properties.
@rufuspollock let's schedule this for v1.1 - there are lots of changes for v1 and they should settle before we introduce translations, esp. as the proposal here uses the dynamic type pattern we moved away from in v1.
@pwalsh agreed.
Hi, any news here (or only later, in v1.1)?
If a "real life example" is useful to this discussion: my approach (while there is no v1.1), in datasets-br/state-codes's `datapackage.json`, was to add a `lang` descriptor and a lang-suffix differentiator. The `lang` at source level acts as the default for all fields.

Hum... the interpretation was "language of the descriptions (and of the CSV textual contents!)".

If some field or descriptor needs to use another language, I use a `-{lang}` suffix. In the example we used `title` as the default (en) and `title-pt` for the Portuguese title.
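That real-life convention resolves much like the `@`-suffix pattern, just with a hyphen. A sketch (not code from datasets-br/state-codes):

```python
def title_for(descriptor, lang):
    """Resolve the `-{lang}` suffix convention: `title` is the default,
    `title-pt` the Portuguese variant, and so on."""
    return descriptor.get("title-" + lang, descriptor.get("title"))

dp = {"lang": "en", "title": "State codes", "title-pt": "Códigos de estado"}
assert title_for(dp, "pt") == "Códigos de estado"
assert title_for(dp, "en") == "State codes"  # no suffixed key: default title
```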
How should the standard support titles, descriptions and data fields in languages other than English?
Proposal (Nov 2016)
An internationalised field:
Summary:
Each localizable string in datapackage.json could take two forms:
Not all properties would be localizable for now. For the sake of simplicity, we limit this to only the following properties:
Default Language
You can define the default language for a data package using a `lang` attribute. The default language, if none is specified, is English (?).
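A minimal sketch of that rule, with the English default hedged exactly as in the proposal (the "(?)" above marks it as still under discussion):

```python
def package_lang(descriptor):
    """Return the package's default language from the proposed `lang`
    attribute, assuming "en" when none is specified (per the proposal's
    tentative default)."""
    return descriptor.get("lang", "en")

assert package_lang({"name": "budget", "lang": "he"}) == "he"
assert package_lang({"name": "budget"}) == "en"
```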