Open turnbullerin opened 2 weeks ago
Here is the draft text we were looking at in the discussion thread
ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)
Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.
ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)
Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.
NEW SECTION
Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. the Latin script), and/or other features of the natural language. Locales are defined by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as en-CA for Canadian English in the default script. This section defines the standard pattern for localizing the contents of a NetCDF file.
Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.
A "localized file" is one that provides the global attribute localizations. If present, the attribute must contain a space-delimited list of words in the format suffix: language_tag. For example, the string "default: en _fr: fr-CA _es: es-MX" specifies that the default locale of the file is en, that the suffix _fr indicates content in the fr-CA locale, and that the suffix _es indicates content in the es-MX locale. Suffixes may be any text string allowed in an attribute or variable name, but it is strongly recommended that they be chosen so that they are clearly associated with the locale they represent.
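As a rough illustration of the format described above, the attribute value could be parsed into a suffix-to-tag mapping along these lines. This is only a sketch: the function name parse_localizations is hypothetical, and the strictness (rejecting unpaired words or repeated suffixes) is an assumption rather than something the draft text requires.

```python
def parse_localizations(value: str) -> dict[str, str]:
    """Parse a 'suffix: language_tag' list into {suffix: language_tag}.

    The keyword 'default' is kept as an ordinary key of the result.
    """
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected space-delimited 'suffix: language_tag' pairs")

    mapping: dict[str, str] = {}
    # Words alternate: 'suffix:' then 'language_tag'.
    for key, tag in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":") or len(key) == 1:
            raise ValueError(f"expected a suffix followed by ':', got {key!r}")
        suffix = key[:-1]
        if suffix in mapping:
            # Assumption: a suffix may be declared only once.
            raise ValueError(f"suffix {suffix!r} is declared more than once")
        mapping[suffix] = tag
    return mapping


print(parse_localizations("default: en _fr: fr-CA _es: es-MX"))
# {'default': 'en', '_fr': 'fr-CA', '_es': 'es-MX'}
```

The sketch does not validate that each tag is well-formed BCP 47; a real implementation would probably want to.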
The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen.
An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section.
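A checker could flag some likely violations of this rule mechanically. The sketch below assumes one particular interpretation that the draft does not state: a name ending in a declared suffix is suspicious when no unsuffixed counterpart exists. The function name orphaned_localized_names is hypothetical.

```python
def orphaned_localized_names(names: list[str], suffixes: list[str]) -> list[str]:
    """Return names that end with a locale suffix but have no unsuffixed counterpart."""
    present = set(names)
    orphans = []
    for name in names:
        for suffix in suffixes:
            # A name consisting only of the suffix is not treated as localized.
            if name.endswith(suffix) and len(name) > len(suffix):
                base = name[: -len(suffix)]
                if base not in present:
                    orphans.append(name)
                break
    return orphans


names = ["title", "title_fr", "summary_es", "depth"]
print(orphaned_localized_names(names, ["_fr", "_es"]))
# ['summary_es']
```

A check like this cannot tell whether a suffixed name was intended as a localization, so it can only surface candidates for a human to review.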
Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.
Localized attributes are created by appending a locale suffix to the usual attribute name. For example:
variables:
:localizations = "default: en-CA _fr: fr-CA _es: es-MX";
:title = "English Title"; // English title
:title_fr = "Titre français"; // French title
:title_es = "Título en español"; // Spanish title
:summary = "English Summary";
:summary_fr = "Sommaire français";
// omitted Spanish summary means English will be used instead
double salinity(i);
salinity:long_name = "Salinity";
salinity:long_name_fr = "Salinité";
salinity:long_name_es = "Salinidad";
Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable. Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e. weather_obs[0] is the English text and weather_obs_fr[0] is the French text of the same value).
variables:
:localizations = "default: en-CA _fr: fr-CA _es: es-MX";
char weather_obs(i);
weather_obs:long_name = "Weather Conditions";
char weather_obs_fr(i);
weather_obs_fr:long_name = "Observations Météorologiques";
char weather_obs_es(i);
weather_obs_es:long_name = "Observaciones Meteorológicas";
data:
weather_obs = "sunny", "rainy", ...;
weather_obs_fr = "ensoleillé", "pluvieux", ...;
weather_obs_es = "soleado", "lluvioso", ...;
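The structural requirements on localized variables (same data type, same dimensions, same number of elements) could be verified with something like the following. It is a library-independent sketch: the VarInfo record and the function check_localized_variables are hypothetical stand-ins for whatever a NetCDF library exposes, and the check cannot confirm that elements appear in the same order, since that depends on the meaning of the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarInfo:
    """Hypothetical summary of a variable's structure."""
    dtype: str
    dimensions: tuple[str, ...]
    size: int


def check_localized_variables(variables: dict[str, VarInfo],
                              suffixes: list[str]) -> list[str]:
    """Return a message for each localized variable that differs from its base."""
    problems = []
    for name, info in variables.items():
        for suffix in suffixes:
            if not name.endswith(suffix) or len(name) == len(suffix):
                continue
            base_name = name[: -len(suffix)]
            base = variables.get(base_name)
            # Names without an unsuffixed counterpart are skipped here.
            if base is not None and info != base:
                problems.append(f"{name} does not match {base_name}: {info} vs {base}")
            break
    return problems


variables = {
    "weather_obs": VarInfo("char", ("i",), 10),
    "weather_obs_fr": VarInfo("char", ("i",), 10),
    "weather_obs_es": VarInfo("char", ("i",), 8),  # deliberately inconsistent
}
for problem in check_localized_variables(variables, ["_fr", "_es"]):
    print(problem)
```

Only weather_obs_es is reported in this example, because its element count differs from that of weather_obs.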
ADDITION TO APPENDIX A
References https://www.rfc-editor.org/info/bcp47 https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
EDIT NOTES:
cell_methods and updated mine to match
This looks good. Nice work!
flag_meanings seems like it ought to be locale-aware.
I think it could be hard to make history locale-aware. The history attributes in the files I work with mostly consist of sequences of command-line invocations, which you don't want to translate because then it's not the command that was actually used. When there is a comment, it's usually something that a piece of software added automatically, rather than the file creator, so there's no control over what language it's in. Depending on the tools you use, it seems like it wouldn't be hard to end up with a history attribute that's multi-lingual or in a different language than the one you want to declare as the default.
I'm struggling with this for reasons I'm having trouble articulating (and am not sure are even valid reasons), but I'll try.
I don't think CF should introduce attribute and variable name modifications as a concept. Section 2.5 starts out with:
This convention does not standardize variable names.
With this proposal, it would be except in cases of localization. This proposal also places restrictions on variable/attribute names and their suffixes.
I think the ERDDAP "elephant in the room" is important and relevant here. For ERDDAP adoption in Canada, it needs to display localized content, so this proposal must work in a real world application. I'm fearful that this will not work in ERDDAP for some reason we don't know and won't know until an implementation is worked on and the standard has already been codified.
Would there be willingness on the ERDDAP side to implement something that isn't quite in CF yet? And on the CF side, would we be willing to codify whatever ends up working best in ERDDAP? Realizing that someone involved in both communities might need to actually do the coding.
@DocOtak @turnbullerin As I told Erin, the best way to help see if this can be done is to contribute code to ERDDAP that does so and can be tested. There are ready-to-go development environments, and compiling and testing have been simplified. Chris is only part-time but would, I am sure, be happy to work with people on this.
For an update, the ERDDAP issue now has a link to a working prototype I developed based on this proposal for the title attribute - see https://github.com/ERDDAP/erddap/issues/114
Moderator
TBD
Moderator Status Review
None
Requirement Summary
Metadata includes natural language text in several places, notably the title and long_name attributes, as well as potentially in character data variables. Other metadata standards, such as ISO-19115, support the translation of these elements, and translation is mandatory in some places, such as in files generated by the Canadian Government. By standardizing how these elements are specified in a fashion that is both human-readable and machine-readable, users can identify metadata in their preferred language more easily and computer applications can display metadata to match users' preferences and, where this is not possible, at least use appropriate accessibility techniques. Also of key importance is compatibility with applications such as ERDDAP, which uses NetCDF files following the CF conventions to create a web interface for selecting and downloading data. For this reason, we decided not to use the new .fr-CA suffix as a required format as it would not be compatible with ERDDAP; instead, data providers are free to choose suffixes that meet their use case.
Technical Proposal Summary
Based on discussions in https://github.com/cf-convention/discuss/issues/244, the following proposal seemed acceptable: (1) the creation of a new global attribute that maps suffixes to BCP 47 language tags as well as specifying the default language tag in the file, (2) designating that any attribute or data variable with such a suffix is a localized version of the text in the non-suffixed attribute or variable.
Benefits
Data producers who are required to produce metadata or data in multiple languages, applications that offer multilingual interfaces for viewing or manipulating NetCDF data based on the CF standards, data users who wish to access metadata in the language of their choice
Status Quo
Currently no NetCDF standard offers a standard for localized metadata. However, such standards exist in other metadata formats, such as ISO-19115.
Associated pull request
Not present yet
Detailed Proposal
The addition of a new section to the CF conventions that specifies the following:
A new global attribute, localizations, which will be a space-separated list of paired suffixes and BCP 47 language tags (similar to how cell_methods is formatted); for example: localizations = "default: en-US _fr: fr-CA _es: es-MX";
The use of default instead of a suffix to indicate the default locale of the document
When the localizations attribute is present, attributes and variables may not be named with a suffix unless they are localized versions of non-suffixed attributes or variables
In addition, the following changes would be proposed:
(title, comment, institution, long_name, references, sources and, to be discussed, history and flag_meanings)