cf-convention / cf-conventions

AsciiDoc Source
http://cfconventions.org/cf-conventions/cf-conventions
Creative Commons Zero v1.0 Universal

Support localization of natural language attributes and variables #528

Open turnbullerin opened 2 weeks ago

turnbullerin commented 2 weeks ago

Moderator

TBD

Moderator Status Review

None

Requirement Summary

Metadata includes natural language text in several places, notably the title and long_name attributes, and potentially in character data variables. Other metadata standards, such as ISO-19115, support translation of such text, and translation is mandatory in some contexts, such as files generated by the Canadian Government. Standardizing how these elements are specified, in a fashion that is both human-readable and machine-readable, lets users identify metadata in their preferred language more easily and lets computer applications display metadata that matches users' preferences or, where this is not possible, at least apply appropriate accessibility techniques. Compatibility with applications such as ERDDAP, which uses NetCDF files following the CF conventions to create a web interface for selecting and downloading data, is also of key importance. For this reason, we decided not to make the new .fr-CA suffix a required format, as it would not be compatible with ERDDAP; instead, data providers are free to choose suffixes that meet their use case.

Technical Proposal Summary

Based on discussions in https://github.com/cf-convention/discuss/issues/244, the following proposal seemed acceptable: (1) the creation of a new global attribute that maps suffixes to BCP 47 language tags as well as specifying the default language tag in the file, (2) designating that any attribute or data variable with such a suffix is a localized version of the text in the non-suffixed attribute or variable.

Benefits

Data producers who are required to produce metadata or data in multiple languages, applications that offer multilingual interfaces for viewing or manipulating NetCDF data based on the CF standards, and data users who wish to access metadata in the language of their choice.

Status Quo

Currently, no NetCDF convention defines a standard for localized metadata. However, such standards exist in other metadata formats, such as ISO-19115.

Associated pull request

Not present yet

Detailed Proposal

The addition of a new section to the CF conventions that specifies the following:

In addition, the following changes would be proposed:

turnbullerin commented 2 weeks ago

Here is the draft text we were looking at in the discussion thread

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.

NEW SECTION

TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. the Latin alphabet), and/or other features of the natural language. Locales are identified by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as en-CA for Canadian English in the default script. This section defines the standard pattern for localizing the contents of a NetCDF file.

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.

TBD.1 Localized Files

A "localized file" is one that provides the global attribute localizations. If present, the attribute must contain a space-delimited list of entries in the format suffix: language_tag. For example, the string default: en _fr: fr-CA _es: es-MX specifies that the default locale of the file is en, that the suffix _fr indicates content in the fr-CA locale, and that the suffix _es indicates content in the es-MX locale. Suffixes may be any text string allowed in an attribute or variable name, but it is strongly recommended that they be chosen for clarity by making them clearly associated with the locale they represent.
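
As an illustration only, here is a minimal Python sketch of how an application might read the localizations attribute described above. The function name parse_localizations is hypothetical and not part of the proposal, and the sketch assumes that suffixes and language tags never contain whitespace.

```python
def parse_localizations(value):
    """Parse e.g. 'default: en _fr: fr-CA' into (default_tag, {suffix: tag})."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected a list of 'suffix: language_tag' entries")
    default_tag = None
    suffixes = {}
    # Tokens alternate between 'suffix:' and 'language_tag'.
    for key, tag in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"missing ':' after {key!r}")
        key = key[:-1]
        if key == "default":
            default_tag = tag
        else:
            suffixes[key] = tag
    return default_tag, suffixes


default_tag, suffixes = parse_localizations("default: en _fr: fr-CA _es: es-MX")
```

With the example string from the text, default_tag is the default locale and suffixes maps each suffix to its language tag.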

The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen.

An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section.

Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.
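
One possible way to do this, sketched in Python, is a simplified form of the "lookup" matching scheme from RFC 4647 (which is part of BCP 47): the user's preferred tag is progressively truncated until it matches a locale in the file, and the default locale is used if nothing matches. The function name choose_suffix and its arguments are hypothetical; an empty string is used here to mean "use the unsuffixed, default-locale content".

```python
def choose_suffix(preferred, default_tag, suffixes):
    """Return the suffix for the preferred language tag, or '' for the default."""
    # Language tags are compared case-insensitively.
    available = {tag.lower(): suffix for suffix, tag in suffixes.items()}
    available[default_tag.lower()] = ""
    subtags = preferred.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        subtags.pop()
        # Also drop a trailing single-character subtag (e.g. 'x' in 'fr-CA-x-...').
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return ""  # no suitable locale: fall back to the default


suffixes = {"_fr": "fr-CA", "_es": "es-MX"}
french = choose_suffix("fr-CA", "en", suffixes)
german = choose_suffix("de", "en", suffixes)
```

Note that under this lookup scheme a user preference of plain fr does not match fr-CA, because only the user's tag is truncated, not the tags in the file; an application could reasonably choose a more lenient match.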

TBD.2 Localized Attributes

Localized attributes are created by appending a locale suffix to the usual attribute name. For example:


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";
    :title = "English Title";        // English title 
    :title_fr = "Titre français";    // French title
    :title_es = "Título en español"; // Spanish title
    :summary = "English Summary";
    :summary_fr = "Sommaire français";
    // omitted Spanish summary means English will be used instead

    double salinity(i);
    salinity:long_name = "Salinity";
    salinity:long_name_fr = "Salinité";
    salinity:long_name_es = "Salinidad";
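
A reader of such a file could resolve an attribute roughly as follows. This is a sketch in Python under the assumption that the attributes have already been loaded into a plain dict; the helper name localized_attribute is hypothetical.

```python
def localized_attribute(attrs, name, suffix):
    """Return the attribute with the given locale suffix, else the default-locale value."""
    return attrs.get(name + suffix, attrs.get(name))


# Global attributes from the example above (Spanish summary intentionally absent).
attrs = {
    "title": "English Title",
    "title_es": "Título en español",
    "summary": "English Summary",
}
title = localized_attribute(attrs, "title", "_es")
summary = localized_attribute(attrs, "summary", "_es")
```

Here title resolves to the Spanish text, while summary falls back to the default-locale (English) value because no summary_es attribute exists.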

TBD.3 Localized Variables

Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable. Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e. weather_obs[0] is the English text and weather_obs_fr[0] is the French text of the same value).


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";

    char weather_obs(i);
    weather_obs:long_name = "Weather Conditions";

    char weather_obs_fr(i);
    weather_obs_fr:long_name = "Observations Météorologiques";

    char weather_obs_es(i);
    weather_obs_es:long_name = "Observaciones Meteorológicas";

data:
    weather_obs = "sunny", "rainy", ...;
    weather_obs_fr = "ensoleillé", "pluvieux", ...;
    weather_obs_es = "soleado", "lluvioso", ...;
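
The consistency requirement for localized variables could be checked along these lines. This Python sketch assumes a hypothetical in-memory summary of the file in which each variable name maps to a (data type, dimensions) pair; it also treats a suffixed variable with no unsuffixed counterpart as a problem, which is an interpretation of the draft rather than something it states outright.

```python
def check_localized_variables(variables, suffixes):
    """Return a list of problems found among localized variables.

    variables: {name: (dtype, dimensions)}, a hypothetical summary of the file.
    suffixes: non-empty locale suffixes, e.g. ["_fr", "_es"].
    """
    problems = []
    for name, signature in variables.items():
        for suffix in suffixes:
            if not name.endswith(suffix):
                continue
            base = name[: -len(suffix)]
            if base not in variables:
                problems.append(f"{name}: no default-locale variable {base!r}")
            elif variables[base] != signature:
                # Same type and dimensions also implies the same number of elements.
                problems.append(f"{name}: type or dimensions differ from {base!r}")
    return problems


variables = {
    "weather_obs": ("char", ("i",)),
    "weather_obs_fr": ("char", ("i",)),
    "weather_obs_es": ("char", ("j",)),  # deliberately inconsistent
}
problems = check_localized_variables(variables, ["_fr", "_es"])
```

In this made-up example only weather_obs_es is reported, because its dimensions differ from those of weather_obs.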

ADDITION TO APPENDIX A

References

https://www.rfc-editor.org/info/bcp47
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

EDIT NOTES:

  1. I added a note that variables must be the same type, dimensions, and size as each other
  2. I noted the format of cell_methods and updated mine to match

sethmcg commented 2 weeks ago

This looks good. Nice work!

flag_meanings seems like it ought to be locale-aware.

I think it could be hard to make history locale-aware. The history attributes in the files I work with mostly consist of sequences of command-line invocations, which you don't want to translate because then it's not the command that was actually used. When there is a comment, it's usually something that a piece of software added automatically, rather than the file creator, so there's no control over what language it's in. Depending on the tools you use, it seems like it wouldn't be hard to end up with a history attribute that's multi-lingual or in a different language than the one you want to declare as the default.

DocOtak commented 2 weeks ago

I'm struggling with this for reasons I'm having trouble articulating (and that I'm not sure are even valid), but I'll try.

I don't think CF should introduce attribute and variable name modifications as a concept. Section 2.5 starts out with:

This convention does not standardize variable names.

With this proposal, that would hold except in cases of localization. This proposal also places restrictions on variable/attribute names and their suffixes.

I think the ERDDAP "elephant in the room" is important and relevant here. For ERDDAP adoption in Canada, it needs to display localized content, so this proposal must work in a real world application. I'm fearful that this will not work in ERDDAP for some reason we don't know and won't know until an implementation is worked on and the standard has already been codified.

Would there be willingness on the ERDDAP side to implement something that isn't quite in CF yet? And on the CF side, would we be willing to codify whatever ends up working best in ERDDAP? Realizing that someone involved in both communities might need to actually do the coding.

rmendels commented 2 weeks ago

@DocOtak @turnbullerin As I told Erin, the best way to help see whether this can be done is to contribute code to ERDDAP that does so and can be tested. There are ready-to-go development environments, and compiling and testing have been simplified. Chris is only part-time but would, I am sure, be happy to work with people on this.

turnbullerin commented 5 days ago

For an update, the ERDDAP issue now has a link to a working prototype I developed based on this proposal for the title attribute - see https://github.com/ERDDAP/erddap/issues/114