cf-convention / discuss

A forum for any discussion about interpretation, clarification, and proposals for changes or extensions to the CF conventions.

Localized metadata in NetCDF files #244

Open turnbullerin opened 1 year ago

turnbullerin commented 1 year ago

Hi Everyone!

So I work for the Government of Canada and I am working on defining the required metadata fields for us to publish data in NetCDF format. We'll be moving a lot of data into this format, so we are trying to make sure we get the format right the first time. The CF conventions are our starting point for metadata attributes.

As the data will be officially published by the Government of Canada eventually, we will have to make sure the metadata is available in both English and French. If the data contains English or French text (not from a controlled list), it needs to be translated too. I haven't found any efforts towards creating a convention for bilingual (or multilingual) metadata and data in NetCDF formats, so I wanted to reach out here to see if anyone has been working on this so we could collaborate on it.

My initial thought is that the metadata should be included in such a way as to make it easy to programmatically extract each language separately. This would allow applications that use NetCDF files (or tools that draw on the CF conventions like ERDDAP) to display the available language options and let the user select which one they would like to see without additional clutter. It should also be included in a way that does not impact existing applications to ensure compatibility.

Of note though is that some data comes from controlled lists where the values have meaning beyond the English meaning. This data probably shouldn't be translated as it would lose its meaning. For many controlled lists, applications can use their own lookup tables to translate the display if they want, and bigger vocabulary lists (like GCMD keywords) can have translations available on the web.

ISO-19115 handles this by defining "locales" (a mix of a mandatory ISO 639 language code, optional ISO 3166 country code, and optional IANA character set) and using PT_FreeText to define one value per locale for different text fields. I like this approach and I think it can translate fairly cleanly to NetCDF attributes. To align with ISO-19115, I would propose two global attributes, one called locale_default and one called locale_others (I kept the word 'locale' in front instead of at the end like in ISO-19115 since this groups similar attributes and I see this is what CF has usually done). The locale_others could use a prefix system (like what keywords_vocabulary uses) to separate different values. I would propose using the typical standards used in the HTTP protocol for separating the language, country, and encoding, e.g. language-COUNTRY;encoding. Maybe encoding and country are not necessary, I'm not sure, I just know ISO included them.

I would then propose using the prefixes from locale_others as suffixes on existing attribute names to represent the value of that attribute in another locale.

For example, this would give us the following global attributes if we wanted to include English (Canada), French (Canada), and Spanish (Mexico) in our locales and translate the title:

  :locale_default = 'en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title';
  :title_fra = 'Titre français';
  :title_esp = 'Título en español';
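For illustration, here is a minimal Python sketch of how an application might read these two attributes, assuming the `prefix:language-COUNTRY;encoding` layout shown above. The function names `parse_locale_entry` and `parse_locales` and the returned dictionary layout are hypothetical and not part of CF or any library.

```python
def parse_locale_entry(entry):
    """Split 'fra:fr-CA;utf-8' into (prefix, language, country, encoding).

    The prefix, country and encoding parts are all treated as optional.
    """
    prefix, sep, tag = entry.partition(":")
    if not sep:                      # no prefix given, e.g. 'en-CA;utf-8'
        prefix, tag = "", entry
    tag, _, encoding = tag.partition(";")
    language, _, country = tag.partition("-")
    return prefix, language, country or None, encoding or None


def parse_locales(locale_default, locale_others):
    """Return {prefix: (language, country, encoding)}; '' marks the default."""
    locales = {}
    for entry in [locale_default] + locale_others.split():
        prefix, language, country, encoding = parse_locale_entry(entry)
        locales[prefix] = (language, country, encoding)
    return locales


locales = parse_locales("en-CA;utf-8", "fra:fr-CA;utf-8 esp:es-MX;utf-8")
print(locales["fra"])
```

An application could then look for `title_` plus each non-empty prefix to find the translated titles.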

I was torn on whether the default locale should define a prefix too. If it did, one could use the non-suffixed attribute name for a combination of languages as the default (for applications that don't support localization); for example:

  :locale_default = 'eng:en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title | Titre français';
  :title_eng = 'English Title';
  :title_fra = 'Titre français';
  :title_esp = 'Título en español';

But then this seems like an inaccurate use of locale_default since the default is actually a combo. Maybe English should be added to locale_others in this case and locale_default changed to something like und;utf-8 or even just use the delimiter like [eng] | [fra] to show the format.

I haven't run into a data variable that needs translating yet, but if so, my thought was to define an attribute on the data variable that would allow an application to identify all the related localized variables (i.e. same data, different locale) and which variable goes with which locale. Something like

  var_name_en:locale = ':var_name';      # locale identified in locale_default
  var_name_fr:locale = 'fra:var_name';   # locale identified in locale_others

Thoughts, feedback, any other suggestions are very welcome!

turnbullerin commented 10 months ago

For consistency, we could also have it have a "blank" suffix which I like better than a keyword like "default" (so it would be locales = ":en-US _sv:sv _esmx:es-MX"; in your example)

turnbullerin commented 10 months ago

PS: @turnbullerin Don't be discouraged by this long process, actual changes to the conventions take time and everyone here is a volunteer

Thanks for the pick me up :) I'm not too discouraged, I work for the Government lol. Change takes time and even if I'm more usually of the approach of "well try something and take good notes, then do it better next time", I recognize a major feature like this to a significant and widely used standard will be both contentious and lengthy to agree on. But it's so worth it :). Plus I get paid to have these discussions at work which is nice.

turnbullerin commented 10 months ago

If the top priority is to support existing software (irrespective of age and provenance) then using underscore to implement localization seems to be the only option. The drawback is that this introduces a new role for the underscore as a delimiter between attribute and locale. Moreover, and importantly, CF would then become even more locked down into the current character set restrictions, while the general netCDF community goes to the other extreme by allowing almost all of Unicode.

I think this is good cause to RECOMMEND but not REQUIRE the A-Za-z0-9_ limitation for localization. It lets groups move forward with a more modern version of the attributes where their technology supports it, but gives them the information they need to understand the impact. It also lets them pick a delimiter that isn't misunderstood by whatever other packages they're using if they don't like what we decide as a recommendation/default.

larsbarring commented 10 months ago

@turnbullerin a couple of comments and questions

@larsbarring Your concept seems very similar to what I was proposing and I think it is great.

Yes, it is your idea, no doubt, I was just making some minor adjustments here and there: credit where credit's due.

Regarding which of the following is best I am not sure: locales = "default:en-US ...."; locales = ":en-US ...."; locales = "en-US ...."; I guess the upper one is easier for humans and the middle is more consistent with the fact that for the non-localized attribute there is simply nothing. The lower one I feel is less attractive, but I can live with either. Anyway, do you suggest that the default is mandatory or optional?

I am not sure that I follow when you write:

... suggest we RECOMMEND starting suffixes with an underscore followed by ASCII letters (A-Z and a-z) for maximum compatibility.

Given the current CF limitation to [A-Za-z0-9_] for variable and attribute names, which I think might take some time to change, should I understand that you suggest that any of the other characters is acceptable (although not RECOMMENDED), e.g. titleZfrca or title9en as a localized title? (I don't think ... :-) Anyway, I take your point regarding the general direction.

turnbullerin commented 10 months ago

Given the current CF limitation to [A-Za-z0-9_] for variable and attribute names, which I think might take some time to change, should I understand that you suggest that any of the other characters is acceptable (although not RECOMMENDED), e.g. titleZfrca or title9en as a localized title? (I don't think ... :-) Anyway, I take your point regarding the general direction.

@larsbarring other than the ":" character, I think it would be acceptable but not recommended practice. It doesn't affect a programmatic interpretation of the attributes, it's just more confusing to human readers. I think people would avoid that anyways. But it would allow things like title__en or title_en_ca to be used. And, if CF ever changes its limitation from A-Za-z0-9_, we won't have to rewrite this paragraph to let people use things like .en-CA as a suffix (but also we aren't dependent on changing that limitation).

Maybe it would be better to say:

we RECOMMEND starting suffixes with an underscore followed by ASCII letters (A-Z and a-z) for maximum compatibility, but suffixes MUST consist of characters allowed for CF attribute and variable names and MUST NOT contain a colon.

Though the last is redundant and perhaps confusing as long as CF doesn't allow colons in attribute/variable names anyways.

As an analogy for why I feel this way, I would note CF doesn't restrict people from doing confusing things in other areas - for example, I can name my variables var1, var2, var3, var4, etc. and this is perfectly CF legal. I don't even need to follow the "spaces are underscores" convention, I can name my variable RelativeHumidity or relativehumidity or relhumid or whatever I feel like (as long as my standard_name is right). I wouldn't say it's a recommended best practice but it doesn't make a file non-compliant.

turnbullerin commented 10 months ago

I'd also add quickly that a REQUIRED format of _[A-Za-z]* still leaves them with lots of room to do silly things - so we're still relying on common sense for human readability. Like the following would still be CF compliant:

locales = "_fr:es";
title_fr = "Spanish Title";

OR

locales = "_suffixOne:fr";
title_suffixOne = "French Title";

We are relying on people to choose suffixes that clearly represent the locale with any system where we let them define a suffix, so my thought is to leave it as open as possible and trust them to do something sensible for human readability (as long as we can parse it).
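The difference between the RECOMMENDED shape and the looser "anything CF allows in a name" rule can be sketched in Python as below. The function name `check_suffix` and the three result labels are made up for this example. The check is purely syntactic, so a misleading but well-formed pair such as `_fr:es` would still pass.

```python
import re

# Characters CF currently allows in attribute and variable names.
CF_NAME_CHARS = re.compile(r"[A-Za-z0-9_]+")
# The RECOMMENDED shape discussed above: underscore followed by ASCII letters.
RECOMMENDED_SUFFIX = re.compile(r"_[A-Za-z]+")


def check_suffix(suffix):
    """Return 'invalid', 'allowed' or 'recommended' for a locale suffix."""
    if not CF_NAME_CHARS.fullmatch(suffix):
        return "invalid"        # covers ':' and any other disallowed character
    if RECOMMENDED_SUFFIX.fullmatch(suffix):
        return "recommended"
    return "allowed"


for s in ["_fr", "_suffixOne", "_fr_CA", "Zfrca", ".en-CA", "fr:"]:
    print(s, check_suffix(s))
```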

Dave-Allured commented 9 months ago

I have started CF #477 to enable period (.) and hyphen (-) in attribute names only. This is in support of my recommended strategy, attribute.lang-country where lang-country is any BCP 47 language tag. This is proposal 3 in Erin's summary above.

#477 is intended to remove one roadblock to adopting proposal 3, or similar strategies that need either the period or hyphen characters. #477 is not intended to express preference or foreclose on any other localization strategies. If you agree with adding these two characters for attribute names only, please post a supporting comment on #477.

larsbarring commented 6 months ago

Hi Erin @turnbullerin, Now that proposal cf-conventions/#477 has been accepted, would you be willing, perhaps together with @Dave-Allured, to prepare an enhancement issue and pull request in the cf-conventions repo based on your good start and the comments in this thread?

I think this would be a very useful extension of the CF Conventions.

Many thanks, Lars

DocOtak commented 6 months ago

Hi All, Just getting back into all the CF things after my long expedition (and the Ocean Sciences meeting).

In my opinion, CF should strongly resist adding something to the standard that requires any programmatic parsing and interpretation of the attribute keys themselves. Complexities of parsing attributes aside, I'm also concerned about "breaking" ERDDAP. At the Ocean Sciences meeting, all the talks/town halls I went to about the technical implementation of the goals of the UN Ocean Decade had ERDDAP featured somewhat heavily (if any data system was mentioned at all) and I think it is set to become the recommended way of serving data in national systems.

rmendels commented 6 months ago

@DocOtak I didn't know we had become so popular!!!! :-)

More seriously, if I remember the lengthy discussion related to this (on a different list, and Bob Simons knows a lot more about it than I do), part of it had to do with problems in ERDDAP code and part had to do with breaking clients (mostly when traversing some structure) as well as reading CDL files. I believe there were a few more examples.

DocOtak commented 6 months ago

@rmendels Kevin O'Brien is quite the advocate.

I didn't really want to say "look at my proposal again" since I'm not too attached to it, but my feeling is that this discussion got stuck on what the best way to mangle attributes is and not the possibility of alternatives.

Would folks (@turnbullerin @larsbarring @Dave-Allured others?) be willing to find time for a call to discuss/make progress?

turnbullerin commented 6 months ago

@DocOtak I am happy to make time for a call!

@larsbarring I'm also happy to work on the enhancement and pull request.

My thoughts haven't changed too much, but I agree with a number of key points made, which I'll outline below as a starting point:

  1. We generally agree the goal of the proposal is worthy: a mechanism for internationalizing attribute values at least is of value (data values seem to have fewer good use cases but may also be necessary)
  2. ERDDAP is growing in usage and popularity, and from discussions with them, making major changes in their supported character set seems challenging at best. From this, I infer that using a mechanism that can be supported by ERDDAP as it is today would be beneficial (enabling downstream work on ERDDAP to focus on localization rather than on expanding the character set).
  3. Localized attributes can fairly easily be extracted for programmatic use to localize a display of the underlying dataset

The discussion is now focused on the technical issues of how to implement this.

With this in mind, mangling the names in any way that requires expanding the character set from CF 1.10 is probably a no-go as it goes against 2 - ERDDAP won't be able to easily support these without significant issues.

This leaves us with two options for implementation for attributes:

  A. Using a suffix or other alteration of the attribute name to identify them using existing character sets.
  B. Another approach, such as the one proposed by @DocOtak

Personally, using variables to group together locale-related global attributes seems counter-intuitive to me - structurally they're in the incorrect place and for someone not familiar with CF's use of them, it could be confusing. I wonder if a reasonable alternative would be to store triples that map an attribute name to a new attribute name in a given locale, e.g. as follows:

    :locale = "en";
    :localizations = "title fr title_fr;summary fr summary_fr;long_name fr long_name_fr";
    :title = "Title";
    :title_fr = "Titre";
    :summary = "Summary";
    :summary_fr = "Sommaire";

Maybe it would make the localizations attribute too long though? I don't know if there's a maximum length - could also compress it a bit by saying "locale1 en_attr1 fr_attr1 en_attr2 fr_attr2 ; locale2..."?
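As a rough sketch of what reading the triple form would involve, the Python below parses the uncompressed `attr locale localized_attr;...` layout from the example. The function name `parse_localization_triples` and the choice of a dictionary keyed by `(attr, locale)` are assumptions made for illustration.

```python
def parse_localization_triples(value):
    """Parse 'attr locale localized_attr;...' into {(attr, locale): localized_attr}."""
    mapping = {}
    for chunk in value.split(";"):
        parts = chunk.split()
        if not parts:
            continue                      # tolerate a trailing ';'
        if len(parts) != 3:
            raise ValueError(f"expected 3 words, got {chunk!r}")
        attr, locale, localized_attr = parts
        mapping[(attr, locale)] = localized_attr
    return mapping


triples = parse_localization_triples(
    "title fr title_fr;summary fr summary_fr;long_name fr long_name_fr"
)
print(triples[("title", "fr")])
```

With this layout the application never inspects the attribute names themselves; it only follows the mapping.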

I'm open to other ideas too! Maybe we can brainstorm other solutions, but I'm still leaning towards a clean and backwards-compatible mangling approach as the easiest to manage.

larsbarring commented 6 months ago

Yes, I am happy to participate in a call. /Lars

turnbullerin commented 2 months ago

I am way behind on setting up the call, sorry :). Life of a national manager.

I looked a bit more at the parsing and processing side of things though, and I am more strongly leaning towards the suffix-based approach but with user-defined suffixes - I think mandating .en-CA as the format for suffixes is what is most likely to break ERDDAP and other platforms, and I think there's little sense in us mandating how names are mangled (since that is what we are caught up in). There were no objections to a suffix-based approach in the ERDDAP thread in terms of complexity (in fact, Bob more strongly favored title_fr_CA as the format). ERDDAP isn't really set up to handle meta variables from what I can see of its source code and I think it would be far more difficult to implement on their end - ERDDAP basically only supports attributes directly on the global dataset or a variable being output to the user and a variable without data isn't well supported in the XML configuration options.

I think by specifying the allowable suffixes and meanings in an attribute itself, we aren't then interpreting the attribute names themselves, merely the presence or absence of specific names (e.g. title_fr_CA only has meaning if _fr_CA is in the locales attribute as a valid suffix, thus we are parsing the attribute content for that suffix and its meaning). I would agree that specific mangling patterns that have to be untangled by the application processing the file without a supporting attribute are too prone to confusion and difficulties with parsing (e.g. title_fr_CA with no locales attribute to explain what _fr_CA means).
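The distinction can be sketched in Python: a name is treated as localized only when it ends with a suffix that the file itself declares. The function name `split_localized_name` and the `declared_suffixes` dictionary (suffix to language tag, as it might be read from the locales attribute) are hypothetical.

```python
def split_localized_name(name, declared_suffixes):
    """Return (base_name, language_tag), or (name, None) if no declared suffix matches.

    declared_suffixes maps suffix -> language tag, as read from the locales attribute.
    """
    # Try longer suffixes first so '_fr_CA' wins over a shorter '_CA'.
    for suffix in sorted(declared_suffixes, key=len, reverse=True):
        if suffix and name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], declared_suffixes[suffix]
    return name, None


declared = {"_fr_CA": "fr-CA"}
print(split_localized_name("title_fr_CA", declared))   # suffix is declared
print(split_localized_name("title_fr_CA", {}))         # no declaration, no meaning
```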

So, given the challenges I foresee ERDDAP having with meta variables, I would propose we move forward with an update based on suffixes. I'll prepare some sample text.

turnbullerin commented 2 months ago

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.

NEW SECTION

TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. English alphabet), and/or other features of the natural language. Locales are defined by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as en-CA for Canadian English in the default script. This section defines the standard pattern for localizing the contents of a NetCDF file.

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.

TBD.1 Localized Files

A "localized file" is one that provides the global attribute localizations. If present, the attribute must contain a space-delimited list of words in the format suffix: language_tag. For example, the string default: en _fr: fr-CA _es: es-MX specifies that the default locale of the file is en, that the suffix _fr indicates content in the fr-CA locale, and that the suffix _es indicates content in the es-MX locale. Suffixes may be any text string allowed in an attribute or variable name, but it is strongly recommended that they be chosen for clarity by making them clearly associated with the locale they represent.
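A minimal Python sketch of reading such a string follows. It assumes each suffix is followed by a colon and then whitespace, as in the example above. The function name `parse_localizations` and its return shape are illustrative only.

```python
def parse_localizations(value):
    """Parse 'default: en _fr: fr-CA' into (default_tag, {suffix: tag})."""
    words = value.split()
    if len(words) % 2 != 0:
        raise ValueError("expected 'suffix: language_tag' pairs")
    default_tag = None
    suffixes = {}
    for key, tag in zip(words[0::2], words[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"missing ':' after {key!r}")
        key = key[:-1]                 # drop the trailing colon
        if key == "default":
            default_tag = tag
        else:
            suffixes[key] = tag
    return default_tag, suffixes


print(parse_localizations("default: en _fr: fr-CA _es: es-MX"))
```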

The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen.

An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section.

Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.
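One way an application might make this choice is sketched below in Python. It is a simplified take on the "lookup" matching scheme of RFC 4647 (part of BCP 47): each preferred tag is shortened one subtag at a time until it matches an available tag, and the default is used otherwise. The function name `choose_locale` and the sample tag lists are assumptions for the example. A full implementation would also need to handle extension and private-use subtags.

```python
def choose_locale(preferred, available, default):
    """Pick the best available language tag for a user's preference list.

    Each preferred tag is shortened one subtag at a time
    ('fr-CA' -> 'fr') until something matches, case-insensitively.
    """
    by_lower = {tag.lower(): tag for tag in available}
    for wanted in preferred:
        subtags = wanted.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
    return default


available = ["en", "fr", "es-MX"]
print(choose_locale(["fr-CA"], available, "en"))
print(choose_locale(["de", "es-MX"], available, "en"))
print(choose_locale(["ja"], available, "en"))
```

Note that under this scheme a user asking for `fr` is not matched to a file that only offers `fr-CA`, so the default would be shown in that case.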

TBD.2 Localized Attributes

Localized attributes are created by appending a locale suffix to the usual attribute name. For example:


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";
    :title = "English Title";        // English title 
    :title_fr = "Titre français";   // French title
    :title_es = "Título en español"; // Spanish title
    :summary = "English Summary";
    :summary_fr = "Sommaire français";
    // omitted Spanish summary means English will be used instead

    double salinity(i);
    salinity:long_name = "Salinity";
    salinity:long_name_fr = "Salinité";
    salinity:long_name_es = "Salinidad";
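The fallback behaviour noted in the comment above (a missing Spanish summary means the English one is used) can be sketched in Python with a plain dictionary standing in for the file's global attributes. The function name `localized_attr` is hypothetical.

```python
def localized_attr(attrs, name, suffix=""):
    """Return attrs[name + suffix] if present, else the default-locale value."""
    return attrs.get(name + suffix, attrs.get(name))


global_attrs = {
    "title": "English Title",
    "title_es": "Título en español",
    "summary": "English Summary",
}
print(localized_attr(global_attrs, "title", "_es"))    # Spanish title exists
print(localized_attr(global_attrs, "summary", "_es"))  # falls back to English
```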

TBD.3 Localized Variables

Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable. Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e. weather_obs[0] is the English text and weather_obs_fr[0] is the French text of the same value).


variables:
    :localizations = "default: en-CA _fr: fr-CA _es: es-MX";

    char weather_obs(i);
    weather_obs:long_name = "Weather Conditions";

    char weather_obs_fr(i);
    weather_obs_fr:long_name = "Observations Météorologiques";

    char weather_obs_es(i);
    weather_obs_es:long_name = "Observaciones Meteorológicas";

data:
    weather_obs = "sunny", "rainy", ...;
    weather_obs_fr = "ensoleillé", "pluvieux", ...;
    weather_obs_es = "soleado", "lluvioso", ...;
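A checker for the "same data type, dimensions and number of elements" rule could look roughly like the Python sketch below. Variables are represented as plain dictionaries rather than read through a netCDF library. The function name `check_localized_variables` and the dictionary layout are assumptions for illustration.

```python
def check_localized_variables(variables, base_name, suffixes):
    """Return a list of problems found among localized versions of base_name.

    variables maps name -> {"dtype": str, "dims": tuple, "data": list}.
    """
    base = variables[base_name]
    problems = []
    for suffix in suffixes:
        name = base_name + suffix
        if name not in variables:
            continue                       # this locale simply is not provided
        other = variables[name]
        if other["dtype"] != base["dtype"]:
            problems.append(f"{name}: data type differs")
        if other["dims"] != base["dims"]:
            problems.append(f"{name}: dimensions differ")
        if len(other["data"]) != len(base["data"]):
            problems.append(f"{name}: number of elements differs")
    return problems


variables = {
    "weather_obs": {"dtype": "char", "dims": ("i",), "data": ["sunny", "rainy"]},
    "weather_obs_fr": {"dtype": "char", "dims": ("i",), "data": ["ensoleillé", "pluvieux"]},
    "weather_obs_es": {"dtype": "char", "dims": ("i",), "data": ["soleado"]},
}
print(check_localized_variables(variables, "weather_obs", ["_fr", "_es"]))
```

A check like this cannot confirm that element 0 of each variable is a translation of the same value; it only catches structural mismatches.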

ADDITION TO APPENDIX A

References
https://www.rfc-editor.org/info/bcp47
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

EDIT NOTES:

  1. I added a note that variables must be the same type, dimensions, and size as each other
  2. I noted the format of cell_methods and updated mine to match

turnbullerin commented 2 months ago

I'd like to especially draw people's attention to the change in Appendix A I put above as there are some open questions there still that have not been answered.

JonathanGregory commented 1 month ago

Dear Erin @turnbullerin

Thanks for your proposal. Although this issue started as a discussion, you're now making a definite proposal to change the convention. Therefore I think it would be appropriate if you began a new issue with this in the conventions repo.

Best wishes

Jonathan

turnbullerin commented 1 month ago

Will do!

JonathanGregory commented 1 month ago

Thanks, @turnbullerin. All interested in Erin's proposal, please comment on #528, and thanks for the discussion up to now.