cf-convention / discuss

A forum for any discussion about interpretation, clarification, and proposals for changes or extensions to the CF conventions.

43 stars 6 forks source link

Localized metadata in NetCDF files #244

Open turnbullerin opened 1 year ago

turnbullerin commented 1 year ago

Hi Everyone!

So I work for the Government of Canada and I am working on defining the required metadata fields for us to publish data in NetCDF format. We'll be moving a lot of data into this format, so we are trying to make sure we get the format right the first time. The CF conventions are our starting point for metadata attributes.

As the data will be officially published by the Government of Canada eventually, we will have to make sure the metadata is available in both English and French. If the data contains English or French text (not from a controlled list), it needs to be translated too. I haven't found any efforts towards creating a convention for bilingual (or multilingual) metadata and data in NetCDF formats, so I wanted to reach out here to see if anyone has been working on this so we could collaborate on it.

My initial thought is that the metadata should be included in such a way as to make it easy to programmatically extract each language separately. This would allow applications that use NetCDF files (or tools that draw on the CF conventions like ERDDAP) to display the available language options and let the user select which one they would like to see without additional clutter. It should also be included in a way that does not impact existing applications to ensure compatibility.

Of note though is that some data comes from controlled lists where the values have meaning beyond the English meaning. This data probably shouldn't be translated as it would lose its meaning. For many controlled lists, applications can use their own lookup tables to translate the display if they want, and bigger vocabulary lists (like GCMD keywords) can have translations available on the web.

ISO-19115 handles this by defining "locales" (a mix of a mandatory ISO 639 language code, optional ISO 3166 country code, and optional IANA character set) and using PT_FreeText to define one value per locale for different text fields. I like this approach and I think it can translate fairly cleanly to NetCDF attributes. To align with ISO-19115, I would propose two global attributes, one called locale_default and one called locale_others (I kept the word 'locale' in front instead of at the end like in ISO-19115 since this groups similar attributes and I see this is what CF has usually done). The locale_others could use a prefix system (like what keywords_vocabulary uses) to separate different values. I would propose using the typical standards used in the HTTP protocol for separating the language, country, and encoding, e.g. language-COUNTRY;encoding. Maybe encoding and country are not necessary, I'm not sure, I just know ISO included them.

I would then propose using the prefixes from locale_others as suffixes on existing attribute names to represent the value of that attribute in another locale.

For example, this would give us the following global attributes if we wanted to include English (Canada), French (Canada), and Spanish (Mexico) in our locales and translate the title:

  :locale_default = 'en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title';
  :title_fra = 'Titre française';
  :title_esp = 'Título en español';

I was torn if the default locale should define a prefix too, if it did, it would let one use the non-suffixed attribute name for a combination of languages as the default (for applications that don't support localization); for example:

  :locale_default = 'eng:en-CA;utf-8';
  :locale_others = 'fra:fr-CA;utf-8 esp:es-MX;utf-8';
  :title = 'English Title | Titre française';
  :title_eng = 'English Title'
  :title_fra = 'Titre française';
  :title_esp = 'Título en español';

But then this seems like an inaccurate use of locale_default since the default is actually a combo. Maybe English should be added to locale_others in this case and locale_default changed to something like und;utf-8 or even just use the delimiter like [eng] | [fra] to show the format.

I haven't run into a data variable that needs translating yet, but if so, my thought was to define an attribute on the data variable that would allow an application to identify all the related localized variables (i.e. same data, different locale) and which variable goes with which locale. Something like

  var_name_en:locale = ':var_name';      # locale identified in locale_default
  var_name_fr:locale = 'fra:var_name';   # locale identified in locale_others

Thoughts, feedback, any other suggestions are very welcome!

czender commented 1 year ago

Interesting idea. If you'd like more input/discussion, this could form the basis for a breakout at the upcoming 2023 CF Workshop. ..

turnbullerin commented 1 year ago

oh thats cool - I can't find any info on that yet, I guess more info will be coming later?

ethanrd commented 1 year ago

Hi Erin - The dates for the 2023 CF Workshop (virtual) were just announced (issue #243). There has also been a call for breakout session proposals (issue #233). Further information will be broadcast here as well so everyone watching this repo will get the updates. A web page for the workshop will be added to the CF meetings page in the next month or two.

Zeitsperre commented 1 year ago

Hi @turnbullerin and others,

I wanted to echo my interest in seeing a metadata translation convention come about from the CF Conventions. My team and I have been developing some implementations of metadata translations to better support our French-speaking users, as well as open the possibility of supporting other language translations for climate metadata.

One of our major open source projects for calculating climate indicators (xclim) has an internationalization module built into it for conditionally providing translated fields based on the ISO 639 Language code found within the running environment's locale or set explicitly. For more information, here is some documentation that better describes our approach:

We would love to take part in this discussion if there happens to be a session in October.

Best,

Dave-Allured commented 1 year ago

Erin, I like the general direction of your localization proposal. I would like to suggest a simplified strategy. I do not see a need for those global attributes or the level of indirection represented in them. In short, I suggest simply add ISO-19115 suffixes to standard CF attribute names, as needed. Here are a few more details.

The general form would be attribute_name.lang-country.
lang-country is the two-part locale, exactly as prescribed by ISO-19115.
country is optional, just like you described.
Never use the third element, IANA charset. The entire netCDF name space is fixed on UTF-8, so charset here is unnecessary.
The entire ISO suffix is optional. With that, you are left with just three possible basic forms, and no other complications. The ISO suffixes can now be easily be machine-recognized by aware software:
- attribute_name
- attribute_name.lang
- attribute_name.lang-country

More details:

Define English as the core language of CF controlled vocabulary.
Define English as the universal default locale. This is basically saying, keep the status quo as the default.
When localization is desired, add the lang suffix.
Avoid country except when the data publisher deems it necessary.
attribute_name and attribute_name.en are fully redundant and should not both be applied together.
This localization strategy is intended equally for both CF-controlled and user-defined attribute names.
The primary application is to provide bilingual support. For example, under this scheme you might have long_name.en and long_name.fr on the same data variable.

The choice of the primary delimiter will be controversial. I like period "." for visual flow and general precedent in language design. Some will hold out for underscore as the CF precedent. I think underscore is overused in CF. In particular, the ISO suffix deserves some kind of special character to stand out as a modifier.

The general use of special characters such as "." and "-" is part of proposal https://github.com/cf-convention/cf-conventions/issues/237.

turnbullerin commented 1 year ago

@Dave-Allured

Thanks for your feedback!

I think there is value in the two attributes.

Defining English (and which English, eng-US, eng-CA, eng-UK, etc.) as the universal default is very Anglo-centric. There is a clear use case for datasets produced in other countries to have a primary language that is not English, and documenting it is valuable to inform locale-aware applications processing CF-compliant files. Not everyone will want to provide an English version of every string. So having an attribute that defines the default locale of the text strings in the file is still useful I feel, but perhaps we could define the default if not present as "eng" (no country specified) so that it can be omitted in many cases.

For the other locales, I think it helps applications and humans reading the metadata to know what languages are in the file. If we did not list them, applications would need to be aware of all ISO-639 codes and check each attribute if it exists with any mix of country/language code suffix to build a list of all languages that exist in the metadata. Having a single attribute list them all has a lot of value in my opinion. In unilingual datasets, it can of course be omitted.

This also raises the question on if we should use ISO 639-1 or ISO 639-2/T or ISO 639-3. ISO 19115 allows users to specify the vocabulary that codes are taken from, but if we were to specify one I would recommend ISO 639-2/T for language and ISO 3166 alpha-3 for country (this aligns with the North American Profile of ISO-19115). Alternatively, we could just specify the delimiter and let people override the vocabulary for language and country codes in attributes if they want.

I am torn on the delimiter - I see the value in what you propose, but I would not want to delay this issue if #237 is not adopted quickly and I foresee some technical issues adopting it even if it is agreed to (for example, the Python NetCDF4 library supports attributes as Python variables on the dataset or variable objects [and thus are restricted to [A-Za-z0-9_]; allowing arbitrary names would require them to make a significant change before the standard could be adopted; see https://unidata.github.io/netcdf4-python/#attributes-in-a-netcdf-file).

I do like the idea of standardizing the suffixes though and if we can agree on a format, I support that wholeheartedly. I would propose _xxxYYY where xxx is the lower-case ISO 639-2/T code and YYY is the ISO 3166 alpha-3 country code. If #237 is adopted, .xxx-YYY is also a good solution I think. We could include both for compatibility with applications and libraries that won't support #237 right away if adopted.

Also, I fully agree on UTF-8. It supports all natural languages as far as I know, so there should be no issue with using it as the default encoding. However, I do note that the NetCDF standard allows for other character sets - I guess we are then just saying that all text data must be in UTF-8 (i.e. _Encoding="utf-8")?

In terms of display, I agree with you that locale-aware applications (given a country and language code they should display in) should use the attributes in the following order:

attribute_langCOUNTRY
attribute_lang
attribute

Dave-Allured commented 1 year ago

Erin, thank you for your very thoughtful reply.

Anglo-centric: Yes I was thinking about that when I wrote down my initial thoughts, but I decided to test the waters. I am glad to have triggered that direct conversation. English is a dominant language in the science and business worlds. However, this CF enhancement is a great opportunity for constructs to level the playing field, within the technical context of file metadata.

I agree immediately to the value of a global attribute that sets the default language for the current data file, such that all string attributes with no suffix are interpreted in the specified language. I leave the name of such attribute up to you and others. Yes, keep the default as English if the global attribute is not included.

larsbarring commented 1 year ago

I think adding support for multiple languages to selected CF attribute values would be a great addition. As I have absolutely zero insight into the technical aspects please bear with me if I am asking a stupid question: If this functionality is implemented without an universal default language does it mean that all string valued attributes are expected to follow a specified locale? If so, how would CF attributes that only can take values from a controlled vocabulary be treated, e.g. units, standard_name, cell_methods, axis, calendar?

Thanks, Lars

Dave-Allured commented 1 year ago

List of languages present: It really is no problem to scan a file's metadata, pull off all the language specifiers, and sort them into an organized inventory. This is the kind of thing that can be programmed once, added in to a convenience library, and then used by everybody. If you have a redundant inventory attribute, you immediately have issues with maintenance and mismatches. Such issues will persist forever.

Dave-Allured commented 1 year ago

ISO vocabulary: It would be really nice if CF could settle on single universal choices for thelang and country vocabs. I really like the cadence of [dot] [two] [dash] [three], and no extra steps for alternative vocabularies. Failing that, I would suggest defer to an ISO 19115 self-identifying scheme if there is such a thing. I suppose there could be a vocabulary identifier global attribute, but I would like to avoid that if possible.

turnbullerin commented 1 year ago

@larsbarring I think we would apply this only to natural language attributes, not to those taking their values from a controlled vocabulary.

So title, summary, acknowledgement, etc. are translated; units, standard_name, cell_methods, etc. are not.

Perhaps some form of identification of those would be useful?

turnbullerin commented 1 year ago

@Dave-Allured

I think identifying what is and is not a language specifier might be challenging. Assuming attribute_xxx[YYY] as an algorithm, I would write this:

Look at every attribute name.
If it has an underscore, take the text from the last underscore to the end of the string and continue. Otherwise next attribute name (not a locale).
If it is not either 3 or 6 letters long and either all lower case (if 3) or first three lower case and last three upper case (if 6), then continue (not a locale)
Check that the first three are a defined ISO 639-2/T code and the last three (if present) are a defined ISO 3166 alpha-3 code (requires a list of all valid codes that needs to be updated as ISO makes changes to those vocabularies). If not, continue (not a locale)
Assemble and deduplicate the results

Versus, with an attribute, it is:

Read the attribute and split it by spaces.

I think, while it can be done, having an attribute with all languages in the file greatly simplifies the code for understanding which languages are present (which is the point of some of the metadata, like we could calculate geospatial_max_lon and geospatial_min_lon but we have those for convenience). It also ensures attributes which happen to look like valid localized attributes are not actually treated as such.

Dave-Allured commented 1 year ago

Identifying: Yeah. ;-) Add this to my list of reasons for dot notation.

Suffixes ... We could include both ...

I see great value in settling on a single, optimal syntax up front, and not providing alternative syntaxes. I also value adopting an exact syntax from ISO 19115, rather than having a new CF creation. You already see my preference for dot and dash, and my reasons. I think it is worth holding out for the optimal syntax. I see a growing interest in character set expansion for CF.

The classic netCDF API's included special character handling from the moment of their creation. Python can adapt.

I like 2-letter ISO 639 language codes, but 3-letter will be okay too. Choose one. I defer to your greater expertise on the various ISO flavors. I am not well studied there.

Dave-Allured commented 1 year ago

Erin, take everything I said as mere suggestions. I do not want to bog you down with too much technical detail, right before the upcoming workshop. Good luck!

turnbullerin commented 1 year ago

So, after today's workshop on this, here's a rough draft of what I think we should include for the moment. It is still open for discussion

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of variables shall reference section #TBD for details on how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of attributes shall reference section #TBD for details on how to do so.

NEW SECTION

TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (English), the country (Canada), the script (English alphabet), and other features. This section defines a standard pattern for localizing a file, which means to specify the default locale of a file and for providing alternative versions of such attributes or variables in alternative locales using a suffix. The use of localization is OPTIONAL. If localization information is not provided, applications SHOULD assume the locale of the file is en.

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. Locales are defined by a "locale string" that follows the format specified in BCP 47.

Localized files MUST define an attribute locale_default containing a locale string. All natural language attributes and variables without a language suffix MUST be written in this language. The default language of a file should be the one with the most complete set of attributes and variables in that particular language and, ideally, the original language the attributes and variables were written in.

Localized files with more than one locale MUST define an attribute locale_others which is a blank separated list of locale strings. Natural language attributes and variables MAY then be localized by creating an attribute or variable with the same name but ending in [LOCALE], replacing LOCALE with the relevant locale string. Any natural language attribute or variable ending in [LOCALE] must be provided in the given locale.

Applications that support localized NetCDF files SHOULD apply BCP 47 in determining the appropriate content to show a user if the requested locale is not available. If one cannot be found, the default value to display MUST be the attribute without suffix if available. Supporting localization is OPTIONAL for applications.

The following is an example of a file with Canadian English (default), Canadian French and Mexican Spanish with the title and summary attribute translated but missing Spanish summary.

:locale_default = "en-CA";
:locale_others = "fr-CA es-MX";
:title = "English Title";
:title[fr-CA] = "Titre française";
:title[es-MX] = "Título en español";
:summary = "English Summary";
:summary[fr-CA] = "Sommaire française";

An application supporting localization would display the following:

Selected Language	en-CA	fr-CA	es-MX	jp
Title	English Title	Titre française	Título en español	English Title
Summary	English Summary	Sommaire française	English Summary	English Summary

ADDITION TO APPENDIX A

Add a column for "Locale-Aware" (Y or N) or maybe add a new data type of S for non-locale-aware string and S-L for locale-aware string?
Locale-aware string attributes:
- comment
- flagmeanings (? they have underscores but would be helpful ?)_
- history (? I feel like this will be complex since they are automatically updated but having a translated version of the history would be helpful ?)
- institution
- long_name
- references
- source
- title

References https://www.rfc-editor.org/info/bcp47 https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

DocOtak commented 1 year ago

@turnbullerin I did more research after our post meeting discussion:

Here is the W3C Language Tags and Locale Identifiers for the World Wide Web which basically summarized BCP 47, somewhat interesting, it requires that you cite BCP 47 and not any of the underlying RFCs.
In the W3C document, the language subtags (e.g. "en", "fr") comes from a controlled list published by the IANA (the IP address authority). This list is extensive, I searched for a few of the languages on Wikipedia's List of endangered languages in Canada and every language I checked had an entry in the IANA vocabulary. I even checked some of the indigenous languages I've personally been exposed to (Kumeyaay, Hawaiin Pidgin, Louisiana Creole) all had entries. Even Star Trek's Klingon language has an entry. The W3C document specifically mentions this list to avoid ambiguity as to which ISO 639 to use (and I think remaining compatible with it)
Wikipedia has a nice summary of these IETF language tag and how the various RFCs and everything relates. Extension U might be of interest.

DocOtak commented 1 year ago

Here is a CDL strawman for what I was asking about regarding namespacing:

netcdf locale {

// global attributes:
        :locale_default = "en-CA" ; // how to interpret non namespace attrs
        :locale_others = "fr-CA, es-MX" ; // format that matches the Accept-Language priority list in HTTP
        :title = "English Title" ;
        string :fr-CA\:title = "Titre française" ; // the : is escaped by nc dump, I made this netCDF file with python
        string :es-MX\:title = "Título en español" ; // and the netCDF4 python library forces string type if non ASCII code points exist
}

I think @ethanrd said there was an attribute namespace discussion, my quick searching couldn't find it. I would suggest that : becomes a reserved character in CF for locale in attribute names.

Happy for more discussion on this at tomorrow's (or Thursday's session). I also have some code I'd like to share.

turnbullerin commented 1 year ago

Will update to cite BCP 47 explicitly - I imagine that's so that if the underlying RFCs change, the reference doesn't have to change. I think the IANA list is fine (I imagine it's what the RFCs refer to) and we can include a link.

Rather than following the Accept-Language in HTTP, I think we should match the current standard for lists in CF (space-delimited, no commas).

Here's a (very old) discussion I found on namespacing: https://cfconventions.org/Data/Trac-tickets/27.html

Personally I find namespacing for languages confusing, namespacing is usually to group things of a common type rather than a more specific version of a thing. Instead of namespacing at the beginning, maybe instead we could reserve a trailing set of square brackets for containing a locale? Like title[fr-CA] (kinda looks like xpath then)? As long as the unicode issue is resolved and going in soon - if it's rejected, maybe we can just replace the hyphens with underscores (so title_fr_CA).

DocOtak commented 1 year ago

@turnbullerin Adding this here so it isn't lost in the zoom chat.

I coded up some examples using python and xarray (the ncdump CDL is at the bottom) https://github.com/DocOtak/2023_cf_workshop/blob/master/localization/localized_examples.ipynb

My takeaway from the unicode breakout was that the proposal will not be rejected, but details need to be worked out. So we can expect any of the options that use attribute names outside what is currently allowed to be OK in the future.

turnbullerin commented 1 year ago

Thanks for the coding example!

I was looking into what ERDDAP supports and apparently it only supports NetCDF attributes that follow the pattern [A-Za-z_][A-Za-z0-9_]*. I will flag this to them to see if we can gain some traction for updating that fairly quickly as it will be a showstopper until it is for me personally. While maybe we should consider that there might be other libraries that will choke on a full Unicode attribute name, I'm not sure we should be making decisions solely based on what libraries have chosen to do (especially when it doesn't align with what NetCDF allows to start with).

datasets.xml error on line #184
While trying to load datasetID=cnodcPacMSC50test (after 1067 ms)
java.lang.RuntimeException: datasets.xml error on or before line #184: In the combined global attributes,    attributeName="publisher_name[en]" isn't variableNameSafe. It must start with iso8859Letter|_ and contain only iso8859Letter|_|0-9 .
 at gov.noaa.pfel.erddap.dataset.EDD.fromXml(EDD.java:486)
 at gov.noaa.pfel.erddap.LoadDatasets.run(LoadDatasets.java:364)
Caused by: java.lang.RuntimeException: In the combined global attributes, attributeName="publisher_name[en]" isn't variableNameSafe. It must start with iso8859Letter|_ and contain only iso8859Letter|_|0-9 .
 at com.cohort.array.Attributes.ensureNamesAreVariableNameSafe(Attributes.java:1090)
 at gov.noaa.pfel.erddap.dataset.EDD.ensureValid(EDD.java:829)
 at gov.noaa.pfel.erddap.dataset.EDDTable.ensureValid(EDDTable.java:677)
 at gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.<init>(EDDTableFromFiles.java:1915)
 at gov.noaa.pfel.erddap.dataset.EDDTableFromNcFiles.<init>(EDDTableFromNcFiles.java:131)
 at gov.noaa.pfel.erddap.dataset.EDDTableFromFiles.fromXml(EDDTableFromFiles.java:501)
 at gov.noaa.pfel.erddap.dataset.EDD.fromXml(EDD.java:472)
 ... 1 more

That said, they have to consider other metadata formats as well, so there might be restrictions in those.

MathewBiddle commented 1 year ago

From the ERDDAP docs

destinationNames MUST start with a letter (A-Z, a-z) and MUST be followed by 0 or more characters (A-Z, a-z, 0-9, and _). ('-' was allowed before ERDDAP version 1.10.) This restriction allows data variable names to be the same in ERDDAP, in the response files, and in all the software where those files will be used, including programming languages (like Python, Matlab, and JavaScript) where there are similar restrictions on variable names.

turnbullerin commented 1 year ago

@MathewBiddle yeah, that's going to be an issue - that said, https://github.com/cf-convention/cf-conventions/issues/237 has identified several very good use cases where these restrictions are not reasonable for the description of scientific variables (notably some chemistry names that include apostrophes, dashes, and commas) so I don't think that is going to block this change.

MathewBiddle commented 1 year ago

I see you created an issue in the ERDDAP repo, so I'll comment over there on the specifics for ERDDAP.

I just need to say that this is a fantastic proposal and I'm glad to see such a robust conversation here.

turnbullerin commented 1 year ago

After discussions with the ERDDAP people, I think a full Unicode implementation is going to take a long time and I suspect there are other applications out there who will also struggle to adapt to the new standard. There are a lot of special characters out there that have special meanings ([] is used as a hyperslab operator in DAP for example) and I'm concerned about interoperability if we do something that greatly changes how names usually work.

I would propose that we then stick to the current naming convention for attributes and variables in making a proposal for localization (possibly using the double underscore to make it clearly a separate thing) for now since it would maximize interoperability with other systems that use NetCDF files. We could keep the prefix system or we could just use the locale but replacing hyphens with underscores (so title_en_CA and title_fr_CA).

Dave-Allured commented 1 year ago

Here are some further suggestions.

BCP 47 is an excellent choice for the referred standard. It was designed for data applications such as netCDF, among other things.
In the CF spec, the first mention should use a full reference such as "IETF BCP 47 language tags" as demonstrated in the Wikipedia title. Following references should be the shorthand BCP 47.
BCP 47 references other standards for elements such as language and region. Therefore, do not mention these other standards in the proposed CF text. Including would be distracting and might conflict with future evolution of BCP 47.
By all means, show a few examples.
BCP 47, RFC 5646, 4.1 Choice of Language Tag includes recommendations for minimizing tags. IMO this is important enough to be paraphrased in CF.

"A subtag SHOULD only be used when it adds useful distinguishing information to the tag. Extraneous subtags interfere with the meaning, understanding, and processing of language tags."
Use the exact BCP 47 syntax in the content of thelocale attribute, and related attributes. These are string contents, not the attribute names themselves. In particular, keep using the ASCII hyphens, as prescribed. E.g. locale = "fr-CA".
Tags attached to attribute names continue to be a difficult topic. Let us consider that to be a side conversation.

turnbullerin commented 1 year ago

@Dave-Allured excellent points. I will rewrite as suggested and will shift the text to its own repo here so we can do a pull request when we're done.

After thinking about this a lot, I think I'm seeing some good real use cases for why one might not want to follow a particular naming convention - in certain contexts, some characters might be more challenging to use and predicting them all is difficult (see my post on the Unicode thread for reserved characters in different contexts). Making what I think of as a fairly core feature of metadata (multilingualism) dependent on Unicode support or even broader US-ASCII support is maybe not the best choice. Downstream applications relying on NetCDF files might specify their own standard. That said, using an alternative naming structure like [en] or .en makes it fairly clear that it isn't part of the variable name and follows NetCDF core rules, so I do like it. I just am concerned about the interoperability.

My suggestion to resolve this would be to define the default behaviour suffixes like [en] or .en and allow users to alter the suffixes by providing a map instead of a list in locale_others. In support of that, I would ban colons and spaces in suffixes and locales (which I don't think BCP 47 allows for anyways) for clarity. So, for example (using the [en] pattern as the default without prejudice here), these three configurations would be valid:

# Example 1
:locale_others = "fr";
:title[fr] = "French Title";

# Example 2
:locale_others = ".fr: fr";
:title.fr = "French Title";

# Example 3
:locale_others = "_fr: fr";
:title_fr = "French Title";

The code for it, in Python, would be something like

def parse_locale_others(other_locales: str) -> dict[str, str]:
    locale_map = {}
    pieces = [x for x in global_attributes['locale_others'].split(' ') if x != '']
    i = 0
    while i < len(pieces):
        if pieces[i][-1] == ":":
            locale_map[pieces[i][:-1]] = pieces[i+1]
            i += 2
        else:
            locale_map[f"[{pieces[i]}]" = pieces[i]
            i += 1
    return locale_map

def localized_title(metadata: dict[str, typing.Any]) -> dict[str, typing.Optional[str]]:
    default_locale = metadata['locale_default'] if 'locale_default' in metadata else 'en'
    other_locales = parse_locale_others(metadata['locale_others']) if 'locale_others' in metadata else {} 
    titles = {
        default_locale: metadata['title'] if 'title' in metadata else None
    }
    for locale_suffix in other_locales:
        localized_title_key = f"title{locale_suffix}"
        titles[other_locales[locale_suffix]] = metadata[localized_title_key] if localized_title_key in metadata else None
    return titles

Edit: We can also add text strongly suggesting people use the default unless there is a good reason not to.

turnbullerin commented 1 year ago

@Dave-Allured @DocOtak pinging you two since you were the most active contributors. I've updated the draft and spun it into a repo so we can do a pull request when we're ready:

https://github.com/turnbullerin/cf-conventions

The changes are in Appendix A, Chapter 2, and the new Chapter 10.

aulemahal commented 1 year ago

Hi all,

I am very happy that this issue is going forward! I am one of the core developers of xclim and we have been supporting something similar for a few years now (see example here and internal mechanism here). We use the "{attribute}_{locale}" syntax for attribute names. As we are a team based in Montréal, Can, we usually work with french, and that's the only language xclim's indicators can have translation for by default.

I have a few comments on the proposed convention.

Suffix

Given that the convention is to support having suffixes in variable names too, I would favor the underscore over the dot. Many programming languages (and especially Python) use the dot as "level" separator in names which would interfere here. For example, with opening a netCDF with xarray :

ds.tas  # gets the "tas" variable of the dataset
ds.tas.fr  # Would fail with AttributeError : "tas" has no attribute "fr"

Moreover, from this thread it seems some existing implementations of netCDF don't support dots in attributes and variable names ? Using an underscore might make this convention easier to implement and accelerate it's usage ?

I understand how the underscore is more ambiguous as it's already used in some CF attribute names, but with my (limited, I must say) knowledge, I fear the technical problems of the dot would outweigh that ambiguity. As the suffixes are already declared in locale_others, the ambiguity stays only for the very few cases where the suffix is also a real english word (_it maybe?).

Not a strong opinion though :).

`locale_default`

I'm not sure what's the use of that attribute. It does structurally make sense in the proposed convention, but isn't the current "default locale" already english ? Is it currently considered CF-compliant to write any of the "natural language" attributes in another language than english ? Attributes names, standard names, region names, the documentation, the netCDF implementation, this thread, almost everything is in english only.

I would suggest making the implicit explicit and simply stating the non-localized attributes are to be in english[^1].

Thanks to @turnbullerin and all others for this work! I promise xclim will try to be the first app to implement the new convention.

[^1]: And I am making this suggestion as someone from a culture known for its sensitivity to language matters. Que mes ancêtres me pardonnent. ;) .

turnbullerin commented 1 year ago

@aulemahal thanks for the feedback! Also hello fellow Canadian (I'm based in Ottawa)

Personally, I agree with the underscore but also I note the complexities of representing other locale strings (like the more specific fr-CA for Canadian French vs fr-FR for France French, or different scripts for Chinese). The hyphen is also an issue in many instances (e.g. variable_fr-CA is a problem, though a hyphen to underscore conversion is possible, i.e. variable_fr_CA).

Between ERDDAP (and DAP2 itself) and xarray, I think we have some great use cases on why we at least need an alternative to the dot or square bracket syntax but given that NetCDF itself allows dots, hyphens and other such characters in variable names, my thought was to have a default practice based on what NetCDF will support, but with a mechanism for changing the pattern in a programmatically identifiable way.

The CF specification specifically allows natural language text attributes not from controlled lists and variables to be in any language (see 2.3 Naming Conventions: https://cfconventions.org/cf-conventions/cf-conventions.html#_naming_conventions), so locale_default is going to be the mechanism by which this language is identified. It is certainly common for it to be English though and probably a good assumption if someone didn't specify locale_default. That said, I think we should update 2.3 as well to highlight this change. The standard names and other controlled lists (like calendars, etc.) will continue to be in English only.

larsbarring commented 1 year ago

Freely admitting that I am not nearly an software developer or an expert on any of the matters dealt with here, I nevertheless spent an hour or two playing around with some of the concepts discussed here. The reason being that I think this a really worthy use-case where it might be justified/necessary to expand the character set allowed for attribute names (I am deliberately excluding variable names here). Moreover if a non-expert manages to find something then there might be some promise ;-)

Anyway I took the test strings @DocOtak used in the linked code and tried it with NCO

declare -a a
a[1]="locale_default,global,c,c,en-CA"
a[2]="locale_others,global,c,c,""fr-CA es-MX jp tlh"""
a[3]="title,global,c,c,""English Title"""
a[4]="title.fr-CA,global,c,c,""Titre française"""
a[5]="title.es-MX,global,c,c,""Título en español"""
a[6]="title.jp,global,c,c,""日本語のタイトル""" 
a[7]="title.tlh,global,c,c,""Heghlu’meH QaQ jajvam"""

cp test_loc.nc test_loc3.nc
for ((i = 1; i <= ${#a[@]}; i++)); do
   echo "${a[$i]}"
   ncatted -h -a "${a[$i]}" test_loc3.nc
done

and it works as expected with ncdump and ncview (of course!). Before that I tried with the "title[fr-CA]" variant and it worked too, but I had some rookie problems getting the " ' \ characters right in bash, so I personally found this variant neater.

More interesting is that I could without problem read the resulting file into Matlab (R2019a), as well into Iris. And I have asked colleagues to check with arcGIS and QGIS if/when they have some time left to spend.

EDIT: And I got a quick response that at least they could read in the file to both arcGIS and QGIS and display the field. If someone else wants to try out with their favorite tool or system I include the file (~3.8 Mb) here below with .txt added to allow uploading.

test_loc3.nc.txt

sethmcg commented 1 year ago

The reason being that I think this a really worthy use-case where it might be justified/necessary to expand the character set allowed for attribute names (I am deliberately excluding variable names here)

In my experience, attribute names and variable names both get used in very similar contexts when programming code to manipulate netcdf contents, so I would say that all of the concerns expressed here and in #237 about allowing characters like whitespace or special characters in variable names also apply to attribute names.

However, in the example above, the only additional characters used in attribute names are . and -, and those are among the ones that I would regard as the most reasonable to add to the list of allowed characters. I am much more hesitant about [] and :, and would prefer not to add those.

With regard to attribute values, I see no technical issues with allowing any unicode that creates printable characters, so those examples all look just fine to me. (Although if people want to put attributes with values like "☁❄⌛✈😀" in their files, they probably deserve whatever they get as a result...)

larsbarring commented 1 year ago

@sethmcg in essence I very much agree with what you write regarding being cautious in expanding the character set, e.g by using a white list as you suggested. We should do so only when there is a concrete and worthy use case and after careful consideration of pros and cons. And as I wrote I think localization is such an example.

IF there is convergence towards that the format based on . and - would solve the problem of localization (the "pros"), then the question what concrete "cons" there might be (I think that we should try to be as concrete as possible here). That is why I spent the time cooking up the test file and trying it on various software that I have access to. And I was pleasantly surprised that Matlab as well as ArcGIS and QGIS had no problem (the two GIS were perhaps not tested in depth) , despite what @turnbullerin wrote. And Erin showed that using netCDF4 directly in python also worked.

Whether the specific need to allow a few new characters in attribute names also means that the same characters should be allowed in variable names is, I suggest, a matter to discuss in cf-conventions/#237.

By "publishing" the test file (link to file) I invite others to do the same (such tests can of be done using other means too).

turnbullerin commented 1 year ago

@larsbarring @sethmcg

I think the biggest con I've come up with is that ERDDAP won't allow it (ERDDAP uses CF compliant NetCDF files as a basis for providing a DAP2 web service and other data conversion tools). Since the number of output data types are so broad, they have in general only allowed alphanumeric characters and the underscore in attribute and variable names and are hesitant to change it since the impact can be so broad.

Personally, for me, I worry about introducing a major feature like localization in a manner that disrupts existing tools that might rely on the current convention for attribute and variable naming. I think predicting all the downstream consequences is difficult.

To address this, I would propose we either make the default _language_COUNTRY_etc (replace hyphens with underscores in the locale tag) or we force people to specify their own suffix in the locale_others attribute (e.g. _fr: fr-CA to make the suffix _fr). I like the last one because it means we don't have to debate what an appropriate suffix is - people can do what works best for their use case while still having a programmatic method of locating and interpreting them.

I see a lot of the good reasons for introducing new characters (to make them clearly separate from existing names) but I think concerns over breaking existing tools are a stronger argument. In addition, letting users specify the locale suffix allows existing use cases like @aulemahal 's to make minimal changes (simply adding locale_default and locale_others will be sufficient to bring them into compliance) and also lets users adapt to their own specific circumstances going forward. If the proposal to expand to full unicode support passes, then either .lang-COUNTRY or [lang-COUNTRY] can be used as one sees fit without changing the section we introduce here.

rmendels commented 1 year ago

@turnbullerin @larsbarring @sethmcg

" ERDDAP won't allow it ". I think the discussion that was had with us on this topic was a little more nuanced then being presented in this statement. It was pointed out that if required, rather than allowed, that it would break a number of things in ERDDAP right now. Also, since ERDDAP goes over the wire, and not reading a local file, even if it didn't break a number of things, the amount of coding and testing we would have to do to insure that there is nothing insecure in the request, and to differentiate valid names from other things, would be a lot, and that is a lot of burden on us - and with all the security concerns we get we need to be triple certain that nothing we do causes a security breach . Also, while we haven't tested it, clients in many cases access or write things using structures like download.variable_name (similarly with attributes which might be like sst.attribute_name), and these may well be broken in some cases. I think it is really important to look carefully at might be broken first, rather than after the fact, and if things will be broken how much effort would it take to fix that and on whom the burden will fall to fix what is broken.

Unless someone gives us endless resources for programming (and maybe not even then) it is highly unlikely ERDDAP will change its restrictions on variable names anytime soon. As it is we have a long list of requests from users for new features that would provide more bang for the buck than what it would take to implement this fully, such as adding support for projections.

So the tl;dr version is we fully understand why this proposal has been put forward and its potential benefits in some use cases, but if required, rather than allowed, it would put too large of a burden on us and is very unlikely to happen anytime soon.

rmendels commented 1 year ago

@turnbullerin @larsbarring @sethmcg

I should add that you can see the full discussion here.

https://github.com/ERDDAP/erddap/issues/114

Plus we try to be as software agnostic as possible, so if things break in even one major application that is presently working just fine, that is a real disincentive to make a change.

Dave-Allured commented 1 year ago

I would like to simplify by making it explicit that internationalization of variable names or attribute names, such as "title, titre, título", is not part of this proposal. Reviewing Erin's introduction, I do not think that was the original intention at all, of "Localized metadata". I think the intent was only to target the contents of text attributes and text variables, not their names. This should resolve some of the objections from ERDDAP.

Is there agreement on that? Are there any objections to this restriction?

taylor13 commented 1 year ago

I would support limiting the proposal to contents (consistent with @Dave-Allured 's interpretation of the original proposal).

turnbullerin commented 1 year ago

@Dave-Allured you are precisely correct, localizing the names of standard attributes is not something I want to put on the table (whether or not variable names need to be in English is not defined by the CF convention today and I dont want to change that. In fact, the CF conventions currently note that standardized vocabularies and standard names are to be in English and this will not change.

To summarize where we are at...

The only change I'm looking for is a way to specify both the English and French titles and programmatically denote which is which. The mostly settled part is to use the BCP 47 language tags to identify them and to put the list of those in the file in a global attribute, put the localized content in attributes/variables with the same name but adding a suffix based on the locale, and to identify the "default" language tag that applies to the non-suffixed attributes/variables.

The debate is mostly on how we generate those suffixes. Of note, a language has ASCII alphabet characters, numbers and hyphens. There have been five viable proposals so far:

We map a suffix to the language tag in the global attribute and let data creators decide what suffix makes the most sense. This is my favourite as it maximizes flexibility and doesn't require any other updates to CF. It also lets people use an expanded Unicode set if they want to but avoid it if it will cause more issues. However, it leads to less standardized names across files. As an example, if you had locale_others = "_fr: fr-CA"; your French title would be in title_fr.
We do a suffix starting with an underscore and replacing hyphens with underscores (i.e. title_fr_CA). I find this one harder to read but it also requires no updates and will lead to a more standardized approach.
We format them as .TAG , e.g. title.fr-CA. This requires adding the period and hyphen to the allowed characters of attributes and variable names
As option 3 but title[fr-CA]. This requires adding two braces and the hyphen. This reads cleaner to me but is maybe more difficult to implement.
Instead of a suffix, we use the namespacing feature that was proposed awhile ago, i.e. fr-CA:title. This is my least favorite since I think the namespacing will cause issues if you want to use another namespace and localize the attribute as well.

The discussion on expanding attribute and variable names only impacts whether options 3-5 are possible. Options 1-2 require nothing new as the contents of these in NetCDF support full UTF-8. There are other good reasons to expand the character set for names but I don't want to debate them here.

That there is a lot of complexity with downstream tools just makes me lean more towards 1 or 2.

@rmendels also thank you for adding a lot of useful nuance to my comment :-).

DocOtak commented 1 year ago

How about an approach that uses meta variables to contain localization information? This approach being inspired by how geometry_containers work in CF. I've coded up an example that took this to the extreme as I extended the idea all the way to localizing the data itself, I'll try to explain it here.

We reserve 2 or 3 new attribute names that apply to global (and potentially variable) attributes:
- locale - a string containing a single BCP 47 locale identifier
- localizations - a string containing a space separated list of variable containing localized attributes for this scope, global or variable.
- (optionally) localized_data only on variables, indicates that the data itself should be localized.

For the attributes:

All CF attributes (and ACDD ones or whatever), continue to use the standardized English attribute names. The locale of the values of those names is contained in the new attribute locale which must contain a BCP 47 locale tag.
If other localizations are available, the attribute localizations must contain a space separated list of other variable names (like the coordinates attribute in data variables) in the dataset.
On a data variable, the special attribute localized_data may be present with some truthy value (I used 1) that indicates the localization providing meta variable also contains localized data that should replace the data.

Localization providing meta variable:

The actual variable name of the meta variables are not controlled, but must follow other naming restrictions already in CF or your environment (ERDDAP, matlab, etc..) so they may appear in that space separated list.
A variable referenced by the localizations attribute is a localization meta variable
This variable contains a locale attribute with a BCP 47 locale tag with the locale of the attribute values
All other attributes on this meta variable are localized versions of the attributes in the referencing scope (global or variable), e.g. title would still be title. Not all attributes of the referencing variable must also be present in on the meta variable, only the localized attributes. I.e. the meta variables attributes must be a strict subset of the referencing variable.
If the localized_data attribute of the referencing variable is set, then this meta variable must contain data with the same shape as the referencing variable.

Other notes:

I intentionally omitted the locale tags in the localizations attribute and opted for it to only contain variable names that themselves has a locale attribute.
I wouldn't specify a default locale if one is not defined in the dataset, but rather separate datasets into "aware" or "naive" indicated by the presence of the locale attribute in the globals or on a variable.

Advantages I found with this approach:

No need to come up with attribute name mangling conventions
No need to allow different characters than already allowed
The global/primary attributes remain uncluttered, only two/three additional attributes with well defined/controlled values
I was able to immediately extend this idea to the data itself without much work, even if we don't want to allow it now.
ERDDAP can ignore the localization providing variables since, IIRC, you need to configure ERDDAP with the specific variable you want to have (even if the underlying netCDF files has more)

Disadvantages:

Many new variables in the file, in my example with many languages, this looks cluttered. I would expect most real world usage to be English + one other locale
Probably other things I cannot think of due to being a pythonista

Edit: formatting of the Disadvantages

Dave-Allured commented 1 year ago

@turnbullerin, ERDDAP #114 says that ERDDAP already allows any Unicode character in a source attribute name. That would include the ASCII characters .-[]: from options 3, 4, 5 in your last message. I think this means that ERDDAP will not crash and will safely sanitize those characters. Convert to underscores, IIRC.

Do you think this is sufficient for 3, 4, or 5 to be minimally acceptable for ERDDAP?

turnbullerin commented 1 year ago

ERDDAP will allow them but convert them into supported syntsx for their XML file. The XML file drives what is put into the files people download. So if I put "title.fr" in my source file, the config file, website list of metadata, and download file will have "title_fr". This is then non-standard and users of the downloaded file won't recognize it so the localized metadata is no longer programmatically extractable (even if it is still maybe human readable).

Also then ERDDAP can't rely on the CF convention for localized metadata to identify and use localized titles and descriptions in the ERDDAP web interface since the names CF specifies won't be available in the XML configuration file (which enforces the A-Z0-9_ convention). Which is my goal here, to make an interoperable standard so that tools like ERDDAP can display localized metadata to users in a language of their choosing.

larsbarring commented 1 year ago

Just to reiterate what was mentioned before (here, here and here, , I think we all agree that it is not on the table to change attribute names of those attributes that can take free text as value. That is, such attributes, for exampletitle, will remain untouched.

What we are discussing is however to complement such attributes with new attribute names that has a suffix that also specify alternative languages (locales) in their free text values, that is attribute****suffix. E.g. title*suffix*, where *suffix* is the new element that we are discussing. As the netCDF data model does not have a separate concept "suffix" in relation to attribute names it means that adding a suffix to an attribute name in effect creates a new attribute name according to netCDF.

In a previous comment @turnbullerin constructively laid out 5 alternative ways to add the suffix to the attribute name. I will summarise these with examples, and then add some questions of clarification and comments.

if you had locale_others = "_fr: fr-CA"; your French title would be in title_fr
suffix starting with an underscore and replacing hyphens with underscores (i.e. title_fr_CA)
format them as .TAG , e.g. title.fr-CA
As option 3 but title[fr-CA]
namespacing feature that was proposed awhile ago, i.e. fr-CA:title

My questions and comments:

A) Option 5 is least favoured by @turnbullerin, to which I agree. Hence I will have no more detailed comments.

B) It is not clear to me whether option 1 means that a language subtag (the "-CA" part in the example) is always dropped, or if it can it be included, e.g with the hyphen - replaced by an underscore.

C) In option 2 the language subtag ("-CA") may or may not be present as a "_CA".

D) Generally in CF the underscore _ is used to denote the \<space> character where one is not allowed for technical or formatting (netCDF or CF) reasons.

E) In option 1 and 2 the suggestion is to overload the underscore with a new role, namely as a delimiter between an attribute name (which may include one or several underscores) and its suffix (which may or may not contain an underscore). Then suffixes can be added without changing the current character set restrictions imposed by CF. Thus it may seem like an easy solution, but I am not convinced that this is a good idea for several reasons. Firstly, when parsing the attribute name to extract the suffix (language name tag) may be difficult to determine which underscore is the one to break at. Secondly, once the suffix is extracted it is not in the same format according to the standards.

F) Options 3 and 4 are very similar, the language name tag is formatted with a hyphen which follows the standards. It will however require that the CF character set is extended to allow the hyphen -, and either the period . or the two square brackets [ ].

I will have to stop for a moment, but will come back to F) and interoperability in my next comment.

turnbullerin commented 1 year ago

@larsbarring for clarity, in Option 1 the format of the suffix is entirely up to the originator of the file and is specified completely in locale_others. All of the following files would be valid ways of specifying the French title

:locale_others = "_fr: fr-CA";
:title_fr = "French Title";

OR

:locale_others = "_fr_CA: fr-CA";
:title_fr_CA = "French Title";

OR

:locale_others = ".fr-CA: fr-CA";
:title.fr-CA = "French Title";

OR 

:locale_others = "[fr-CA]: fr-CA";
:title[fr-CA] = "French Title";

OR

:locale_others = "--foobar: fr-CA";
:title--foobar = "French Title";

The list of valid suffixes can then be determined from the locale_others attribute and any attribute or variable ending in a valid suffix is then considered to be a localized version of the non-suffixed attribute or variable. The downside of this option is the potential for confusion and a more complex parsing algorithm but the upside is we allow data originators to define the scheme that best works for them and their use case and doesn't conflict with any other names in their file. So if following ERDDAP conventions is important, they can specify suffixes that meet ERDDAP conventions. If they use periods to mean something in their variable names, they can use the square bracket syntax instead. Or they can invent their own.

turnbullerin commented 1 year ago

How about an approach that uses meta variables to contain localization information? This approach being inspired by how geometry_containers work in CF. I've coded up an example that took this to the extreme as I extended the idea all the way to localizing the data itself, I'll try to explain it here.

We reserve 2 or 3 new attribute names that apply to global (and potentially variable) attributes:

locale - a string containing a single BCP 47 locale identifier

localizations - a string containing a space separated list of variable containing localized attributes for this scope, global or variable.

(optionally) localized_data only on variables, indicates that the data itself should be localized.

For the attributes:

All CF attributes (and ACDD ones or whatever), continue to use the standardized English attribute names. The locale of the values of those names is contained in the new attribute locale which must contain a BCP 47 locale tag.

If other localizations are available, the attribute localizations must contain a space separated list of other variable names (like the coordinates attribute in data variables) in the dataset.

On a data variable, the special attribute localized_data may be present with some truthy value (I used 1) that indicates the localization providing meta variable also contains localized data that should replace the data.

Localization providing meta variable:

The actual variable name of the meta variables are not controlled, but must follow other naming restrictions already in CF or your environment (ERDDAP, matlab, etc..) so they may appear in that space separated list.

A variable referenced by the localizations attribute is a localization meta variable

This variable contains a locale attribute with a BCP 47 locale tag with the locale of the attribute values

All other attributes on this meta variable are localized versions of the attributes in the referencing scope (global or variable), e.g. title would still be title. Not all attributes of the referencing variable must also be present in on the meta variable, only the localized attributes. I.e. the meta variables attributes must be a strict subset of the referencing variable.

If the localized_data attribute of the referencing variable is set, then this meta variable must contain data with the same shape as the referencing variable.

Other notes:

I intentionally omitted the locale tags in the localizations attribute and opted for it to only contain variable names that themselves has a locale attribute.

I wouldn't specify a default locale if one is not defined in the dataset, but rather separate datasets into "aware" or "naive" indicated by the presence of the locale attribute in the globals or on a variable.

Advantages I found with this approach:

No need to come up with attribute name mangling conventions

No need to allow different characters than already allowed

The global/primary attributes remain uncluttered, only two/three additional attributes with well defined/controlled values

I was able to immediately extend this idea to the data itself without much work, even if we don't want to allow it now.

ERDDAP can ignore the localization providing variables since, IIRC, you need to configure ERDDAP with the specific variable you want to have (even if the underlying netCDF files has more)

Disadvantages:

Many new variables in the file, in my example with many languages, this looks cluttered. I would expect most real world usage to be English + one other locale

Probably other things I cannot think of due to being a pythonista

Edit: formatting of the Disadvantages

Just to make sure I understand, this is proposing basically one extra variable with no data per language that would have the global attributes set as variable attributes? And one extra variable per language per variable with both localized metadata (i.e. long_name) and, if applicable, the actual data localized? Then tracking all of that with attributes to connect the dots?

This feels inefficient to me but I'll let others weigh in as well :).

larsbarring commented 1 year ago

@turnbullerin thanks for explaining how you envisage you option 1.

I will here continue my previous comment that I had to pause. As I wrote, I think that we should very careful in overloading the underscore with conceptually new roles. I interpret earlier comments from @turnbullerin, and @aulemahal (and possibly others) that this is seen as a necessity to meet restrictions from downstream systems and applications, rather than something desirable in its own right.

While I do think that interoperability is a key concept for CF (essentially that is why we have CF in the first place...). But there will always be software somewhere for which some new functionality or concept will not be possible at all to implement, or just not practical for some reason. Hence I think that we have to be concrete and specific when using concerns for interoperability as a argument.

In this issue ERDDAP has be used as a use-case of an important downstream application. Thank you @rmendels for your comment regarding ERDDAP, and for the link to the https://github.com/ERDDAP/erddap/issues/114 issue !

When browsing through that issue I see that the conversation soon expanded to deal with the implications for ERDDAP if all (or at least a large set of) unicode characters were to be allowed in attribute names, and in variable names. In that respect it pretty much mirrors what is going on in https://github.com/cf-convention/cf-conventions/issues/237. This was maybe where we were at in our conversation here a couple of weeks ago when "your issue" was initiated. Since then the conversation here has developed so that now only two or three additional characters are needed to implement localization. And these are hyphen -, as well as either period . or the two square brackets [ ], all from the good old ASCII character set. @rmendels do you think this in any way makes it more tractable for ERDDAP?

DocOtak commented 1 year ago

@turnbullerin Yes, basically one extra variable per language/locale, I didn't want to use the term "namespace" but this is a mechanism I saw in the netCDF-LD ODC propsal. My example is a little busy as I hadn't thought about the variable localization at all yet in these discussions. And when I realized I could localize data, it felt really powerful and I immediately tried it. The localization variables could be shared between data variables so perhaps not every data variable would need and independent localization variable (e.g. if it uses entirely controlled attributes).

The intent of my proposal was to avoid all the attribute name convention arguments. My proposal uses some pretty well established CF mechanisms to keep our proposal from conflicting with what might already be in the file. One of the other issues that the ERDDAP team raised was how to parse attribute names and how that might conflict in existing datasets.

Using a non standard BCP 47 locale tag (i.e. one that has had dashes replaced with the underscore) I think would be bad. Even though I disagree strongly with using netCDF variable and attribute names as program language symbols... I'm now more hesitant with introducing any sort of parsing grammar for the attribute names themselves given the concerns expressed by the ERDDAP team (@Dave-Allured ?). So my most recent proposal completely does away with needing to parse attribute names other than matching exactly in the same file.

I suspect that, given what I know about how ERDDAP is configured, the extra variable proposal would allow localizations to be added to an existing dataset on ERDDAP today and it would ignore the extra variables, unless reconfigured to be aware of them. The extra attributes in the data variables and global attributes would not have anything ERDDAP breaking in them. ERDDAP would continue to be unaware of localizations in the dataset until that functionality is added.

I would like to prepare a "real" data file with only French and English to see what it would actually look like. @turnbullerin do you have any thing that could be used for this example?

larsbarring commented 1 year ago

From the recent conversation over at https://github.com/ERDDAP/erddap/issues/114 I think the position of the ERDDAP folks is clear. They are pretty dependent on a character set limited to [A-Za-z0-9_] for variables and attributes.

From my side I do not have much more to contribute regarding how to introduce localization into CF than what I stated before. If the top priority is to support existing software (irrespective of age and provenance) then using underscore to implement localization seems as the only option. The drawback is that this introduces the new role for the underscore to be a delimiter between attribute and locale. Moreover, and importantly, CF would then become even more locked down into the current character set restrictions, while the general netCDF community goes to the other extreme by allowing almost all of Unicode. An at the same time as there are (and will be ) more and more well motivated requests from various communities for relaxing the restrictions. But that is a conversation more suited in https://github.com/cf-convention/cf-conventions/issues/237.

If the conclusion is that the CF community should go ahead with underscore to implement localisation i will not be the one that blocks.

DocOtak commented 1 year ago

@larsbarring I've attempted an option that eliminates the use of underscore or any attribute name parsing (only attribute values) in this comment. Please take a look. If my kitchen sink example is too busy or hard to understand, I could make a simpler one.

PS: @turnbullerin Don't be discouraged by this long process, actual changes to the conventions take time and everyone here is a volunteer

larsbarring commented 1 year ago

Hi @DocOtak, it took me a little while and some experimentation to get into what you suggest. To me it looks like a general and powerful approach, but also a bit awkward by requiring one variable per locale and per variable that have localized attributes (as @turnbullerin notes). This might be a possible solution. At the same time I was looking back with (at least somewhat) fresh eyes on other suggested solutions:

The solution I was pushing for (e.g. title.fr_CA) has the drawback that it reserves the period for a specific function. If the aim is to have the same rules for attribute names and variable names, which I am not sure we have to, this goes against what has been requested elsewhere. And this problem will remain the same irrespective of which character(s) we might select to delimit the locale. Because of that we might as well stay within the current character set, which means that I am [reluctantly!] accepting underscore as delimiter for this particular purpose.
Looking at @turnbullerin's second alternative ("suffix starting with an underscore and replacing hyphens with underscores (i.e. title_fr_CA)": I think this alternative is complicated by the fact that the attribute may have any number of underscores and the locale identifier, if at all present, might have zero or one underscore. This is a complication when decoding at which underscore to make the break between attribute and its locale.
Turning to the first alternative "if you had locale_others = "_fr: fr-CA"; your French title would be in title_fr", I think the flexibility is more of a complication that being helpful.

Based on these comments, here is another simplified alternative inspired by Erin's first and second alternative. Only the one global attribute locales is needed:

If it is not present there is no information as to the language used in the relevant attributes, and there are no localized attributes. This is the present situation.
If present it will contain a space separated list of \<key>:\<tag> pairs.
The \<key> is either the [reserved] word default, or a string beginning with an underscore, and no underscores elsewhere.
The \<tag> is a known IETF BCP 47 language tag.
If the key is default then the \<tag> is supposed to inform, without any guarantee, about the language used in the relevant attributes that do not have a have a language tag as suffix. If this key is not present there is no information as to the language used in these attributes.
Localized attributes are identified by the suffix formed by attaching a key from the list at the end of the attribute name.
The localization will be applied to all attributes (throughout the file) that has a suffix that is among the tags in the locales
If a tag is not used as suffix in any attribute name then nothing happens. If the suffix of an attribute is not a key listed in locales then the attribute (incl. the suffix) is basically not a CF attribute.

An example:

// global attributes:
        :locales = "default:en_US _sv:sv _esmx:es-MX" ;
        :title = "English Title" ;
        :title_sv = "Svensk titel" ;
        :title_esmx = "Título en español" ;

In this example I have used "mangled" language tags as keys, but this is not requires (but good practice?). This has the advantage of easy reading for humans, and still simple decoding for software. If one wants to restrict the freedom in choosing keys an alternative is to allow only "loc1", "loc2", "loc3", .... but I do not think this is necessary.

This suggestion has the following advantages: it requires only one global attribute, the keys (suffixes) only have an underscore as first character, the tags follow the established format, it is"lightweight".

It seems so simple that I wonder if I have overlooked something?

turnbullerin commented 1 year ago

@larsbarring Your concept seems very similar to what I was proposing and I think it is great. I think there's some value in separating out the "default" tag but I'm not married to the idea of it being in a separate attribute; having it with a "magic" suffix seems fine too. I'm just not a fan of "magic number" type things and adding an extra attribute made more sense to me. Alternative to a magic suffix, we could say that locales must follow the format DEFAULT_TAG SUFFIX1:TAG1 SUFFIX2:TAG2 ... and just omit a suffix entirely for it?

Is there value in restricting the character set like this though? From what I was reading, CF doesn't tend to make things mandatory without a good cause. I appreciate the "underscore = space" argument but I think that's actually a good reason to not make it REQUIRED so others can make their own decisions on how to mangle and what characters to use or omit.

Instead, I would suggest we RECOMMEND starting suffixes with an underscore followed by ASCII letters (A-Z and a-z) for maximum compatibility. To avoid parsing complications I would suggest though that we say that colons MUST NOT be part of the suffix (and they won't be part of the language tag by BCP 47) which makes it very easy to parse and to identify the default tag if we use my suggest above (it is the one without a colon after splitting on spaces)