Open turnbullerin opened 4 months ago
Here is the draft text we were looking at in the discussion thread
ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)
Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.
ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)
Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.
NEW SECTION
Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. English alphabet), and/or other features of the natural language. Locales are defined by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as en-CA for Canadian English in the default script. This section defines the standard pattern for localizing the contents of a NetCDF file.
Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.
A "localized file" is one that provides the global attribute localizations. If present, the attribute must contain a space-delimited list of words in the format suffix: language_tag. For example, the string "default: en _fr: fr-CA _es: es-MX" specifies that the default locale of the file is en, that the suffix _fr indicates content in the fr-CA locale, and that the suffix _es indicates content in the es-MX locale. Suffixes may be any text string allowed in an attribute or variable name, but it is strongly recommended that they be chosen so that they are clearly associated with the locale they represent.
The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen.
An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section.
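As a rough illustration of how a reader might parse the localizations attribute in the draft format above (a Python sketch only; the function name parse_localizations is made up for this example and does not come from any library):

```python
def parse_localizations(value):
    """Parse a string such as "default: en _fr: fr-CA _es: es-MX".

    Returns (default_locale, {suffix: language_tag}).
    Assumes the draft format: alternating "keyword:" and language tag words.
    """
    words = value.split()
    if len(words) % 2 != 0:
        raise ValueError("expected 'suffix: language_tag' pairs")
    default_locale = None
    suffixes = {}
    for key, tag in zip(words[0::2], words[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"keyword {key!r} must end with ':'")
        key = key[:-1]  # drop the trailing colon
        if key == "default":
            default_locale = tag
        else:
            suffixes[key] = tag
    return default_locale, suffixes


default_locale, suffixes = parse_localizations("default: en _fr: fr-CA _es: es-MX")
```

This version is deliberately strict: it rejects a string with no space after the colon rather than guessing what was meant.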
Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.
Localized attributes are created by appending a locale suffix to the usual attribute name. For example:
variables:
:localizations = "default: en-CA _fr: fr-CA _es: es-MX";
:title = "English Title"; // English title
:title_fr = "Titre française"; // French title
:title_es = "Título en español"; // Spanish title
:summary = "English Summary";
:summary_fr = "Sommaire française";
// omitted Spanish summary means English will be used instead
double salinity(i);
salinity:long_name = "Salinity";
salinity:long_name_fr = "Salinité";
salinity:long_name_es = "Salinidad";
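A minimal sketch of the fallback rule illustrated by the example above (Python, with attributes held in a plain dict; localized_attribute is a hypothetical helper, and it matches language tags exactly rather than applying the full BCP 47 matching rules):

```python
def localized_attribute(attrs, name, suffixes, wanted_tag):
    """Return `name` in the locale `wanted_tag`, else the default-locale value.

    attrs    : dict of attribute name -> value for one scope (global or variable)
    suffixes : dict of suffix -> language tag, e.g. {"_fr": "fr-CA"}
    """
    for suffix, tag in suffixes.items():
        # Language tags are compared case-insensitively.
        if tag.lower() == wanted_tag.lower() and name + suffix in attrs:
            return attrs[name + suffix]
    return attrs[name]  # no localized version: use the default locale


attrs = {
    "title": "English Title",
    "title_fr": "Titre française",
    "title_es": "Título en español",
    "summary": "English Summary",
    "summary_fr": "Sommaire française",
}
suffixes = {"_fr": "fr-CA", "_es": "es-MX"}

french_title = localized_attribute(attrs, "title", suffixes, "fr-CA")
spanish_summary = localized_attribute(attrs, "summary", suffixes, "es-MX")
```

Here spanish_summary falls back to the English summary because no summary_es attribute exists.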
Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable. Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e. weather_obs[0] is the English text and weather_obs_fr[0] is the French text of the same value).
variables:
:localizations = "default: en-CA _fr: fr-CA _es: es-MX";
char weather_obs(i);
weather_obs:long_name = "Weather Conditions";
char weather_obs_fr(i);
weather_obs_fr:long_name = "Observations Météorologiques";
char weather_obs_es(i);
weather_obs_es:long_name = "Observaciones Meteorológicas";
data:
weather_obs = "sunny", "rainy", ...;
weather_obs_fr = "ensoleillé", "pluvieux", ...;
weather_obs_es = "soleado", "lluvioso", ...;
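The consistency rule for localized variables could be checked along these lines (a Python sketch; the dict layout with dtype, dimensions and data keys is an assumption made for this example, not a real netCDF API). Whether elements appear in the same order cannot be verified mechanically, so only the type, dimensions and element count are compared:

```python
def check_localized_variable(base, localized):
    """Return a list of problems found when comparing a localized variable
    with the default-locale variable it translates."""
    problems = []
    if base["dtype"] != localized["dtype"]:
        problems.append("data type differs")
    if base["dimensions"] != localized["dimensions"]:
        problems.append("dimensions differ")
    if len(base["data"]) != len(localized["data"]):
        problems.append("number of elements differs")
    return problems


weather_obs = {"dtype": "char", "dimensions": ("i",), "data": ["sunny", "rainy"]}
weather_obs_fr = {"dtype": "char", "dimensions": ("i",), "data": ["ensoleillé", "pluvieux"]}

problems = check_localized_variable(weather_obs, weather_obs_fr)
```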
ADDITION TO APPENDIX A
References
https://www.rfc-editor.org/info/bcp47
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
EDIT NOTES:
cell_methods and updated mine to match
This looks good. Nice work!
flag_meanings seems like it ought to be locale-aware.
I think it could be hard to make history locale-aware. The history attributes in the files I work with mostly consist of sequences of command-line invocations, which you don't want to translate because then it's not the command that was actually used. When there is a comment, it's usually something that a piece of software added automatically, rather than the file creator, so there's no control over what language it's in. Depending on the tools you use, it seems like it wouldn't be hard to end up with a history attribute that's multi-lingual or in a different language than the one you want to declare as the default.
I'm struggling with this for reasons I'm having trouble articulating (and am not sure are even valid reasons), but I'll try.
I don't think CF should introduce attribute and variable name modifications as a concept. Section 2.5 starts out with:
This convention does not standardize variable names.
With this proposal, it would be except in cases of localization. This proposal also places restrictions on variable/attribute names and their suffixes.
I think the ERDDAP "elephant in the room" is important and relevant here. For ERDDAP adoption in Canada, it needs to display localized content, so this proposal must work in a real world application. I'm fearful that this will not work in ERDDAP for some reason we don't know and won't know until an implementation is worked on and the standard has already been codified.
Would there be willingness on the ERDDAP side to implement something that isn't quite in CF yet? And on the CF side, would we be willing to codify whatever ends up working best in ERDDAP? Realizing that someone involved in both communities might need to actually do the coding.
@DocOtak @turnbullerin As I told Erin, the best way to help see if this can be done is to contribute code to ERDDAP to do so and have it tested. There are ready-to-go development environments, and compiling and testing have been simplified. Chris is only part-time but would, I am sure, be happy to work with people on this.
For an update, the ERDDAP issue now has a link to a working prototype I developed based on this proposal for the title attribute - see https://github.com/ERDDAP/erddap/issues/114
This is heading towards resolution and I don't want to impede things if the answer is 'no', but over in discussion #341 we're talking about whether string arrays should be allowed, so I just want to ask the question:
Would it be dramatically simpler to solve this problem if we used an array of strings for different localizations of free-text attributes?
The current solution is effectively creating an ad hoc array of attributes by appending suffixes to the attribute name. Would it make things easier on the implementation side to have a for-real array of attribute values with, say, a localization prefix (e.g., [fr-CA]:) at the beginning of each string?
My guess is that the answer on the ERDDAP side is probably 'no', and since that's the primary driver here, ERDDAP-viability is paramount, so if that's the case, I think things should proceed as if I had not commented at all. But I wanted to bring it up before we get locked in to a solution, just in case the alternative would make everyone's lives a lot easier.
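To make the string-array alternative concrete, here is a small Python sketch of how a reader might split such an array. The "[tag]:" prefix is taken from the comment above; treating a string with no prefix as the default locale is an assumption added for this example, and split_prefixed is a hypothetical name:

```python
import re

# "[fr-CA]: text" -> tag "fr-CA", text "text"
_PREFIX = re.compile(r"^\[([A-Za-z0-9-]+)\]:\s*(.*)$", re.DOTALL)


def split_prefixed(values, default_tag):
    """Map language tag -> text for an array of optionally prefixed strings."""
    result = {}
    for value in values:
        match = _PREFIX.match(value)
        if match:
            result[match.group(1)] = match.group(2)
        else:
            # Assumption: an unprefixed string is in the default locale.
            result[default_tag] = value
    return result


titles = split_prefixed(
    ["English Title", "[fr-CA]: Titre française", "[es-MX]: Título en español"],
    default_tag="en",
)
```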
Interesting idea -- thanks for a potential use case!
But I fully agree with you @sethmcg that for this particular issue of localization we should go with whatever most effectively, and with the least effort, moves this forward towards a conclusion.
@sethmcg @larsbarring
I think more progress will be made at the CF Workshop hackathon. I'd personally like to revisit/present again a container variable solution that does not need any string parsing or manipulation, other than the BCP 47 language tag which tends to have library support.
I'm hopeful I'll be able to attend that workshop in person.
Andrew, @DocOtak, it would be great if you were able to come in person! Just a friendly reminder to register for the event so that the security clearance formality can be sorted (and please register even if attending only virtually :-)
@larsbarring filled out the form yesterday, hopefully it's all ok.
All good -- thanks! /Lars
I suggest that a container solution will be workable, and will have about the same complexity compared to the original concept for multiple attribute names with suffixes. I favor suffixes over container, for readability and transparency. By transparency I mean that the meaning of suffixes would be obvious to the casual user, with only the knowledge that the suffixes are BCP 47, and no other details from the CF document.
One particular advantage of suffixes is that you get backward compatibility with no complications. If you have e.g. title and title.fr, then existing naive applications will continue to operate normally and display title, with no awareness nor interference from alternative attributes. With a container, I am afraid that you would need either duplicate copies of the default string, or else some other awkward device to tell the container to look elsewhere for the default string.
Erin, you have put a lot of thought and care into your current proposal seen above. What you have envisioned is quite workable. However I recommend simplification. We have discussed some of this before.
"... instead, data providers are free to choose suffixes that meet their use case" No. CF needs to choose a single concise solution, not try to please everyone including me.
Omit the localizations attribute, and the inventorying and indirection that it represents. Or at least make it optional. Code BCP 47 abbreviations directly into whatever tagging mechanism is chosen. E.g. title.fr-CA = "Titre française". The same can be done with containers. As I have said before, file scanning to build an inventory is easy and efficient.
Omit section TBD.1 Localized Files. Allow independent localization for each object in a Netcdf file. This does not block any external requirement for a data set to be "localization complete".
Move localization of variables to a separate discussion, and focus on attributes only. This is only a mechanical suggestion because the attributes discussion is hard enough, and variables have different constraints and considerations. I have a nice alternative suggestion for variables when the time is right.
I was really hoping to participate in the hackathon for this. I had even managed to get the ERDDAP source working on my computer but couldn't do much more than build/run it. I'm not a Java person and was really hoping for help on this. While at the hackathon, I was planning (see above thread) to have a container variable possible solution implemented in ERDDAP that could be compared to the one done already. I'm motivated by two things:
I strongly feel that CF should avoid introducing new attributes whose values require custom parsing algorithms. The complexity of parsing the strings in e.g. cell methods was, I think, a misstep, and in very casual conversation at the workshop (I think over beers in the evening), the feeling was that these are easy to write, but very difficult to parse and use as someone receiving the data. CF should trend toward attributes whose values are either exactly defined by CF in an enumeration, even a massive one like the standard_name, or are defined by an external standard that we don't change/extend (like BCP 47). Even in the example ERDDAP implementation, some of the difficulties of dealing with custom parsing were noted in a comment:
// TODO: we should enforce a better removal of ":" to ensure the string is well formatted"
A container variable was just added for describing lossy compression via quantization.
I think I made some example files a year or two ago, but IIRC they were complex to show capabilities. I'll try to make some simple files soon and show them here.
One particular advantage of suffixes is that you get backward compatibility with no complications. If you have e.g. title and title.fr, then existing naive applications will continue to operate normally and display title, with no awareness nor interference from alternative attributes. With a container, I am afraid that you would need either duplicate copies of the default string, or else some other awkward device to tell the container to look elsewhere for the default string.
I was mistaken, so I retract this comment. This backward compatibility could be done exactly same way with either containers or suffixes. Either way, I expect there would normally be a traditional attribute in the default locale (to be determined elsewhere). Then the "extended attributes" would consist ONLY of the non-default values. Here is a simple imagined container example.
:title = "English Title";
:title_localized = "fr-CA:Titre française; es-MX:Título en español";
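For illustration only, reading the imagined title_localized value above might look like the following Python sketch (parse_localized_attribute is a hypothetical name). It assumes ";" separates entries and the first ":" separates the tag from the text, so it would break on translated text that itself contains a semicolon, which is the kind of custom parsing concern raised earlier in this thread:

```python
def parse_localized_attribute(value):
    """Parse "fr-CA:Titre; es-MX:Título" into {"fr-CA": "Titre", "es-MX": "Título"}."""
    result = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        tag, sep, text = item.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in {item!r}")
        result[tag.strip()] = text.strip()
    return result


title = "English Title"
localized = parse_localized_attribute("fr-CA:Titre française; es-MX:Título en español")
# A reader wanting fr-CA takes localized["fr-CA"]; any other locale falls back to title.
shown = localized.get("fr-CA", title)
```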
Hey all, sorry for missing the hackathon, I had some significant health issues come up.
In terms of using .fr-CA, I think this is not a feasible solution for two reasons.
One, the CF conventions state "It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores.". I would not want to introduce a specification that makes it mandatory to disobey a recommended practice.
Two, the ERDDAP folks have made it pretty clear that they won't change their own policy of requiring attributes to follow the CF recommendation because of some valid concerns. So any pattern using non recommended characters would make it incompatible with ERDDAP.
For these reasons, if we are going to use a single convention, it would then either require breaking compatibility with ERDDAP and changing the current recommendation (which was extensively debated and decided against) or be set to something that uses only letters and underscores, like "_fr_CA" or "__fr_CA". I'm ok with this. I think it is still worth listing them in an attribute in a way that makes it clear what the default is and what the available options are - otherwise, identifying the languages available would require checking every possible option, which is an ever growing list. We'd just need to agree on a pattern for doing it, which was what opened a big debate and resulted in me suggesting we just let users specify their own suffix so that you can break compatibility with ERDDAP and the CF recommendations if you want, at your own risk.
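A small Python sketch of the "_fr_CA" style of suffix, together with a check against the naming recommendation quoted above (names begin with a letter and contain only letters, digits and underscores). Both function names are made up for this example. Note that the mapping is not reversible in general, since an underscore in the suffix could stand for a hyphen between any two subtags:

```python
import re

_RECOMMENDED = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def suffix_for(tag):
    """Turn a BCP 47 language tag such as "fr-CA" into a suffix such as "_fr_CA"."""
    return "_" + tag.replace("-", "_")


def is_recommended_name(name):
    """True if the name follows the recommended CF character set."""
    return _RECOMMENDED.match(name) is not None


underscore_name = "title" + suffix_for("fr-CA")
dotted_name = "title.fr-CA"
```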
On the topic of containers, I don't see how it's implementable in ERDDAP since ERDDAP only allows for global and variable attributes and ERDDAP, as far as I know, really doesn't like "dummy" variables. Plus I think the implementation is still too complex for it to be easy to add languages.
I'll try to keep tabs on this but I have some healing to do for the next couple of weeks.
Dear Erin
Sorry for your illness; I'm sure we all hope you recover soon.
In Sect 2.3 of the working draft, we quite recently added, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." Therefore .fr-CA in an attribute name would not go against CF recommendations. If I remember correctly, this was done to support the present issue.
Best wishes
Jonathan
Erin,
I hope that your situation is improving.
I just want to make a couple of comments in relation to your status update.
Regarding the .fr-CA solution: is your concern simply based on the period and hyphen being allowed but not recommended, or are there deeper concerns regarding interoperability? I believe the specific wording was more of a mistake than a deliberate choice. And we are right now discussing updating which characters (Unicode codepoints) are recommended, allowed and disallowed, as well as making the distinction between these categories clearer.
__fr_CA is on the table, and I just want to draw attention to the fact that double underscores (Unicode low_lines) already have special meaning in relation to OGC netCDF-LD, specifically for prefixes. As I remember the previous discussions on localization, I think we mainly thought of the tags as suffixes. As I have no knowledge of OGC netCDF-LD, I cannot judge if these two uses of double underscore are consistent with each other, or whether they might be contradictory, thus creating interoperability clashes. Ping @ChrisJohnNOAA
The main concern I have from ERDDAP is that we allow for exporting of data in many different formats intended for use in a variety of programming languages, both of which can have limitations on what characters can be used in variable names. If this suffix is included in the data files that ERDDAP exports, we are severely limited in what characters can be included.
If you need more context than that, this part of the ERDDAP discussion might be useful: https://github.com/ERDDAP/erddap/issues/114#issuecomment-1793133970
both of which can have limitations on what characters can be used in variable names.
@ChrisJohnNOAA, okay. How about we focus on attribute names only, and skip variable names for now? By my analysis, any reasonable user program in any language should regard attribute names with special characters as unknown optional attributes, and ignore them by default. Will that be okay for ERDDAP?
Here is my "Container Variable" proposal for localizations
Define two new attributes: locale and localizations.
Localization data is contained in a container variable, like the existing geometry container and the newly added quantization container variables. Localization containers must contain a locale attribute with a BCP 47 language tag. All other attributes on this container variable are localized versions of attributes in the referencing scope (global or variable).
Why I like this:
locale attribute is defined well by an external standard
localizations attribute is using well established CF convention of a space separated string (e.g. coordinates, ancillary variables)
The last three bullet points I view as a security feature that might be relevant when implementing in something like ERDDAP that I think has some US Government requirements imposed on it.
Here is a simple example for what this looks like in a full CDL:
netcdf locale_example {
variables:
double alt_locale ;
alt_locale:locale = "fr-CA" ;
alt_locale:title = "Titre en français" ;
// global attributes:
:locale = "en-CA" ;
:title = "English Title" ;
:localizations = "alt_locale" ;
data:
alt_locale = NaN ;
}
If the user's locale was Canadian French, the attributes from the available Canadian French localization container variable would replace those in the global attributes. BCP 47 defines some algorithm on quality weights and how to fall back to find a localized string; we must define a default behavior if no matching locale is found, which for us would be the locale that is not in a container variable. In the above example, the default locale is en-CA. If a variable referenced by localizations does not exist in the netCDF file, it should be ignored.
For ERDDAP, I think the combination of combinedGlobalAttributes and the contents of the edv array should allow the construction of a data structure that could hold all the localization data, and the java.util.Locale class looks like it can handle all the matching and parsing. Since this proposal uses the BCP 47 language tags without any additions, the contents of the locale attribute would be able to be passed in without additional processing (presumably the library would throw if the tag is ill-formed).
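The resolution behaviour described for the container proposal can be sketched in a few lines of Python, with the file contents held in plain dicts. This is an illustration under stated assumptions, not an ERDDAP or netCDF implementation: lookup and resolve_global_attributes are hypothetical names, and the matching is a simplified form of the RFC 4647 "lookup" scheme (part of BCP 47) that drops trailing subtags until a tag matches.

```python
def lookup(available, wanted):
    """Simplified RFC 4647-style lookup: progressively truncate `wanted`.

    Returns the matching tag from `available`, or None if nothing matches.
    """
    by_lower = {tag.lower(): tag for tag in available}
    parts = wanted.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in by_lower:
            return by_lower[candidate]
        parts.pop()  # drop the last subtag and try again
    return None


def resolve_global_attributes(global_attrs, containers, wanted):
    """Overlay the best-matching localization container on the global attributes.

    containers : dict of container variable name -> attribute dict
    """
    resolved = {k: v for k, v in global_attrs.items() if k != "localizations"}
    by_tag = {}
    for name in global_attrs.get("localizations", "").split():
        container = containers.get(name)
        if container is None:
            continue  # a referenced variable that does not exist is ignored
        if "locale" in container:
            by_tag[container["locale"]] = container
    match = lookup(by_tag, wanted)
    if match is not None:
        resolved.update(by_tag[match])  # also replaces the locale attribute
    return resolved


global_attrs = {
    "locale": "en-CA",
    "title": "English Title",
    "localizations": "alt_locale missing_variable",
}
containers = {"alt_locale": {"locale": "fr-CA", "title": "Titre en français"}}

french = resolve_global_attributes(global_attrs, containers, "fr-CA")
german = resolve_global_attributes(global_attrs, containers, "de")
```

With no German container available, german keeps the default en-CA attributes unchanged.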
Other situations:
locale and localizations list mechanism. (Question: should the locale attribute be optional if locale is present in the global attributes?)
@DocOtak @larsbarring - Above it says:
"The last three bullet points I view as a security feature that might be relevant when implementing in something like ERDDAP that I think has some US Government requirements imposed on it."
Besides the points made by @ChrisJohnNOAA that we try to be compatible with as many platforms as possible, which puts constraints on us that we can't control, we get all sorts of security scans, many of which flag everything under the sun. These are scanners like Nessus and Qualys, and we try to make sure that to the best of our knowledge an ERDDAP release will not fail a security scan. This can also place some restrictions on what we can do (and is why we encourage people to upgrade to the latest version - the last thing we want is for an ERDDAP running anywhere to fail a scan, which could lead to a lot of shutdowns until fixed - many of you may not remember when that happened to OPeNDAP/TDS quite a few years ago).
@rmendels I knew ERDDAP has some additional constraints like that, but wasn't sure of the details. I do know that input sanitization is "hard" and want something that the code owners of ERDDAP can be confident in.
@DocOtak In principle I like the 10 bullet points of your "Container Variable" solution, but I am not sure how it plays out when other components of a file are also localized. Would you be able to create a small but complete CDL file?
At the same time I (still) kind of like my own suggestion based on Erin's previous ideas. Here is the CDL for a small complete mockup example:
netcdf test_LB {
dimensions:
lat = 5 ;
lon = 1 ;
variables:
double lat(lat) ;
lat:standard_name = "latitude" ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
lat:axis = "Y" ;
double lon(lon) ;
lon:standard_name = "longitude" ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
lon:axis = "X" ;
float uas(lat, lon) ;
uas:standard_name = "eastward_wind" ;
uas:long_name = "Zonal Surface Wind Speed" ;
uas:long_name_sv = "Zonal vindhastighet nära marken" ;
uas:long_name_fr = "Vitesse du vent zonal en surface" ;
uas:long_name_esmx = "Velocidad de viento en superficie" ;
uas:units = "m s-1" ;
uas:_FillValue = 1.e+20f ;
uas:ancillary_variables = "uas_qc" ;
byte uas_qc(lat, lon) ;
uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
uas_qc:long_name_sv = "Data kvalitet hos zonal vindhastighet nära marken" ;
uas_qc:long_name_fr = "Qualité des données sur Vitesse du vent zonal en surface" ;
uas_qc:long_name_esmx = "Calidad de datos de la velocidad de viento zonal en superficie" ;
uas_qc:standard_name = "status_flag" ;
uas_qc:_FillValue = -128b ;
uas_qc:valid_range = 0b, 2b ;
uas_qc:flag_values = 0b, 1b, 2b ;
uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
uas_qc:flag_meanings_sv = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
uas_qc:flag_meanings_fr = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
uas_qc:flag_meanings_esmx = "calidad_buena sensor_no_funcional fuera_rango_válido" ;
// global attributes:
:Conventions = "CF-1.8" ;
:locales = "default:en-US _sv:sv _fr:fr _esmx:es-MX" ;
:title = "This is a test" ;
:title_sv = "Detta är ett test" ;
:title_fr = "Ceci est un essai" ;
:title_esmx = "Este es un ensayo" ;
data:
lat = 0, 5, 10, 15, 20 ;
lon = 0 ;
uas = 1, 2, 4, 48, 160 ;
uas_qc = 0, 0, 0, 2, 1 ;
}
It does not tick off as many of the 10 bullet points, and I will in no way try to push for this solution.
But maybe @DocOtak you could do something similar for your "Container Variable" solution to let the ERDDAP folks -- and everyone else of course -- see for themselves and create small test files using ncgen -b.
both of which can have limitations on what characters can be used in variable names.
@ChrisJohnNOAA, okay. How about we focus on attribute names only, and skip variable names for now? By my analysis, any reasonable user program in any language should regard attribute names with special characters as unknown optional attributes, and ignore them by default. Will that be okay for ERDDAP?
I was asked what the ERDDAP concern with ".fr-CA" was and I mentioned the concerns because in the past I've seen recommendations to treat Attribute and Variable names the same in CF. To be clear I think allowing '.' and '-' in variable names is a very bad idea.
I don't think attribute names are generally exported from ERDDAP for non-nc file formats. Most likely using '.' and '-' in attribute names would be fine from the ERDDAP perspective, though I haven't fully audited the ERDDAP code for that.
By my analysis, any reasonable user program in any language should regard attribute names with special characters as unknown optional attributes, and ignore them by default.
I'm not convinced that's a safe assumption. I mean, if you want to use that as a criterion for whether a user program is reasonable, fair enough. It's definitely what programs should do. But in terms of software that people actually use, I don't know that we can rely on that being true.
We have to remember that plenty of scientific software is written by people who are scientists first and coders second, and who may not follow best practices of software engineering. I will freely admit to being one of those people, and I have written lots of code that is very cavalier about things like checking inputs...
@larsbarring I translated your example into what I have proposed, it is a bit longer, attached is also an actual netCDF file of this. A thing that I might do is pull out the flag definitions into their own container variables and reference the common localizations from all the variables that use them the same way the quantization parameters container variable is meant to be referenced.
I have two concerns with the way your example encodes this information:
In "default:en-US _sv:sv _fr:fr _esmx:es-MX" ; you have no space between the suffix and the language tag, but in the proposal text above (copied here) "default: en-CA _fr: fr-CA _es: es-MX"; there is a space. Either there is no delimiter defined by us between the different key:value pairs, or we use the spaces in multiple contexts.
In my opinion, this type of encoding should not be used in any new concepts within CF. CF should avoid anything that requires the data readers/users/consumers to implement their own parsing to correctly interpret a string.
While the following example is long and verbose, I think that it is worth it because everything is mostly in data structures that are in some ways more "ready to go" in that no transformations need to occur. I also think that by including locale information within the netCDF file, we are explicitly making something that is not meant for humans to look at without the computer processing the locale data first. I don't ever look at all the .po or .mo files. For client software reading these netCDF files, I would expect the user to basically say "I want to look at this file, this is my locale"; the software then sets the localized attributes on the actual variables and removes all the container variables, since they are not needed anymore.
netcdf test_LB_b {
dimensions:
lat = 5 ;
lon = 1 ;
variables:
double lat(lat) ;
lat:standard_name = "latitude" ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
lat:axis = "Y" ;
double lon(lon) ;
lon:standard_name = "longitude" ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
lon:axis = "X" ;
float uas(lat, lon) ;
uas:_FillValue = 1.e+20f ;
uas:standard_name = "eastward_wind" ;
uas:long_name = "Zonal Surface Wind Speed" ;
uas:units = "m s-1" ;
uas:ancillary_variables = "uas_qc" ;
uas:localizations = "uas_locale1 uas_locale2 uas_locale3" ;
byte uas_qc(lat, lon) ;
uas_qc:_FillValue = -128b ;
uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
uas_qc:standard_name = "status_flag" ;
uas_qc:valid_range = 0b, 2b ;
uas_qc:flag_values = 0b, 1b, 2b ;
uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
uas_qc:localizations = "uas_qc_locale1 uas_qc_locale2 uas_qc_locale3" ;
double g_locale1 ;
g_locale1:title = "Detta är ett test" ;
g_locale1:locale = "sv" ;
double g_locale2 ;
g_locale2:title = "Ceci est un essai" ;
g_locale2:locale = "fr" ;
double g_locale3 ;
g_locale3:title = "Este es un ensayo" ;
g_locale3:locale = "es-MX" ;
double uas_locale1 ;
uas_locale1:long_name = "Zonal vindhastighet nära marken" ;
uas_locale1:locale = "sv" ;
double uas_locale2 ;
uas_locale2:long_name = "Vitesse du vent zonal en surface" ;
uas_locale2:locale = "fr" ;
double uas_locale3 ;
uas_locale3:long_name = "Velocidad de viento en superficie" ;
uas_locale3:locale = "es-MX" ;
double uas_qc_locale1 ;
uas_qc_locale1:long_name = "Data kvalitet hos zonal vindhastighet nära marken" ;
uas_qc_locale1:flag_meanings = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
uas_qc_locale1:locale = "sv" ;
double uas_qc_locale2 ;
uas_qc_locale2:long_name = "Qualité des données sur Vitesse du vent zonal en surface" ;
uas_qc_locale2:flag_meanings = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
uas_qc_locale2:locale = "fr" ;
double uas_qc_locale3 ;
uas_qc_locale3:long_name = "Calidad de datos de la velocidad de viento zonal en superficie" ;
uas_qc_locale3:flag_meanings = "calidad_buena sensor_no_funcional fuera_rango_válido" ;
uas_qc_locale3:locale = "es-MX" ;
// global attributes:
:Conventions = "CF-1.8" ;
:title = "This is a test" ;
:locale = "en-US" ;
:localizations = "g_locale1 g_locale2 g_locale3" ;
data:
lat = 0, 5, 10, 15, 20 ;
lon = 0 ;
uas =
1,
2,
4,
48,
160 ;
uas_qc =
0,
0,
0,
2,
1 ;
g_locale1 = _ ;
g_locale2 = _ ;
g_locale3 = _ ;
uas_locale1 = _ ;
uas_locale2 = _ ;
uas_locale3 = _ ;
uas_qc_locale1 = _ ;
uas_qc_locale2 = _ ;
uas_qc_locale3 = _ ;
}
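The client-side step described before this CDL (set the localized attributes on the actual variables, then drop the container variables) might be sketched as follows in Python. The nested dict layout and the name apply_locale are assumptions for this example, only part of the CDL above is reproduced, and locale matching is a plain case-insensitive comparison rather than full BCP 47 matching:

```python
def apply_locale(dataset, wanted):
    """Return a copy of `dataset` with attributes localized to `wanted`.

    dataset : {"global": {...}, "variables": {name: {attribute: value}}}
    """
    variables = dataset["variables"]
    container_names = set()

    def localize(attrs):
        out = {k: v for k, v in attrs.items() if k != "localizations"}
        for name in attrs.get("localizations", "").split():
            container = variables.get(name)
            if container is None:
                continue  # missing containers are ignored
            container_names.add(name)
            if container.get("locale", "").lower() == wanted.lower():
                out.update(container)  # overlay, including the locale attribute
        return out

    new_global = localize(dataset["global"])
    new_variables = {name: localize(attrs) for name, attrs in variables.items()}
    for name in container_names:
        new_variables.pop(name, None)  # containers are no longer needed
    return {"global": new_global, "variables": new_variables}


dataset = {
    "global": {
        "title": "This is a test",
        "locale": "en-US",
        "localizations": "g_locale1 g_locale2",
    },
    "variables": {
        "uas": {
            "long_name": "Zonal Surface Wind Speed",
            "units": "m s-1",
            "localizations": "uas_locale1 uas_locale2",
        },
        "g_locale1": {"title": "Detta är ett test", "locale": "sv"},
        "g_locale2": {"title": "Ceci est un essai", "locale": "fr"},
        "uas_locale1": {"long_name": "Zonal vindhastighet nära marken", "locale": "sv"},
        "uas_locale2": {"long_name": "Vitesse du vent zonal en surface", "locale": "fr"},
    },
}

swedish = apply_locale(dataset, "sv")
```

After this call only the uas variable remains, carrying the Swedish long_name and its unchanged units.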
Please note that @Dave-Allured has opened conventions issue 548 to delete the sentence, "ASCII period (.) and ASCII hyphen (-) are also allowed in attribute names only." in Sect 2.3. This sentence was inserted into the working version by conventions issue 477 for various reasons, including to support IETF BCP 47 language tags, as discussed in this issue.
If Dave's proposal is accepted, the characters allowed for attribute names will be the same as for variable names in CF 1.12, which is the same as in CF 1.11, the most recently reduced version. @ChrisJohnNOAA commented above that "allowing '.' and '-' in variable names is a very bad idea". Please add your support to conventions issue 548 if you agree with @Dave-Allured that they should not be allowed.
Dear Andrew @DocOtak, @larsbarring, @turnbullerin et al.
I agree with Andrew that using an algorithm to predict the name of an attribute would be unlike previous CF practice. Although we choose meaningful names for CF attributes, all those names are explicitly defined (in Appendix A and elsewhere). They have to be hard-coded in software, and in that sense they are treated as if they were arbitrary, like variable and dimension names are.
Furthermore, although we could make a convention with suffixes for attributes work in netCDF, it might not work in other formats CF data could be converted into. Another format might have different rules about characters allowed in names, or it might not even have names at all.
Therefore I prefer the container variable as demonstrated by @DocOtak, but I'd combine it with a "keyword: value" syntax like the one Erin @turnbullerin suggested. That is because this kind of attribute on the data variable tells you which container variable you want. Without it, you have to search them all to identify the right one, which isn't the general CF pattern. With this convention, the allowed keywords in the localizations attribute are any of the IETF BCP 47 language tags. The locale attribute is a data variable and global attribute, rather than a container variable attribute, so there are fewer attributes in total.
If I have understood correctly, @DocOtak, you don't like this kind of syntax. However, quite a lot of CF attributes use it. I don't think it's difficult to parse. It's a blank-separated list of words, some of which (the keywords) end in ":". There should be a space between ":" and the value. Perhaps we should clarify this in the convention with a general statement. However, it's not hard to repair the mistake, if you want to tolerate it. The values and keywords never contain any other ":", so a string substitution of ":" → ": " will convert the string into the correct format.
With this syntax, the example would be as below.
Best wishes
Jonathan
netcdf test_LB_b {
dimensions:
lat = 5 ;
lon = 1 ;
variables:
double lat(lat) ;
lat:standard_name = "latitude" ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
lat:axis = "Y" ;
double lon(lon) ;
lon:standard_name = "longitude" ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
lon:axis = "X" ;
float uas(lat, lon) ;
uas:_FillValue = 1.e+20f ;
uas:standard_name = "eastward_wind" ;
uas:long_name = "Zonal Surface Wind Speed" ;
uas:units = "m s-1" ;
uas:ancillary_variables = "uas_qc" ;
uas:locale = "en-US" ;
uas:localizations = "sv: uas_locale1 fr: uas_locale2 es-MX: uas_locale3" ;
byte uas_qc(lat, lon) ;
uas_qc:_FillValue = -128b ;
uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
uas_qc:standard_name = "status_flag" ;
uas_qc:valid_range = 0b, 2b ;
uas_qc:flag_values = 0b, 1b, 2b ;
uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
uas_qc:locale = "en-US" ;
uas_qc:localizations = "sv: uas_qc_locale1 fr: uas_qc_locale2 es-MX: uas_qc_locale3" ;
double g_locale1 ;
g_locale1:title = "Detta är ett test" ;
double g_locale2 ;
g_locale2:title = "Ceci est un essai" ;
double g_locale3 ;
g_locale3:title = "Este es un ensayo" ;
double uas_locale1 ;
uas_locale1:long_name = "Zonal vindhastighet nära marken" ;
double uas_locale2 ;
uas_locale2:long_name = "Vitesse du vent zonal en surface" ;
double uas_locale3 ;
uas_locale3:long_name = "Velocidad de viento en superficie" ;
double uas_qc_locale1 ;
uas_qc_locale1:long_name = "Data kvalitet hos zonal vindhastighet nära marken" ;
uas_qc_locale1:flag_meanings = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
double uas_qc_locale2 ;
uas_qc_locale2:long_name = "Qualité des données sur Vitesse du vent zonal en surface" ;
uas_qc_locale2:flag_meanings = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
double uas_qc_locale3 ;
uas_qc_locale3:long_name = "Calidad de datos de la velocidad de viento zonal en superficie" ;
uas_qc_locale3:flag_meanings = "calidad_buena sensor_no_funcional fuera_rango_válido" ;
// global attributes:
:Conventions = "CF-1.8" ;
:title = "This is a test" ;
:locale = "en-US" ;
:localizations = "sv: g_locale1 fr: g_locale2 es-MX: g_locale3" ;
data:
lat = 0, 5, 10, 15, 20 ;
lon = 0 ;
uas =
1,
2,
4,
48,
160 ;
uas_qc =
0,
0,
0,
2,
1 ;
// container variables, contents immaterial:
g_locale1 = _ ;
g_locale2 = _ ;
g_locale3 = _ ;
uas_locale1 = _ ;
uas_locale2 = _ ;
uas_locale3 = _ ;
uas_qc_locale1 = _ ;
uas_qc_locale2 = _ ;
uas_qc_locale3 = _ ;
}
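As a rough illustration of the parsing described above, the "keyword: value" form of the localizations attribute can be read with a few lines of Python. This is a sketch, not a proposed reference parser. The function name parse_localizations is hypothetical, and the repair step is the ":" → ": " substitution mentioned earlier.

```python
def parse_localizations(text):
    """Parse 'tag: variable tag: variable ...' into a dict of tag -> variable."""
    # Tolerate a missing blank after ":" by rewriting every ":" as ": ".
    # This is safe only because keywords and values contain no other ":".
    words = text.replace(":", ": ").split()
    if len(words) % 2 != 0:
        raise ValueError("expected alternating keywords and values: %r" % text)
    mapping = {}
    for keyword, value in zip(words[0::2], words[1::2]):
        if not keyword.endswith(":") or value.endswith(":"):
            raise ValueError("malformed keyword/value pair in %r" % text)
        mapping[keyword[:-1]] = value  # drop the trailing ":" from the keyword
    return mapping


well_formed = parse_localizations(
    "sv: uas_locale1 fr: uas_locale2 es-MX: uas_locale3"
)
# The second pair is missing its blank; the substitution repairs it.
repaired = parse_localizations(
    "sv: uas_locale1 fr:uas_locale2 es-MX: uas_locale3"
)
```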
@JonathanGregory I strongly disagree that needing to search the attributes of referenced variables is not a general CF pattern. It is very pervasive, and probably even a fundamental CF pattern, used prominently in the ancillary_variables and coordinates attributes. Ancillary variables contain status flags, uncertainties (e.g. standard error), etc. There are 19 standard names that say you need to use ancillary_variables to figure out the linkage. Things like angles, wavelengths and vertical extents are all referenced in coordinates; there are so many examples of where coordinates are used that I couldn't figure out how to get an accurate count by searching the standard name table (it's in the hundreds). For all of these, you won't know what the referenced variable contains until you read its attributes.
In your specific example, I'm not sure I like the lack of locale on the localization variables themselves. Without the context of the referencing variables, what locale should I assume they are? Would it be the global locale attribute, which is "en-US" in this case? Continuing with that line of thinking, I don't think the repeated "en-US" locale on the uas and uas_qc variables is necessary.
Re the key: value pattern, you are correct that I really don't like it, and I find issue not with the "ease" of parsing, but that we are asking someone to write their own parser at all:
key: value pairs.
The most recent addition of something that looks like key: value is the units metadata, which avoids the problems by defining three exact strings that only look like key: value pairs.
I feel somewhat strongly that if CF wants to continue to use its own bespoke syntax for these string attributes, it needs to define the grammar of them formally in some way, e.g. using EBNF/ISO 14977. I don't think that a regex would be acceptable here either.
Good morning, @DocOtak
Regarding your comment that the need to search the attributes of referenced variables is a pervasive CF pattern, whereas I said that it isn't CF-like: I'm sorry that I didn't consider this remark more carefully! You're right that there are cases where an attribute names several variables (such as ancillary_variables and coordinates, as you say) and you have to inspect them to find out which is which.
I was thinking instead of the situations where an attribute identifies variables by their purpose, such as formula_terms and cell_measures, which use the "keyword: value" syntax. Where we can do it, this method seems more convenient to me, because it's easier to find what you need.
In my version of the localization example, you have the data variable uas, which has a long_name, and the locale attribute tells you the long_name is English. If you want a French version, you inspect the localizations attribute to see if it has a fr: keyword. You find that it does, and the value uas_locale2 names the variable which contains the long_name attribute in French. On the other hand, you can see there is no de: keyword in the localizations attribute, so without inspecting any variables you know that the long_name doesn't have a German version. I think that's convenient.
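The lookup just described might be sketched in Python as follows. This is only an illustration under stated assumptions: the attributes are held in plain dicts rather than read from a file, and the function name localized_attribute is hypothetical.

```python
# Hypothetical in-memory copy of the relevant attributes from the example.
variables = {
    "uas": {
        "long_name": "Zonal Surface Wind Speed",
        "locale": "en-US",
        "localizations": "sv: uas_locale1 fr: uas_locale2 es-MX: uas_locale3",
    },
    "uas_locale1": {"long_name": "Zonal vindhastighet nära marken"},
    "uas_locale2": {"long_name": "Vitesse du vent zonal en surface"},
    "uas_locale3": {"long_name": "Velocidad de viento en superficie"},
}


def localized_attribute(variables, var_name, attr_name, tag):
    """Return attr_name of var_name in the locale given by tag, or None."""
    attrs = variables[var_name]
    if attrs.get("locale") == tag:
        return attrs.get(attr_name)  # the data variable itself is in this locale
    words = attrs.get("localizations", "").split()
    # Keywords end in ":"; pair each keyword with the value that follows it.
    mapping = {k[:-1]: v for k, v in zip(words[0::2], words[1::2])}
    container = mapping.get(tag)
    if container is None:
        return None  # known without inspecting any container variable
    return variables[container].get(attr_name)


french = localized_attribute(variables, "uas", "long_name", "fr")
german = localized_attribute(variables, "uas", "long_name", "de")
```

Here the missing German version is detected from the keywords alone, which is the convenience argued for above.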
My version is no more than a rearrangement of yours. I have replaced the locale attributes of the container variables with the keywords of the localizations attribute of the referencing variable - that's all.
I don't think the container variables need a locale, because they are subsidiary variables, the way I understand it. They are adjuncts to the referencing variable. They host alternative versions of some of its attributes. I regard this as like boundary variables, which are subsidiary to coordinate variables, and therefore they don't need metadata of their own in general. They're adequately described by the variable which references them.
I agree with you that we don't need the locale attribute of the data variables uas and uas_qc if we say that the file attribute locale supplies a default locale for all data variables, as well as the locale for any other file attributes. That would be even simpler and better, and I would prefer it.
Best wishes
Jonathan
Dear Andrew @DocOtak
You made a good point that "attributes are key-value pairs". We use that idea for various kinds of container variable, such as grid mapping. Here's a modified version of my previous example (itself a modified version of yours), in which I use a "supercontainer" variable, instead of an attribute containing key-value pairs, to point to the localized metadata containers. The supercontainer may have any attribute name which is a legal language tag, and no other attributes.
Do you prefer this? I have also assumed that the file locale attribute is a default for data variables, as discussed above.
Best wishes
Jonathan
netcdf test_LB_b {
dimensions:
lat = 5 ;
lon = 1 ;
variables:
double lat(lat) ;
lat:standard_name = "latitude" ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
lat:axis = "Y" ;
double lon(lon) ;
lon:standard_name = "longitude" ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
lon:axis = "X" ;
float uas(lat, lon) ;
uas:_FillValue = 1.e+20f ;
uas:standard_name = "eastward_wind" ;
uas:long_name = "Zonal Surface Wind Speed" ;
uas:units = "m s-1" ;
uas:ancillary_variables = "uas_qc" ;
uas:localizations = "uas_localizations";
float uas_localizations;
uas_localizations:sv="uas_locale1";
uas_localizations:fr="uas_locale2";
uas_localizations:es-MX="uas_locale3" ;
byte uas_qc(lat, lon) ;
uas_qc:_FillValue = -128b ;
uas_qc:long_name = "Data quality of Zonal Surface Wind Speed" ;
uas_qc:standard_name = "status_flag" ;
uas_qc:valid_range = 0b, 2b ;
uas_qc:flag_values = 0b, 1b, 2b ;
uas_qc:flag_meanings = "quality_good sensor_nonfunctional outside_valid_range" ;
uas_qc:localizations = "uas_qc_localizations";
float uas_qc_localizations;
uas_qc_localizations:sv="uas_qc_locale1";
uas_qc_localizations:fr="uas_qc_locale2";
uas_qc_localizations:es-MX="uas_qc_locale3" ;
double uas_locale1 ;
uas_locale1:long_name = "Zonal vindhastighet nära marken" ;
double uas_locale2 ;
uas_locale2:long_name = "Vitesse du vent zonal en surface" ;
double uas_locale3 ;
uas_locale3:long_name = "Velocidad de viento en superficie" ;
double uas_qc_locale1 ;
uas_qc_locale1:long_name = "Data kvalitet hos zonal vindhastighet nära marken" ;
uas_qc_locale1:flag_meanings = "kvalitet_godkänd ickefungerande_sensor utanför_godkänt_intervall" ;
double uas_qc_locale2 ;
uas_qc_locale2:long_name = "Qualité des données sur Vitesse du vent zonal en surface" ;
uas_qc_locale2:flag_meanings = "qualité_bonne capteur_non_fonctionnel plage_valide_extérieure" ;
double uas_qc_locale3 ;
uas_qc_locale3:long_name = "Calidad de datos de la velocidad de viento zonal en superficie" ;
uas_qc_locale3:flag_meanings = "calidad_buena sensor_no_funcional fuera_rango_válido" ;
float localizations;
localizations:sv="g_locale1";
localizations:fr="g_locale2";
localizations:es-MX="g_locale3" ;
double g_locale1 ;
g_locale1:title = "Detta är ett test" ;
double g_locale2 ;
g_locale2:title = "Ceci est un essai" ;
double g_locale3 ;
g_locale3:title = "Este es un ensayo" ;
// global attributes:
:Conventions = "CF-1.8" ;
:title = "This is a test" ;
:locale = "en-US" ;
:localizations = "localizations";
data:
lat = 0, 5, 10, 15, 20 ;
lon = 0 ;
uas =
1,
2,
4,
48,
160 ;
uas_qc =
0,
0,
0,
2,
1 ;
// container variables, contents immaterial:
g_locale1 = _ ;
g_locale2 = _ ;
g_locale3 = _ ;
uas_locale1 = _ ;
uas_locale2 = _ ;
uas_locale3 = _ ;
uas_qc_locale1 = _ ;
uas_qc_locale2 = _ ;
uas_qc_locale3 = _ ;
}
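As a sketch of how a reader might resolve a translation with the supercontainer layout above: each language tag is an ordinary attribute of the supercontainer, so no string parsing is needed. The Python below assumes attributes held in plain dicts, and the function name resolve_via_supercontainer is hypothetical.

```python
# Hypothetical in-memory copy of the relevant attributes from the example.
global_attrs = {"title": "This is a test", "locale": "en-US"}
variables = {
    "uas": {
        "long_name": "Zonal Surface Wind Speed",
        "localizations": "uas_localizations",
    },
    "uas_localizations": {
        "sv": "uas_locale1",
        "fr": "uas_locale2",
        "es-MX": "uas_locale3",
    },
    "uas_locale1": {"long_name": "Zonal vindhastighet nära marken"},
    "uas_locale2": {"long_name": "Vitesse du vent zonal en surface"},
    "uas_locale3": {"long_name": "Velocidad de viento en superficie"},
}


def resolve_via_supercontainer(variables, default_locale, var_name, attr_name, tag):
    """Follow data variable -> supercontainer -> container for the given tag."""
    attrs = variables[var_name]
    if tag == default_locale:
        return attrs.get(attr_name)  # the file locale is the default
    supercontainer = attrs.get("localizations")
    if supercontainer is None:
        return None  # this variable has no localized metadata
    # Each attribute name of the supercontainer is a language tag.
    container = variables.get(supercontainer, {}).get(tag)
    if container is None:
        return None
    return variables[container].get(attr_name)


spanish = resolve_via_supercontainer(
    variables, global_attrs["locale"], "uas", "long_name", "es-MX"
)
```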
Moderator
TBD
Moderator Status Review
None
Requirement Summary
Metadata includes natural language text in several places, notably the title and long_name attributes, as well as potentially in character data variables. Other metadata standards, such as ISO-19115, support the translation of these variables, and translation is mandatory in some places, such as in files generated by the Canadian Government. By standardizing how these elements are specified in a fashion that is both human-readable and machine-readable, users can identify metadata in their preferred language more easily, and computer applications can display metadata to match users' preferences or, where that is not possible, at least present it using appropriate accessibility techniques. Of key importance is also compatibility with applications such as ERDDAP, which uses NetCDF files following the CF conventions to create a web interface to select and download data. For this reason, we decided not to use the new .fr-CA suffix as a required format, as it would not be compatible with ERDDAP; instead, data providers are free to choose suffixes that meet their use case.
Technical Proposal Summary
Based on discussions in https://github.com/cf-convention/discuss/issues/244, the following proposal seemed acceptable: (1) the creation of a new global attribute that maps suffixes to BCP 47 language tags as well as specifying the default language tag in the file, (2) designating that any attribute or data variable with such a suffix is a localized version of the text in the non-suffixed attribute or variable.
Benefits
Data producers who are required to produce metadata or data in multiple languages, applications that offer multilingual interfaces for viewing or manipulating NetCDF data based on the CF standards, data users who wish to access metadata in the language of their choice
Status Quo
Currently no NetCDF standard offers a standard for localized metadata. However, such standards exist in other metadata formats, such as ISO-19115.
Associated pull request
Not present yet
Detailed Proposal
The addition of a new section to the CF conventions that specifies the following:
a global attribute localizations, which will be a space-separated list of paired suffixes and BCP 47 language tags (similar to how cell_methods is formatted), for example: localizations = "default: en-US _fr: fr-CA _es: es-MX";
the keyword default, used instead of a suffix to indicate the default locale of the document
if the localizations attribute is present, attributes and variables may not be named with a suffix except to indicate localized versions of non-suffixed attributes or variables
In addition, the following changes would be proposed:
title, comment, institution, long_name, references, sources (and, to be discussed, history and flag_meanings)
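To make the suffix mechanism in this proposal concrete, here is a small Python sketch of how a reader might interpret the localizations string and derive the name of a localized attribute. It is an illustration only: the function names parse_suffix_map and localized_name are hypothetical, and the attribute name title_fr is simply what the _fr suffix would produce for title.

```python
def parse_suffix_map(text):
    """Split 'default: en-US _fr: fr-CA ...' into (default_tag, {suffix: tag})."""
    words = text.split()
    # Keywords end in ":"; pair each keyword with the value that follows it.
    pairs = {k[:-1]: v for k, v in zip(words[0::2], words[1::2])}
    default_tag = pairs.pop("default", None)
    return default_tag, pairs


def localized_name(base_name, tag, default_tag, suffixes):
    """Return the attribute or variable name holding base_name in locale tag."""
    if tag == default_tag:
        return base_name  # non-suffixed names are in the default locale
    for suffix, suffix_tag in suffixes.items():
        if suffix_tag == tag:
            return base_name + suffix
    return None  # the file declares no such locale


default_tag, suffixes = parse_suffix_map("default: en-US _fr: fr-CA _es: es-MX")
french_title_name = localized_name("title", "fr-CA", default_tag, suffixes)
```

A reader would then look up the attribute named by french_title_name and fall back to the non-suffixed attribute if it is absent.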