Support localized metadata on ERDDAP HTML pages


ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).

Creative Commons Zero v1.0 Universal


Support localized metadata on ERDDAP HTML pages #114

Open turnbullerin opened 1 year ago

turnbullerin commented 1 year ago

Hello fellow ERDDAP folks!

Recently, I've been championing an initiative over with the CF folks on getting a standard for localized metadata into CF (see https://github.com/cf-convention/discuss/issues/244) which has been paired with a discussion on expanding attribute and variable names to allow for a full (or greatly expanded) Unicode character set (https://github.com/cf-convention/cf-conventions/issues/237). Of note, the latter discussion is simply opening the CF conventions to use what NetCDF already allows, as NetCDF files are allowed to have attributes with any Unicode characters already.

As I also work on our ERDDAP server a lot, I wanted to draw your attention to these discussions because I noted that the current ERDDAP configuration only allows attribute names containing [A-Za-z0-9_] characters (I get a RuntimeException: [variable] isn't variableNameSafe when I put square brackets in for example). I recognize ERDDAP doesn't only work with NetCDF files and so there may be other restrictions than what NetCDF/CF will allow, but with the CF conventions moving towards allowing a full Unicode set (and ERDDAP's metadata is based on what CF/ACDD define) I thought it would be worth having a discussion on expanding the character set allowed in attribute and variable names and that some of you folks might want to weigh in on the CF discussion before it is finalized.

Part of why I have been championing that work is that I would love to see ERDDAP able to take localized metadata from a dataset and integrate it into the translation mechanism. Right now, there isn't a way to display a French title for a dataset when browsing the website in French (something that Canadian laws require for us to be able to use ERDDAP at the federal government level). I've made a hacky solution in Javascript that got me past the requirement, but having a proper internationalization solution for datasets in ERDDAP would be highly useful for me and probably others. I see the CF work as setting the foundation for this by defining a standard for encoding the different titles and such into the files themselves and I hope ERDDAP will pick that up in a future release (and would be happy to contribute myself to it).

MathewBiddle commented 1 year ago

From my experience the main hangup was downloading the .mat file for Matlab. Matlab has very specific requirements for variable names that make this a difficult ask. See https://www.mathworks.com/help/matlab/matlab_prog/variable-names.html

turnbullerin commented 1 year ago

Oh MATLAB :(.

That said, do MATLAB files have attributes? Could we relax the restrictions on attribute names and source names while keeping them on destination names and enforcing that if a source name contains an invalid character, a destination name must be provided? Or maybe automatically create one by removing invalid characters or replacing them with "_"?

turnbullerin commented 1 year ago

I did some research on other file formats, here's what I found:

DAP2: allows [0-9A-Za-z_!\~*'"-] and other US-ASCII if URL-escaped; Special Characters: =<>!+-/\*~%.[]

DAP4: UTF-8 characters (escaped if not US-ASCII); Special characters: /

HDF5: UTF-8 supported, ASCII default

ASCII, CSV, TSV: character-encoding dependent but all valid characters allowed (with proper escaping)

KML: depends on coding, <& and either ' or " must be escaped and non-printable control characters and compatibility characters are discouraged: https://www.w3.org/TR/xml/#NT-Char

ESRI: Strongly recommended [A-Za-z0-9_-], explicitly not allowed: +*/!^%()[]{},~'":;><&|\=@#$

So ESRI CSV might also be an issue with the variable names but it is solved by using a similar approach to MATLAB

EDIT: Fixed escaping of formatting characters

MathewBiddle commented 1 year ago

Or maybe automatically create one by removing invalid characters or replacing them with "_"?

I think GenerateDatasetsXml has some similar logic in it. But, we're reaching the capacity of my knowledge.

https://github.com/ERDDAP/erddap/blob/468e2b85d2c2484024f1418619f35bbe01b27a94/WEB-INF/classes/com/cohort/util/ScriptString2.java#L704

BobSimons commented 1 year ago

Wow! There are a couple of big topics there! I'll try to deal with them separately below...

First: I am basically in support of Unicode/localization. The question is how to get what you want and how to make the usage clear to users.

Unicode Attribute Values - Note that ERDDAP already supports Unicode attribute values as much as it can (e.g., some outgoing file types don't support Unicode).

Unicode Attribute Names (Identifiers) - What you are asking for, notably in your example, is not just Unicode letters, but Unicode punctuation. Yes, nc4 files support this, but that is a special case. I said to Unidata at the time that I thought this was a bad idea. They may get away with it in nc4 files because the special characters have no meaning in the file (although what if there is a slash in an attribute name which is in a group?). The problem is that things get super complicated and cause problems when you allow punctuation and when you go outside of nc4 files. The main question is: what characters have special meanings in which situations? For example, if you want colons to indicate a namespace prefix is being used, then how do you deal with names that don't have a namespace but do have a colon in the name? Or, if you want slashes to be separators for groups (which is allowed in CF now), how do you deal with slashes in a group name? And you can probably imagine that commas, spaces, newlines, #0, tabs, and undefined characters will cause problems in various situations (CSV files, TSV files, Matlab files, etc.). As soon as you say all Unicode characters are allowed (maybe with specific exemptions like colon and slash, comma, newline, ...), then you are saying there can never be any addition to the special characters in the future, because there will be names in existing datasets which already use those special characters. That would be bad. That said, I am much more open to allowing some subset of Unicode characters corresponding to characters that are letters (or ideograms which are words). That would let you have, e.g., French letters or Chinese words in names, but not punctuation. But Unicode is huge. There would have to be some simple, standard way for CF and software like ERDDAP to easily identify these valid characters. Is there such a standard? I am basically fine with that compromise if problems can be worked out (can OPeNDAP be made to handle this??? I think not. I made suggestions for a DAP 2.1 (e.g., Unicode attribute values, long ints, and unsigned integer types) but was firmly told "no!"). (Note that it is a big project because of all the situations where ERDDAP publishes metadata, e.g., web pages, file formats (e.g., .das), other software (e.g., DAP libraries), and outgoing file types, many of which don't support Unicode or punctuation.) But it may be possible (but maybe it isn't). What do you think?

I'll add to the above: as soon as you allow punctuation characters in names/identifiers, you open up security concerns and it is very difficult to foresee all of them (it was beyond the capabilities of all of the computer security people who got it wrong for so many years). Things which seem so simple (e.g., that all you need to percent-encode in a URL are a few special characters (&"#')) can be horrible security problems. There are reasons why identifiers in computer languages have strict requirements for valid characters (e.g., _a-zA-Z).

Localization (text appearing in different languages in different situations, notably on lang=FR web pages or for data requests which specify a language) - This is a huge/complicated project. What if the requested language isn't available? What if the requested file format doesn't support Unicode? And for standardized attribute names (title, summary, infoUrl, etc), there would need to be official translations of each of the names for each possible language. You'll have to get the standard organizations to do that (good luck!) and even then it just makes things very complicated (or not work), e.g., for software like ERDDAP that looks for those standardized names (e.g., title, summary, etc) to extract specific information about the dataset. Eek! That is a messy, difficult/impossible project. Let me suggest an alternative that you can do right now in ERDDAP: make variants of datasets (one for each language, e.g., MUR41_en, MUR41_fr) which use the standard attribute names, but have translated versions of the attribute values. Note that ERDDAP's datasets.xml lets you redefine all of the metadata used by a dataset, so you can use the same underlying data source (e.g., files), but clearly and simply change all of the metadata values for each variant. (Even better, if you use dataset type=EDDGridFromDap or EDDTableFromDap for all of the language variants, then ERDDAP will handle this very efficiently because it only needs to, e.g., read the data files, for the original dataset.) Then a user can make a request to MUR41_fr and they will get the French version of the metadata and the metadata will be CF and ACDD compliant (i.e., with the English attribute names). And users can easily find out which language variants exist by the existence/absence of a dataset with the appropriate name. This requires no changes to ERDDAP, CF, ACDD, or any other standard or file type. And it is super clear to users what language they can use. And it is super clear when a user says "I worked with MUR41_fr" that they worked with the French version of the dataset. This gets you 95% of what you want (just no translated attribute names). I think this is a vastly better approach than trying to get CF, ACDD, other software, etc to support official translations of the defined attribute names, and trying to change ERDDAP to support different languages in different situations. And you can do it right now. Your thoughts?

I hope that addresses all of the big topics. If I missed something or you want to redirect me, please let me know.

BobSimons commented 1 year ago

Erin, regarding your subsequent comments:

Yes. As you note, Matlab isn't the only troublesome outgoing file type. This is a troublesome problem because so many file types, standards, software, etc. were created before Unicode was widely supported. We can change ERDDAP (perhaps at great cost), but I/Chris can't change all the other standards, file types, software, etc. I had already made the changes that could be done with moderate effort and acceptable consequences (i.e., allowing Unicode in attribute values).

ERDDAP already allows any Unicode character in source attribute names. It is just destination attribute names which have the stricter requirements. GenerateDatasetsXml already has code to check for not-allowed characters and automatically generate a destination attribute name which is valid. Note that further changes are a very complicated issue because the methods that make identifiers "safe" are used in lots of situations in the code, so simply changing those methods would lead to all kinds of problems. You have to make changes in a way that only affects what you want to change and you have to know/understand/check all of the ramifications (e.g., on all web pages and outgoing file types).

The best solutions (I think) are the ones I proposed in my first email. Well, in some ways, the best solution is no standards or software changes at all. You can do much of what you want (e.g., full Unicode support in attribute values, and localization via different datasetIDs) right now with no changes to any standards or software.

Regarding localized names/identifiers in general: I'll point out that identifier names in e.g., ISO 19115/19139, only exist in one language. That is true of all other XML schemas that I know of. Further, there are strong limitations on the characters allowed in identifiers in XML in general (I'm pretty sure). The same with computer languages (there isn't a French version of C++, Java, Python, or any computer language). Yes, it is a different issue (or at least a different realm), but the point is the same. Sometimes it is best to just pick a language for a task (e.g., names for CF attributes) and stick with it. (I know, easy for me, an English speaker to say.) But a fully localized world (3000 language variants of all software, software languages, standards, etc) just isn't feasible. It is a slippery slope as soon as you allow a second language.

BobSimons commented 1 year ago

I'll add another compromise to consider vs allowing all Unicode characters in attribute identifiers/names: allow all of the letter characters between 128 and 255 in ISO-8859-1, which is the single byte character set that has all of the European accented characters (and some other characters) in positions 128-255. Those characters (and their numbers) are consistent with the first code page (0 - 255) of Unicode. Several places in ERDDAP use this encoding when the original specification says "ASCII" (which just defines characters 0-127) because it rarely causes problems and allows support for all of the European languages. It is obviously an imperfect solution (it doesn't support all of the languages which use other characters), but it is an easier-to-implement solution which provides some benefit (e.g., it would be easy to identify all of the allowed characters in this range and document them in the ERDDAP documentation).
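To make the size of that compromise concrete, here is a minimal Python sketch that enumerates the candidate characters. It assumes that "letter" means whatever Python's `str.isalpha()` reports, which is an assumption of this example and not an ERDDAP or CF definition; the function name is hypothetical.

```python
def latin1_letters():
    """Return the ISO-8859-1 characters in positions 128-255 that are letters.

    ISO-8859-1 positions coincide with Unicode code points 0-255, so chr()
    can be used directly. "Letter" here means str.isalpha(), an assumption
    made for this sketch only.
    """
    return [chr(cp) for cp in range(128, 256) if chr(cp).isalpha()]


if __name__ == "__main__":
    letters = latin1_letters()
    # Print the candidate set so it could be pasted into documentation.
    print("".join(letters))
```

Characters such as the multiplication sign (position 215) and the control codes in 128-159 are excluded by this test, while accented letters such as é are kept.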

rmendels commented 1 year ago

I'll have to understand more about Unicode to fully follow this, but for R variable names:

A variable name must start with a letter and can be a combination of letters, digits, period(.) and underscore(_). If it starts with period(.), it cannot be followed by a digit.
A variable name cannot start with a number or underscore (_)
Variable names are case-sensitive (age, Age and AGE are three different variables)
Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
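The rules listed above can be sketched as a small Python check. This is an illustration only: it assumes ASCII letters (R's real definition of a letter is locale dependent), the reserved-word set is a partial, illustrative list, and the function name is hypothetical.

```python
import re

# Partial, illustrative list of R reserved words (not authoritative).
R_RESERVED = {
    "if", "else", "repeat", "while", "function", "for", "in", "next", "break",
    "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA",
}

# Start with a letter, or with a period that is not followed by a digit;
# continue with letters, digits, periods, or underscores.
_R_NAME = re.compile(r"(?:[A-Za-z]|\.(?![0-9]))[A-Za-z0-9._]*")


def is_valid_r_name(name):
    """Return True if name satisfies the rules above (ASCII letters assumed)."""
    return _R_NAME.fullmatch(name) is not None and name not in R_RESERVED


if __name__ == "__main__":
    for candidate in ["age", "Age", ".hidden", ".2way", "_age", "temp[0]", "TRUE"]:
        print(candidate, is_valid_r_name(candidate))
```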

So does this mean the proposal would break in R? What about other languages? I thought I saw that it would almost work in Python, but almost sort of doesn't do it when a user can't read a file. How about Javascript?

I would add that I fully understand the underlying rationale for the proposal. Besides the work that would be involved as well as the security concerns Bob raises, I would want a better understanding of what would break on the user end of things.

BobSimons commented 1 year ago

Roy, you bring up a few slightly different issues.

I was talking about attribute names. You and Erin are also talking about variable names. The ideas are basically the same, but variable names are more likely to become identifiers in R, Python, and other programming languages when a chunk of data is read in.

You are correct that R allows "letters" to be in variable names, but the definition of "what is a letter" in R varies based on locale(!). The documentation says "The definition of a letter depends on the current locale: the precise set of characters allowed is given by the C expression (isalnum(c) || c == ‘.’ || c == ‘_’) and will include accented letters in many Western European locales." (Talk about not very helpful! This means you can't know which letters are allowed ahead of time unless you know the locale of the user's R installation!) But this is like I was saying: if you want to allow all Unicode letters (and some other characters like ideograms that are words), you have to have some clear definition of which characters you are allowing.

This highlights another complication of this proposed project: ERDDAP may know which Unicode characters are allowed in a given file type (and thus could perhaps sanitize the name), but ERDDAP won't know that a given file is going to be processed later by some language (like R or Python or ...), doesn't know when a given attribute or variable name is going to be auto-converted into an identifier in that language, and so doesn't know which characters will be valid there. That's why the letters a-zA-Z is the only safe set of characters: they are (I think) always supported (there are probably exceptions). Even the ISO-8859-1 letters will cause problems in some places (so I probably shouldn't have proposed that as an option). Punctuation characters (like the square brackets that Erin wanted) are almost always trouble (e.g., '[' and ']' are interpreted as identifying array subscripts in almost all computer languages).

I should also have said earlier: there are many versions of Unicode. Java supports one of the versions of the UCS-2 (a group of characters identified by 2-bytes). So a proposal to CF would need to include a definition of "Unicode". But as I've said, I think allowing all Unicode characters (as netcdf does for nc4) is a bad idea.

I'll add/emphasize: these issues are generally not a problem for attribute values, which get read into R, Python, etc as Strings. Most computer languages now support at least UCS-2 versions of Unicode as valid characters in Strings. And punctuation characters don't cause problems in Strings. Thus attribute values using Unicode characters, which are already supported by ERDDAP, are generally not trouble when the file is used subsequently. There are exceptions, e.g., DAP, but ERDDAP generally deals with them by converting the string to the ASCII version of the string with characters 128+ rendered as \u plus their 4-digit hexadecimal Unicode number as they would in Json, e.g., the Euro character appears as \u20ac.
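The \u conversion described above can be sketched in a few lines of Python. This is not ERDDAP's actual code; the function name is hypothetical, and the handling of characters beyond the 4-digit range (as a surrogate pair, the way JSON does it) is an assumption of this example.

```python
def to_ascii_with_unicode_escapes(text):
    """Render characters 128+ as \\u plus 4 hexadecimal digits, as in JSON."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            # Assumption: encode as a UTF-16 surrogate pair, as JSON does.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii_with_unicode_escapes("Price: 5 \u20ac"))  # Price: 5 \u20ac
```

The sketch only covers the 128+ case; escaping of backslashes, quotes, and control characters is left out.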

As I said, the current Unicode support in ERDDAP gets you most of what you want. Additional Unicode support (e.g., in attribute and variable names) leads to lots of difficult/intractable problems.

rmendels commented 1 year ago

Thanks. I should have made clear that the proposal at the present ongoing CF meeting was for variable names also. I haven't thought it through, but it is one thing when you open a file on your desktop that has variable names with all the possible Unicode characters, another when that is put into a URL. The rationale I believe was to have localized versions of the conventions.

I should also add the obvious that an enforced subset of CF is still CF compliant, and ERDDAP has more localizations than most things (how many other web pages have drop down menus where you can change the language of the pages?). So a lot will depend on whether CF requires that support in variable names as opposed to allowing it.

I do like your suggestion for the French and English versions of datasets. You take the existing xml snippet, change the datasetID, use xml to change the variable names and attributes, et voila you have French and English versions.

J'aime votre suggestion pour les versions française et anglaise des ensembles de données, vous prenez l'extrait XML existant, modifiez le datasetid, utilisez XML pour modifier les noms et attributs des variables, et voilà, vous avez des versions française et anglaise.

I hope you like what I did there.

BobSimons commented 1 year ago

I hope CF doesn't go with full Unicode support for attribute and variable names. It is the kind of proposal that looks good in isolation (e.g., a cdm representation of a dataset's metadata) but is terrible in practice. It would lead to all kinds of problems in the future (e.g., prevent various punctuation from being used as special characters, e.g., colons for namespaces), and cause problems when the data file is imported into some client software where variable names (with square brackets?!) become identifiers. Identifiers/names have traditionally been limited to very small character sets (e.g., letters and underscores) for good reasons.

Getting involved in a CF proposal is a full time job with endless bickering. I found it hellish and won't get involved again. I hope someone else advocates against it. (Sorry Erin)

turnbullerin commented 1 year ago

Some notes:

@BobSimons I agree that there are a lot of issues with special characters. I know DAP 4 has gone for full UTF-8 support with the exception of a slash, but DAP 2 is still on US-ASCII (but allows any US-ASCII character as they can be URL-encoded which is also how DAP 4 will resolve ambiguities in what is a special URL character and what is part of a variable or attribute name). After looking at the output files that ERDDAP currently generates, the MATLAB file, ESRI CSV, and DAP protocols seem to be the biggest limiting factors (and would bring us down to what the current standard is which aligns with MATLAB and ESRI) for variable names and I don't see those changing any time soon. That said, attribute names aren't included in MATLAB or ESRI files as far as I can tell, so I think there is more flexibility there.

Based on the idea that ERDDAP is first and foremost a DAP2 server (with bonus features), I think my proposal would be to start by moving to a full US-ASCII character set (according to DAP2's definition) for variable names and attribute names, with some notes:

Identify and restrict certain characters that could cause issues or confusion. At a minimum, I would suggest 0x00 through 0x1F and 0x7F (the non-printable control characters) be disallowed. In addition, I would consider the set of common math and logic operators !+-*/\<>=, HTML special characters &%?#:, the space, square brackets [], quote ", apostrophe ' (and backtick?) carefully. While allowed, I think they might present complications in escaping but we should balance that against clear naming conventions for things like chemical formulas. I would personally eliminate the space and double-quote at least and analyze what the impact of allowing the others in attribute names and variable names would be in each file type ERDDAP can output. We can then decide if it is worth further restricting the character set.
We would then have to make sure each file output has proper escaping for variable and attribute names and (where necessary) a mechanism to convert invalid names to valid names (notably for ESRI and MATLAB). We would need to replace the characters (with underscores?) and then ensure the no-leading-underscore rule is met and that there aren't duplicates.
We should also add a note to the documentation that we recommend using [A-Za-z0-9_] for full compatibility with all output formats and variable/attribute names may be modified in some formats if this is deviated from.
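The conversion step in the list above (replace invalid characters, enforce the no-leading-underscore rule, avoid duplicates) could look roughly like the following Python sketch. It is not ERDDAP's or GenerateDatasetsXml's logic; the function name, the "x" prefix, and the numeric suffix scheme are all hypothetical choices made for illustration.

```python
import re

# Anything outside the recommended [A-Za-z0-9_] set gets replaced.
_UNSAFE = re.compile(r"[^A-Za-z0-9_]")


def make_safe_names(source_names):
    """Map each source name to a unique name using only [A-Za-z0-9_]."""
    mapping = {}
    used = set()
    for name in source_names:
        safe = _UNSAFE.sub("_", name)
        # Hypothetical rule: names must start with a letter, so prefix "x"
        # when the result is empty or starts with an underscore or digit.
        if not safe or not safe[0].isalpha():
            safe = "x" + safe
        # Resolve collisions by appending _2, _3, ... (hypothetical scheme).
        candidate = safe
        n = 2
        while candidate in used:
            candidate = f"{safe}_{n}"
            n += 1
        used.add(candidate)
        mapping[name] = candidate
    return mapping


if __name__ == "__main__":
    names = ["CO2 [ppm]", "CO2_(ppm)", "_temp", "depth"]
    for src, dst in make_safe_names(names).items():
        print(repr(src), "->", dst)
```

Two different source names can collapse to the same sanitized name, which is why the duplicate check matters as much as the character replacement.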

I assume if DAP4 gets approved, ERDDAP might look at supporting it and when that happens we can address full Unicode compatibility. Unicode does have general categories we can use to simplify life (e.g. ban all control characters which is General Category Cc).
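As a rough illustration of how Unicode general categories could define an allowed set, here is a Python sketch using the standard `unicodedata` module. The policy shown (letters, decimal digits, and underscore only) is just one possible choice for this example, not something CF or ERDDAP has adopted, and the function names are hypothetical.

```python
import unicodedata


def is_name_char_allowed(ch):
    """Allow letters (categories L*), decimal digits (Nd), and underscore."""
    category = unicodedata.category(ch)
    return ch == "_" or category.startswith("L") or category == "Nd"


def is_name_allowed(name):
    """Return True if every character in a non-empty name is allowed."""
    return bool(name) and all(is_name_char_allowed(ch) for ch in name)


if __name__ == "__main__":
    for candidate in ["title_fr", "temp\u00e9rature", "\u6e29\u5ea6", "temp[0]", "a\tb"]:
        print(repr(candidate), is_name_allowed(candidate))
```

Under this policy control characters (Cc) and punctuation (P*) are rejected without listing them one by one. Note that accented letters written with combining marks (category Mn) would also be rejected unless the name is normalized first.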

In terms of security issues, this is an assumption on my part but most DAP libraries should have support for proper encoding of URLs since they are allowed in the specification. Since ERDDAP does generate URLs though, we would have to look at where we are generating DAP or other URLs and ensure proper escaping is applied (and then handled properly). This seems a bigger project for variable names than attribute names, so perhaps we can start with attribute names (which is most of my use case anyways) and move on to variable names after? I don't think attribute names appear in URLs in ERDDAP, just in HTML text. Proper escaping in HTML would still need to be applied.
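The two kinds of escaping mentioned above are different operations, which the following Python sketch shows with standard-library calls. The sample name and the function names are made up for illustration; this is not how ERDDAP (a Java application) does it.

```python
from html import escape
from urllib.parse import quote, unquote


def encode_name_for_url(name):
    """Percent-encode a name so no character keeps a special URL meaning."""
    # safe="" also encodes "/", which quote() leaves alone by default.
    return quote(name, safe="")


def encode_name_for_html(name):
    """Escape a name for use in HTML text or attribute values."""
    return escape(name, quote=True)


if __name__ == "__main__":
    name = "temp[surface]&max"  # made-up example name
    print(encode_name_for_url(name))
    print(encode_name_for_html(name))
    # Decoding the URL form gives back the original name.
    print(unquote(encode_name_for_url(name)) == name)
```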

In terms of localization, my proposal to CF was to simply provide "title" (for English or a different locale as specified in the "locale_default" attribute), "title_fr", "title_jp" etc. and have tools like ERDDAP fall back to the default attribute if it wasn't available. All of the available locales are documented in "locale_others" so there is a simple list. This aligns with BCP 47 which deals with localization in web applications where a requested language isn't available. I would not propose translating the attribute names themselves, I think that is a logistical nightmare as you said, but I think it is useful to support a suffix on standard attributes for offering the content itself in different locales.

ERDDAP already offers an internationalized interface which is great. But it's actually an accessibility issue at the moment because it puts English text (from dataset names and details) into a French (or other language) web page without noting that the text is not in the language of the page (a violation of WCAG 3.1.2 Language of Parts, an AA criterion that organizations here in Canada are often required to meet and that the US federal government aims for under Section 508). Duplicating the dataset just makes the accessibility worse and, honestly, I would find it confusing as a user (two datasets that are the same but in different languages?). What I've done for now is use a pipe to separate the English and French, and write a small Javascript tool that separates them, displays the appropriate one, and adds the language attributes as needed to make it accessible. A proper solution will need to understand what the language of the dataset is (maybe assuming it's English) and then at least display the language attributes (then duplicating the dataset isn't an accessibility issue at least), but I think it's even better if ERDDAP could understand the metadata in multiple languages and display it.

In ERDDAP terms, this would mean the following:

Read the XML in dataset
Define the default title, summary and long_name for each attribute in the current way but note that they have a locale as noted in locale_default. If there is no locale_default, I think en is a good assumption and we can note that in the documentation.
Read the locale_others attribute, if present. If it is present, split it by spaces and look for whatever convention CF ends up defining for localized metadata (e.g. title_fr if the locale is fr). Note these alternatives with their locale.
When making a request that displays the title, summary or long name to the user in HTML, use the locale requested (as per the language switcher which adds it as a prefix in the URL) to identify the best match from the locales we have (BCP47 has rules on how to do this, but given the limited selection of locales for ERDDAP, the algorithm should be straightforward). If none match, use the "default" one (i.e. the one without a locale suffix) as I've proposed in the CF conventions.
Display the matching title, summary and long names to the user. If the locale isn't an exact match to the one requested (which should be in the lang attribute on the html tag), then add a lang attribute to the closest html tag to the text with the locale in it (or add a span tag with the lang attribute).
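The steps above can be sketched in Python as follows. This is only an illustration of the proposed lookup and fallback: the `name_locale` suffix form (e.g. title_fr) is the convention under discussion and not finalized, the matching is a simplified stand-in for the BCP 47 rules, and the function names are hypothetical.

```python
from html import escape


def pick_localized(attrs, name, requested):
    """Return (value, locale) for an attribute, falling back to the default."""
    default_locale = attrs.get("locale_default", "en")
    others = attrs.get("locale_others", "").split()
    # Simplified matching: the exact tag first, then the bare language
    # subtag (e.g. "fr-CA" -> "fr").
    for locale in (requested, requested.split("-")[0]):
        if locale == default_locale and name in attrs:
            return attrs[name], default_locale
        key = f"{name}_{locale}"
        if locale in others and key in attrs:
            return attrs[key], locale
    # No match: use the attribute without a locale suffix.
    return attrs.get(name, ""), default_locale


def render(attrs, name, page_lang):
    """Return HTML text, wrapped in a lang span if it differs from the page."""
    value, locale = pick_localized(attrs, name, page_lang)
    text = escape(value)
    if locale == page_lang:
        return text
    return f'<span lang="{escape(locale)}">{text}</span>'


if __name__ == "__main__":
    attrs = {
        "title": "Sea surface temperature",
        "title_fr": "Temp\u00e9rature de surface de la mer",
        "locale_default": "en",
        "locale_others": "fr",
    }
    print(render(attrs, "title", "fr"))
    print(render(attrs, "title", "de"))
```

Even for datasets with no localized metadata, the fallback branch marks the default-language text with a lang attribute when it appears on a page in another language.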

This change will greatly improve the accessibility of ERDDAP in handling alternative languages even if nobody localizes their metadata because it will properly add the lang attribute as needed to English text and make ERDDAP more WCAG compliant (good for us all). It also will mean organizations like mine (Fisheries and Oceans Canada, and other Canadian organizations) will be able to offer a full French language equivalent as required by law here.

In terms of safety and using the method in lots of situations in code, I think I would (if I were writing it), create a new method and replace it where appropriate in code. That way we avoid unintended consequences.

As for ISO-19115, I agree the format itself is unilingual and I'm not proposing we change attribute names in CF/ACDD for exactly that reason - the interoperability is important. I only want to add suffixes for providing the value in different languages. This is something that ISO-19115 does provide via PT_FreeText, for example, and it is then used in tools like CKAN to display the metadata in multiple languages:

<gmd:organisationName xsi:type="gmd:PT_FreeText_PropertyType">
    <gco:CharacterString>Government of Canada; Fisheries and Oceans Canada; Fishery &amp; Assessment Data Section</gco:CharacterString>
    <gmd:PT_FreeText>
        <gmd:textGroup>
            <gmd:LocalisedCharacterString xmlns="" locale="#fr">Gouvernement du Canada; Pêches et Océans Canada; Section des données de pêche et d'évaluation</gmd:LocalisedCharacterString>
        </gmd:textGroup>
    </gmd:PT_FreeText>
</gmd:organisationName>

In CDL/NetCDF/CF parlance, I propose we adopt a similar convention, but without XML we can't do a nested structure. So instead, my proposal is

organization_name: "Government of Canada"
organization_name_fr: "Gouvernement du Canada"

turnbullerin commented 1 year ago

I would further note that I think localization can be fully separated from the expanded character set support - a naming convention like "title_fr" can be done with the existing attribute naming conventions. However, with the CF workshop groups heavily leaning towards expanding support to full Unicode with exceptions (to align with the NetCDF standard), I thought it worth mentioning it here.

Reading back, I also think there was some confusion over my proposal and I wish to be clear that I don't want to translate CF attribute names, I think that's a nightmare. I just want a convention for having alternative content in different languages, but the attribute names themselves can be English-only. My original thought was to do "attribute_name_locale" but the CF convention people are now discussing if it is worth using the expansion of allowed characters under CF to separate the locale from the attribute name.

That said, it seems like it adds a lot of complexity and I'm going to push them for a localization convention that doesn't require expanding the character set based on this discussion since I think it will take a long time to test this to ensure there are no unintended consequences and I'd rather work with small changes to see localization happen faster and worry about Unicode/US-ASCII later.

BobSimons commented 1 year ago

Wow^2! So many issues. If convenient, for simplicity, please make inline responses to my responses.

I think supporting additional characters, notably new non-letter characters, in variable and attribute names is a bad idea. Chris can weigh in. Like all computer languages, most analysis software (R, Python, Igor), standards (CF, ACDD, DAP, XML), etc that I know of, ERDDAP restricts the characters to _a-zA-Z (others sometimes support 1 or 2 additional characters, which ERDDAP deals with by being more strict to avoid trouble). Note that parsing algorithms in various places rely on these definitions, so changing the definition in ERDDAP will cause all kinds of problems in all kinds of software. It doesn't make sense to modify ERDDAP in a way that goes against the standards and causes all kinds of problems when exporting data to those other systems. Plus, no one is smart enough to foresee all of the consequences of changes like this.

Plus, it would be in violation of the CF, ACDD, DAP, etc. standards. As Roy said, if ERDDAP (when it emits data files) supports fewer characters in identifiers, ERDDAP is still in compliance with CF, ACDD, and DAP, but if it supports more characters, ERDDAP isn't in compliance. Sometimes strict compliance doesn't matter, but in this case I think it does. If you get CF, ACDD (which is inactive), DAP (which has refused to make DAP 2.1), and others to change, then I will think it is a valid thing to consider (but I still think it is a very bad idea). (Note that I don't consider "a working group is considering this" to be anywhere near "the new version of CF supports this".)

CF used to require that proposals include a real life example(s) of why the change is useful and needed. That is a great idea. Can you please give me examples of what you want to do with added chars in identifiers?

Next issue: You say "Duplicating the dataset just makes the accessibility worse and, honestly, I would find confusing as a user (two datasets that are the same but in different languages?)." Why is accessibility worse? You can explicitly say on your ERDDAP home page that "All datasets with datasetID's that end in _fr are identical to the _en datasets except that the attribute values are in French". A French speaking user will choose to work with the French version of the dataset. With this approach, you get to translate all of the attribute values, you don't need any changes to the CF or ACDD (which is inactive) standards, or ERDDAP and you can do it today. (Doesn't that solve all the problems? Hallelujah!) If you don't want French users to see any English (and vice versa), then set up 2 ERDDAPs, one with the English datasets and one with the French datasets. (You could even be fancy at the Apache/Tomcat level and direct all requests to https://...erddap/fr/... to the French ERDDAP and perhaps vice versa.) I still think this is a good idea. Please tell me why you think it isn't.

I'll point out that the European Union has 24 official languages (and counting) which means that the metadata for a dataset might become quite voluminous.

As with the characters-in-identifiers issue, I don't think ERDDAP should be the trailblazer because it makes ERDDAP not standards compliant. If you get CF, ACDD (which is inactive), and others to add support for e.g., _fr at the end of attribute names to identify the language, then I will think it is a valid thing to consider. I think this is a reasonable proposal (although it is not necessary if you make separate datasets).

You wanted to add localized titles via title_fr, title_jp, etc. That seems like a slippery slope and doesn't solve the problem you said you wanted to solve: offering e.g., French versions of all of the metadata. Don't you want/need all the other text attributes to be localizable with this system? If so, then please be straightforward and say what you want. If not, then tell me why you don't want e.g., summary to be treated the same way.

larsbarring commented 1 year ago

What would the implications be for ERDDAP if CF expands the character set allowed for attribute names to include hyphen -, as well as either period . or the two square brackets [ ]? See recent comments in https://github.com/cf-convention/discuss/issues/244 for background.

rmendels commented 1 year ago

@larsbarring

This is more a Bob question, I do not know enough of all the ins and outs of the code. Remember Bob is retired and deals with these when he feels like it (and well he should), so a response may be a couple of days in coming.

larsbarring commented 1 year ago

Just for convenience, if the CF conversation moves on, here is what I meant with "... recent comments ...": https://github.com/cf-convention/discuss/issues/244#issuecomment-1773572248 https://github.com/cf-convention/discuss/issues/244#issuecomment-1773858500 https://github.com/cf-convention/discuss/issues/244#issuecomment-1773861426 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775087332 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775257647 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775623793

And for an alternative approach: https://github.com/cf-convention/discuss/issues/244#issuecomment-1773636422 https://github.com/cf-convention/discuss/issues/244#issuecomment-1775281409 https://github.com/cf-convention/discuss/issues/244#issuecomment-1776079418

BobSimons commented 1 year ago

@larsbarring, I just read the very first part of the CF discussion. Of the choices you are considering, I would recommend requesting a change to allow attribute names like title_fr_ca, because it is a simple extension of the existing attribute names (so it's easy to read and understand) and requires no other change (i.e., allowing other characters in attribute names).

The problem with allowing the characters ".-[]" in attribute names is that those characters are already used for other purposes in languages (like Python, etc.), software (R, DAP 2.0, etc.), and file types (.mat, etc.). The trouble appears when the data from ERDDAP gets into those languages/software/files, notably when attribute names are converted into identifiers (e.g., datasetID.variableName.attributeName), where those characters are not allowed and are already reserved for other purposes: "." is used as a separator to show parent.child relationships, "-" is used for negation and subtraction, and "[]" are used for dimensions. Since we aren't going to change R, DAP 2.0, Python, etc., making this change to CF and ERDDAP seems pointless and troublesome.

Note that ERDDAP already deals with a similar problem: some file types (e.g., Matlab's .mat) only support short variable names. ERDDAP deals with this by shortening longer names in a way that makes it unlikely that similar long variable names will be converted to the same short name. But solutions like this are not ideal because ERDDAP is promising one thing (certain variable names) and returning a file with something else (shorter, different names). So ERDDAP could sanitize ".-[]" characters, but it would have to do so in many places, for widely used languages/software/file types, and that would be a really kludgy solution.
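To make the shortening idea concrete, here is a minimal sketch in Python. It is not ERDDAP's actual algorithm: the function name `shorten_name`, the 31-character limit, and the hash-suffix scheme are all assumptions chosen for illustration.

```python
import hashlib


def shorten_name(name: str, max_len: int = 31) -> str:
    """Shorten a name to at most max_len characters.

    Hypothetical scheme: keep a prefix of the name and append a short hash
    of the FULL name, so that two long names sharing a prefix are unlikely
    to be converted to the same short name.
    """
    if len(name) <= max_len:
        return name
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:6]
    # prefix + "_" + 6 hex digits adds up to exactly max_len characters
    return name[: max_len - 7] + "_" + digest


long_a = "sea_surface_temperature_monthly_mean_anomaly_a"
long_b = "sea_surface_temperature_monthly_mean_anomaly_b"
print(shorten_name(long_a))
print(shorten_name(long_b))
```

The sketch also shows the drawback described above: the file ends up with a name that differs from the one the server advertised.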

Zooming out: I think your request to allow ".-[]" in attribute names is a bad idea because it goes against a widely used standard in the computing world (languages, file types, software, etc): the characters used in identifiers are quite limited (generally they must start with _a-zA-Z and then include just _a-zA-Z0-9 ). Going against this is inviting trouble (as shown by the examples above).
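As an illustration of that rule (a sketch only, not ERDDAP's own variableNameSafe check), the pattern "start with _a-zA-Z, then _a-zA-Z0-9" can be written as a regular expression. The helper name `is_name_safe` is hypothetical.

```python
import re

# Assumed rule, taken from the paragraph above:
# first character in _a-zA-Z, remaining characters in _a-zA-Z0-9.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_name_safe(name: str) -> bool:
    """Return True if the whole name matches the restricted identifier pattern."""
    return SAFE_NAME.fullmatch(name) is not None


for candidate in ["title", "title_fr_CA", "title.fr-CA", "title[fr]", "2nd_title"]:
    print(candidate, is_name_safe(candidate))
```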

I'll add: if e.g., title_fr_ca becomes valid in the CF world, I think it would be straight-forward to add support for it in ERDDAP, given ERDDAP's existing support for other languages in the web interface. For example, in some situations (e.g., the table of matching datasets which is returned when users do a search for datasets of interest), ERDDAP could display the appropriate title_x_y and summary_x_y variant as the dataset's title and summary. Although, in many circumstances (the attribute lists at the bottom of the Data Access Forms), I think all of the metadata should be shown as is.
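A rough sketch of what such a lookup could look like, written in Python for brevity (ERDDAP itself is Java). The function `localized_attribute` and the fallback order (language plus region, then language only, then the default attribute) are assumptions, not an agreed CF or ERDDAP behavior.

```python
def localized_attribute(attrs: dict, name: str, language: str) -> str:
    """Return the best matching variant of an attribute for a language tag.

    Tries name_<lang>_<REGION>, then name_<lang>, then the plain name.
    """
    # Normalize a tag such as "fr-CA" to the underscore form "fr_CA".
    tag = language.replace("-", "_")
    candidates = [f"{name}_{tag}"]
    if "_" in tag:
        # Fall back from "fr_CA" to the language-only variant "fr".
        candidates.append(f"{name}_{tag.split('_')[0]}")
    candidates.append(name)
    for key in candidates:
        if key in attrs:
            return attrs[key]
    raise KeyError(name)


attrs = {
    "title": "Sea Surface Temperature",
    "title_fr": "Température de surface de la mer",
}
print(localized_attribute(attrs, "title", "fr-CA"))  # uses title_fr
print(localized_attribute(attrs, "title", "de"))     # falls back to title
```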

I hope that is clear and makes sense. If I need to address some specific point from the CF discussions, please let me know.

Best wishes.

larsbarring commented 1 year ago

@BobSimons thank you for your explanation. I am aware that all the suggested characters have special, and different, meanings in various languages, as do any number of reserved words, and more. I think that it depends on the programming style (even paradigm) how these matters are handled.

If the example pattern you provide ("datasetID.variableName.attributeName") is indeed used in ERDDAP I think that datasetID.variableName.attributeName.locale seems like a natural extension for those attributes that have a localized version, whereas datasetID.variableName.attributeName_locale breaks this pattern, and may be more complicated by the fact that the attributeName part may have one or several underscores, and the locale part may have one underscore.

Anyway, I have no insight into the inner workings of ERDDAP so I am not commenting on this, but I do understand that it might be a substantial task to make the necessary changes. To round off I would like to mention that there are now and then (and as I reckon increasingly often) well motivated requests from various communities to relax the CF restrictions regarding which characters are allowed in variables and attributes.

BobSimons commented 1 year ago

I think attributeName.locale as an attribute name would cause problems in e.g, R and Python, because the .locale would be part of the attribute name. So it would become something like datasetID.variableName."attributeName.locale", which is messy/confusing.

I see your point about the possibility of _ in an attribute name that then also has _language or _language_locale. But 1) Presumably, ERDDAP would only look for/care about language specifications added to specific attributes (e.g., title, summary). 2) Perhaps the revised suggestion is to use notation like title__fr_ca (2 underscores before the language id).
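A small sketch of why the double underscore helps: it gives an unambiguous split point even when the base attribute name and the locale both contain single underscores. The function name `split_localized_name` is hypothetical, and the double-underscore notation is only the suggestion floated above, not an adopted convention.

```python
def split_localized_name(attribute_name: str):
    """Split 'title__fr_ca' into ('title', 'fr_ca').

    Names without a '__' separator are returned with a locale of None.
    """
    base, separator, locale = attribute_name.rpartition("__")
    if not separator or not base or not locale:
        return attribute_name, None
    return base, locale


print(split_localized_name("title__fr_ca"))    # base and locale are separated
print(split_localized_name("long_name__fr"))   # single underscores stay in the base
print(split_localized_name("standard_name"))   # no locale present
```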

I am more open to adding support for e.g., accented characters from ISO-8859-1 in attribute names, but even this will probably cause a lot of problems in other software, making it a bad idea. Again, other software tends to be very restrictive about what characters are allowed in identifiers. If CF changes, then we'll probably change ERDDAP, but you'll never change Python, R, Matlab, ESRI files, Igor, DAP 2.0, etc., so changing CF looks like a bad idea to me.

I remain doubtful that allowing other punctuation characters is a good idea. Punctuation characters are often used for other purposes in other software and, again, most software is very restrictive about allowing punctuation characters in identifiers.

One of my objections to CF in general when I tried to work with them (a disaster) was that it was mostly run by scientists, with little input or care for what software developers thought (at least what I thought). So they may decide that allowing additional characters makes sense to them (and it may be fine in the CF/netcdf bubble), but it may cause lots of problems in the wider software world (Python, R, Matlab, ESRI files, Igor, DAP 2.0, etc.).

Best wishes.

Dave-Allured commented 1 year ago

Hello. I am the advocate for CF issue #237 to remove character set restrictions on the names of netCDF variables, attributes, etc.

I appreciate the concerns about breaking existing ERDDAP code and applications. Would it be feasible to implement some kind of modifier to ERDDAP client requests, such that a knowledgeable application could request original, unmodified netCDF object names, and skip the name sanitizer? Existing applications, both internal and external, would be safe because they would continue to be exposed to only the default, sanitized names.

A scheme like this would allow gradual migration for localization as well as other internationalization and naming strategies.

BobSimons commented 1 year ago

I think you are missing my main point: Yes, these changes will cause problems within ERDDAP (e.g., because of DAP 2.0 limitations) that are hard to deal with, but the far bigger issue is that these changes are incompatible with major external applications (R, Python, Igor, Esri, etc.) where the data is actually used. Maybe we can find solutions to the problems in ERDDAP, but you will never find solutions / make changes to R, Python, Igor, Esri, etc. You're going against a tradition of limited characters in identifiers that has always been the dominant system in the computer world.

I don't get why you are so adamant about this when it isn't needed for localization and internationalization. Allowing all Unicode characters in String data and in attribute values is what you really need and we largely have that (other than some legacy file types). Why isn't your problem solved (although not in the way you want) with e.g., title_fr and then attribute values that allow Unicode? That is a tiny change to CF (that doesn't require a change to the characters allowed in attribute names) and causes no problems with all of the external computer languages and software.

Dave-Allured commented 1 year ago

@BobSimons, no, I am not missing your main point. I am suggesting some kind of ERDDAP bypass mechanism for aware software that will completely avoid inserting raw netCDF names into code and command namespace. I imagine this would not be very difficult for ERDDAP, but I do not know ERDDAP internals.

My proposed changes are not at all incompatible with various programming languages. They merely require some discipline to keep data names inside string variables, rather than in code namespace.

Developers of core data formats such as HDF5 and netCDF went out of their way, many years ago, to enable a very wide character set for data storage names. Many users would like to utilize more parts of that character set. CF and related standards are blocking that. I understand the temptation to insert raw variable names into code namespace. This is how COARDS, CF, and ERDDAP evolved. However, IMO this particular "tradition" is flawed. I think it is time to step forward and embrace a modern character set and a more flexible way of working with data names.

Erin's localization proposal is a tiny subset of the character set debate. It bothers me a little that this has been conflated with the full UTF-8 proposal. However, here we are. The core technical issue is the same in both cases -- how to safely support an expanded character set.

rmendels commented 1 year ago

@Dave-Allured @BobSimons

When this first came up I suggested it would sure be nice to have this tested with the major netcdf (and ERDDAP) clients, exercising their full capabilities, before this suggestion becomes part of the standard. At least be certain what does and doesn't work, so a decision is made with all the facts. As I have said, I understand where the request is coming from, but I don't understand all of its consequences, and what I know of some of the clients suggests that they will break, but I don't know that for certain, and even more I don't understand the rush to come to a decision. Let's be certain of all of these types of things first.

BobSimons commented 1 year ago

What a mess. These conversations (especially in the CF mailing list, but here, too) always end up with all kinds of misunderstandings and mischaracterizations, debating different options at the same time.

@Dave-Allured, I'm sorry I said you missed my main point, but you did express your "concerns about breaking existing ERDDAP code". But I don't like your main solution (a switch to request sanitized or unsanitized variable and attribute names) because that is way too messy -- it presumes people always know ahead of time in which applications the data file will be read and the limitations of those applications. And requesting "sanitized" names is messy because different applications will need different sanitation procedures. That is largely why ERDDAP settled long, long ago on the naming conventions that it has. The fact that you will never change how Python, R, Igor, DAP 2.0, etc. work should convince you that these proposals to allow more characters are a bad idea. When viewed in the isolation of nc4 and hdf5 files, the new characters are appealing and cause no problems -- the problem is the wider world of file types and applications. As I've said many times: for 60+ years, the world of software languages and applications has used very limited character sets for identifier names so that punctuation could be used for other things (e.g., . for parent.child, - for negation and subtraction, and [] for dimensions). You're going against an ocean of precedent.

And with your solution of sanitized names, you're neutering what you wanted: new options for variable and attribute names. You'll have the CF docs saying things like "although the preferred form is e.g., title-fr, in some applications this will appear as title_fr." If e.g., title-fr is going to end up as title_fr in some places, why not just use title_fr all of the time?

Yes, it is unfortunate that we're discussing a couple of proposals simultaneously (allowing .-[] in names vs allowing any Unicode character), but @turnbullerin started this thread by mentioning the full Unicode option and then changed to adding a limited set of punctuation, then changed to allowing .-[] (sorry if I got that a little wrong). And even you have mentioned different proposals are in the works. But to me, any proposal to support punctuation in identifiers is a bad idea.

In the spirit of CF's requirement that people give examples of why a change is really necessary, it would be really nice if you gave more examples showing why these changes are really needed so we can debate those one at a time. You're just jumping to the changes needed without giving the reasons (other than requesting e.g., title-fr, but you could use title_fr so that is to me not a good reason).

Finally, I don't know what the developers of hdf5 were thinking when they allowed full Unicode in variable and attribute names (which nc4 developers then utilized), but it could easily be that they were simply future-proofing their new file format. To me, that is very reasonable. I probably would have done the same. Then other groups (e.g. CF) can choose which characters are allowed in general for their domain and which have special meanings (e.g., / is disallowed because it is used to separate parent/subgroups). It was easy for hdf5/nc4 developers to support full Unicode since hdf5/nc4, when viewed by themselves, are self-contained worlds. The developers didn't concern themselves with the downstream effects of different characters in different client software because it wasn't their problem. But it is CF's problem. [I know they are focused largely on nc4, but they should also be focused on the languages and applications in which those files will be read.] And it is ERDDAP's problem. So I don't think allowing .-[], or punctuation in general, or full Unicode, is a good idea because of the problems in the wider world of software languages, file types, and applications.

Best wishes.

Dave-Allured commented 1 year ago

@BobSimons, thank you for your detailed response. I think a new switch to request original netCDF names would not be messy. I do not know ERDDAP, so that is only my off-the-cuff opinion. We can disagree about that. I will try to stop talking about a wide character set now.

My preferred localization requires two new characters, ASCII only; period and hyphen, such as title.fr-CA. I consider this optimal for future purposes; as in, better than e.g. title_fr_CA. You have a point in your earlier comment that this is a seemingly tiny change from other strategies which would be fully compatible and effective. Nonetheless, would it be easy and non-messy for ERDDAP to provide a limited switch for attribute names only, such that knowledgeable applications may request that those two characters be preserved?

BobSimons commented 1 year ago

You're asking if something is possible. I want to answer "is it a good idea?"

Again, you ask if changes to ERDDAP are possible, but you ignore the downstream effects (other than the messy solution of offering a switch that knowledgeable users can use to request sanitized attribute names), which is my big point.

I think that in R and Python, when a datafile is read in, attribute names can be represented as e.g., datasetID.variableName.attributeName . In that case, the '.' and '-' become trouble because '.' is used to indicate parent.child relationships and '-' is used for negation and subtraction. Maybe you can use datasetID.variableName."attributeNameWith.And-" but I don't know. And what about all the other analysis programs that you and I don't know about and don't even know they exist? Simple character sets avoid trouble.
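The concern can be illustrated with plain Python objects (a sketch only; it uses `types.SimpleNamespace` as a stand-in and does not reflect how any particular netCDF or DAP client library exposes attributes). Identifier-safe names work with dotted access, while a punctuated name is reachable only if it is kept inside a string.

```python
from types import SimpleNamespace

attrs = SimpleNamespace()
setattr(attrs, "title", "Sea surface temperature")
setattr(attrs, "title_fr_CA", "Température de surface de la mer")
setattr(attrs, "title.fr-CA", "Température de surface de la mer")

# Dotted access works for identifier-safe names.
print(attrs.title_fr_CA)

# The punctuated name is reachable only while it stays inside a string.
print(getattr(attrs, "title.fr-CA"))

# Written as code, the punctuated name is not a lookup of that name at all:
# Python parses it as (attrs.title.fr) - CA.
try:
    eval("attrs.title.fr-CA", {"attrs": attrs})
except AttributeError as err:
    # attrs.title is a str, and a str has no attribute named "fr"
    print("dotted access failed:", err)
```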

I think the idea of a switch to request sanitized names is a bad idea. It is counter to the norm of metadata appearing in a consistent way in different places. There is no standard DAP way to make this request (although ERDDAP could add one, e.g., &sanitizedAttributeNames ). And again, this presumes the user knows ahead of time which apps s/he is going to use the file in and the requirements of the app. And when ERDDAP reads data files from a dataset with a mix of unsanitized and sanitized att names, how is it supposed to know that the sanitized attributes should be unsanitized? But the bigger point is: you and I don't know all about all of the clients so we don't know where this will cause problems. And I'm pretty sure this will be a big annoyance for users who get snagged by invalid attribute names.

Partly, with things like this, it feels like inviting trouble. I'm smart enough to see there will be problems in various places, but I'm not smart enough (I think no one is) to foresee all of the consequences. It's a bad idea to make changes when you can't foresee the consequences.

All of this just seems like so much trouble (and we won't know how much until we do it and it can't be easily undone) for so little benefit. title_fr_CA is such a simple extension of the CF standard and will not cause any problems with any client file types or analysis programs.

I find it interesting that people treat CF, ERDDAP, etc as so malleable. (Well, in a sense, they are.) But if the program you wanted to change were, e.g., Python, R, Igor, ArcGIS, MS Excel, or Postgresql, you wouldn't think to ask for a significant change like adding punctuation characters to identifiers. You would just use title_fr_CA and be back at work in 1 minute. Instead you're asking for significant changes to ERDDAP and you are unconcerned (apparently) about possible problems (like users being confused about when they need to request sanitized att names).

And partly, I think this is a slippery slope. If you get one or two punctuation characters approved, the requests for other characters will be easier. But again, they will cause problems in various places and you and I would not be able to predict all of those consequences.

BobSimons commented 1 year ago

Again, I worked hard to expand attribute strings so they could be Unicode as much as possible. That was the important change.

BobSimons commented 1 year ago

Let me rephrase and emphasize one point: With your proposal to optionally sanitize attribute names, CF would have to say that both title.fr-CA and title_fr_CA are legal. That is a crazy situation (and an extra pain for software tasked with reading and interpreting DAP .das, .nc, .nccsv, and other files) given that the standard could be for just title_fr_CA.

BobSimons commented 1 year ago

Here's another consequence of your proposal to allow .-[] in attribute names and have a switch to specify whether a request should return .-[] in the names or sanitize the names:

If the default for the switch is to return .-[] in the ERDDAP response data file, then it is likely that existing workflows of some users will suddenly fail because the .nc files they get from ERDDAP will suddenly have .-[] (when the data set adds title.fr-CA). I've tried really hard to not break/change existing behavior in ERDDAP partly because a big chunk of my time (and Roy's) was spent dealing with source datasets (I'm looking at you NCEI and NASA) where things changed (server moved, datasetID changed, directory changed, variable names changed, ...). It's bad when existing workflows break.
If the default is for the switch to sanitize .-[], then what was the point of having .-[] in the attribute names on ERDDAP web pages but not in the data file response? And we are back to the problem of CF having to say that both title.fr-CA and title_fr_CA are legal.

Dave-Allured commented 1 year ago

@BobSimons, I am all in favor of not breaking existing workflows. Here is how I think that could be achieved.

With optional character support added, ERDDAP default behavior should remain unchanged. Downstream traditional applications will continue to see only sanitized attribute names. Embellished names such as originally title.fr-CA will appear with "safe" names, with underscores only. Traditional applications will handle these generically as unrecognized extra attributes, and will not try to do anything special with them.

This requires that traditional applications will safely tolerate unrecognized attributes. Is that a safe assumption?
The switch or switches to enable optional characters must be implemented as optional syntax, such that default syntax of function calls and request URL's remains unchanged. In other words, implement as an optional function argument, or a new function name, or an optional keyword in a URL. Is an optional argument or a new function name or a new keyword possible in ERDDAP?
Data creators must continue to provide the expected traditional, unembellished attribute names in their netCDF files, at least during a migration interval. In this way, traditional applications will continue to see e.g. plain title, even in the presence of multiple title.XYZ. They will not know the difference, and will keep working normally. No extra intelligence within ERDDAP would be needed to support this.

Dave-Allured commented 1 year ago

Um, I am afraid I have made this more complicated than necessary. Please ignore my last post. What would happen if there were no new switches? What would happen if a traditional, unmodified application received a data variable through ERDDAP, and there were two attributes available, only one including unsanitized characters? Such as title and title.fr-CA?

BobSimons commented 1 year ago

My response to just your latest email:

When ERDDAP loads a dataset, it uses addAttributes to modify/add to the sourceAttributes, in order to make combined attributes. Currently, if an attribute name has .-[], that is not allowed and ERDDAP will throw an error and not load the dataset. Under your proposal, those chars would be allowed and the dataset would load.
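In outline, the behavior described above might look like the following Python sketch. It is an illustration under assumptions, not ERDDAP's Java implementation: the function `combine_attributes` is hypothetical, and the allowed-name pattern is the restricted identifier rule mentioned earlier in this thread.

```python
import re

# Assumed rule: first character in _a-zA-Z, the rest in _a-zA-Z0-9.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def combine_attributes(source_attributes: dict, add_attributes: dict) -> dict:
    """Overlay add_attributes on source_attributes and reject unsafe names."""
    combined = {**source_attributes, **add_attributes}
    for name in combined:
        if SAFE_NAME.fullmatch(name) is None:
            # Stand-in for the dataset failing to load.
            raise ValueError(f"attribute name {name!r} is not allowed")
    return combined


source = {"title": "SST", "institution": "Example Institute"}
print(combine_attributes(source, {"title": "Sea Surface Temperature"}))

try:
    combine_attributes(source, {"title.fr-CA": "Température de surface de la mer"})
except ValueError as err:
    print("dataset not loaded:", err)
```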

Your proposal needs the switch because users need to be able to specify if they want e.g., a .nc file with .-[] (for applications that allow/handle that) or with sanitized names (for applications that don't). And then they need to know what they need (a big mess). I think it is better to avoid changes to CF, nc3(?), and ERDDAP (and perhaps the DAP responses used internally) because the mess/confusion just propagates out from that. It is better to avoid the mess and use characters that are already allowed, e.g., title_fr_CA.

Dave-Allured commented 1 year ago

@BobSimons, that was not my question. I am trying to address only your issue of protecting current workflows. What would happen in a current workflow, changing only the sanitizer, if a traditional, unmodified application was to receive a data variable through ERDDAP, and there were two attributes available, only one including unsanitized characters? Such as title and title.fr-CA?

turnbullerin commented 1 year ago

@Dave-Allured I think this depends on which pathway that application is taking to get data from ERDDAP. ERDDAP supports multiple input and multiple output formats, of which NetCDF is only one option (albeit a major one especially for input files).

For example, DAP2 supports URL-encoded US ASCII characters in attributes and variable names. An unmodified application making a DAP2 request to ERDDAP should receive the attribute title.fr-CA but with URL-encoding like title%2Efr%2DCA. A good DAP library will decode it to title.fr-CA since this is a legal DAP2 attribute name. I haven't looked at the ERDDAP internals to see if the encoding is actually being done (since it isn't necessary given the current restrictions) but the spec for DAP2 specifies this is legal. If ERDDAP doesn't encode it properly, I would suspect it would cause a parse error for the request on the client side since title.fr-CA is not a valid attribute name without the encoding (the . in particular is illegal). Of note, some characters are illegal in DAP2 - it only supports US-ASCII. DAP4 is bringing in full Unicode support for attribute and variable names.
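For reference, here is a small Python sketch of that encoding round trip. The helper `percent_encode_name` and its choice of which characters to escape (everything outside A-Za-z0-9_) are assumptions for illustration; the exact set DAP2 requires should be checked against the spec. Note that the standard `urllib.parse.quote` leaves "." and "-" untouched, so it would not produce the encoded form by itself.

```python
from urllib.parse import quote, unquote


def percent_encode_name(name: str) -> str:
    """Percent-encode every byte outside A-Za-z0-9_ as %XX (uppercase hex)."""
    out = []
    for byte in name.encode("utf-8"):
        char = chr(byte)
        if char.isascii() and (char.isalnum() or char == "_"):
            out.append(char)
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


encoded = percent_encode_name("title.fr-CA")
print(encoded)                # title%2Efr%2DCA
print(unquote(encoded))       # title.fr-CA
print(quote("title.fr-CA"))   # title.fr-CA ("." and "-" are never escaped by quote)
```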

For some formats, I would expect it not to be an issue like all the NetCDF output formats. There again might be an encoding issue though - I haven't used the Java NetCDF libraries, but the Python ones have two methods of setting attributes - directly as attributes or via setattr(). Only setattr() supports attributes that don't follow Python variable naming conventions. However, that seems unlikely in Java since it's not as much of a pattern in Java to do things like dynamically set new attributes on the fly. NetCDF supports full Unicode attribute and variable names as of v3 I think.

Other formats like CSV don't have the attributes embedded and can support a wide variety of unicode characters for variable names (which are basically column headers in CSV and the like). That said, there are some exceptions still that need to be handled well - CSV variable names would need proper escaping for commas and double-quotes for instance. These are relatively easy to fix but do require a code change. Downloading a file without escaping would cause corruption (title.en-FR would be fine but title,en-FR would mess up everything and cause major data corruption).
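The escaping issue is easy to demonstrate with Python's standard `csv` module (a sketch of the general technique, not of ERDDAP's CSV writer). The column names below are made up for the example.

```python
import csv
import io


def write_csv(header, rows) -> str:
    """Write a header and rows using the csv module's default quoting rules."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # quotes fields containing commas or double quotes
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = ["time", "title,en-FR", 'depth "m"']
text = write_csv(header, [["2020-01-01", "x", 5]])
print(text)

# Reading the file back recovers the three original column names.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[0])

# A naive join without escaping splits the second name into two columns.
naive_columns = ",".join(header).split(",")
print(len(naive_columns))
```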

For some other formats, I would expect it to be a major issue. For example, downloading the data in MATLAB basically exports the data using MATLAB variables (in binary MAT-File format). All the variables at least (and possibly attributes, not sure if the metadata is exported) need to be a MATLAB variable name. Likewise, ESRI files have naming restrictions on variable names that are similar to MATLAB. These output formats would require an algorithm to rewrite variable and attribute names to acceptable variants (but also avoid duplication). These files won't even load unless this is done.
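A minimal sketch of such a rewriting algorithm, in Python. The rules used here (replace anything outside A-Za-z0-9_ with an underscore, prefix an "x" when the name does not start with a letter, add a numeric suffix on collisions) are hypothetical and do not reflect the exact MATLAB or ESRI naming restrictions or what ERDDAP does today.

```python
import re


def sanitize_names(names):
    """Map arbitrary names to identifier-safe, unique names (input order preserved)."""
    result = {}
    used = set()
    for name in names:
        safe = re.sub(r"[^A-Za-z0-9_]", "_", name)
        if not safe or not safe[0].isalpha():
            # Hypothetical prefix; assumes the target format requires a leading letter.
            safe = "x" + safe
        candidate, suffix = safe, 1
        while candidate in used:
            # Avoid duplicates by appending _2, _3, ...
            suffix += 1
            candidate = f"{safe}_{suffix}"
        used.add(candidate)
        result[name] = candidate
    return result


mapping = sanitize_names(["title.fr-CA", "title_fr_CA", "2d_field"])
for original, rewritten in mapping.items():
    print(original, "->", rewritten)
```

The first two inputs show the duplication problem: both names sanitize to title_fr_CA, so one of them has to be renamed again.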

BobSimons commented 1 year ago

I think I'm done talking about this. My final statement is:

I think it is very important to keep ERDDAP simple to use. This proposal for allowing additional punctuation characters in attribute names (e.g., title.fr-CA) makes it significantly more complicated for every ERDDAP user forever more. Plus, there is a simple alternative solution (e.g., title_fr_CA) that gives you what you want (a way to make localized variants of attributes) and which requires no changes to CF's attribute name requirements, or ERDDAP's code, or users' behavior and thus avoids all of the complexity of your proposal. The complication that this proposal imposes on all ERDDAP users is: users would have to know ahead of time (when they request a file from a dataset from ERDDAP) where the file they are requesting from ERDDAP is going to be used (different analysis software or languages) so that they could sometimes add a switch to the request telling ERDDAP to sanitize the attribute names (e.g., from title.fr-CA to title_fr_CA) if the file type includes attributes (e.g., .nc does but the ESRI files don't), and if the destination analysis software (e.g., R, Python, Matlab, Igor, IDV, IDL, ArcGIS, Ocean Data Viewer, and other programs you and I have never heard of) doesn't allow those particular special characters. Plus, the switch described above doesn't solve the problem for users who are downloading the original source files from ERDDAP's /files/ system. In that case, it would be up to those users to determine if there were any attribute names with special characters that would be trouble in the analysis software they are using so that they could sanitize those attribute names manually (how?) before importing the file(s) into the software. That is way too heavy of a burden to put on users who just want an easy way to get data from ERDDAP into their analysis software, especially when there is such an easy alternative solution (e.g., title_fr_CA) which avoids all of this complexity.

For some people, these additional characters won't ever be a problem (i.e., if they use a file type + analysis software combination that is able to handle the additional characters). It is very common for these people to say: "pffff, this isn't a problem" or to dismiss the problem as not being significant. But just because it isn't a problem for them doesn't mean it isn't a significant problem for some/many other users. When making decisions like this, standards groups like CF and creators of intermediary software like ERDDAP should take the needs of all users into account.

So I think it would be a very bad idea for CF, ACDD, and/or ERDDAP to allow additional punctuation characters in attribute (or variable) names.

I'll add that the decision to allow such characters (.-[]) in attribute names is very different for the creators of neutral file types like .nc4/hdf5, which are not tied to a specific analysis program or standard (like CF). They can and should look at the file type in isolation. By allowing more (or almost all) characters in attribute names, they are future-proofing their file type, and allowing the users of the file type (e.g., standards groups like CF that specify attribute name conventions for their community) to decide which characters their community will allow in those files given where and how those files will be used. So it makes sense for nc4/hdf5 to allow almost any character, but it doesn't make sense for standards groups or intermediary software like ERDDAP to allow characters that will be troublesome in their community.

It has often been said that Steve Jobs' brilliance as the person with the final say about product design at Apple was his judicious use of the word "no" in response to requests and suggestions for features that added unnecessary complexity to the products. The result was simple, easy to use products that users enjoyed using. I'm no Steve Jobs, but I think it is a good idea to try to follow the same path. This is a case where I think "no" is the correct answer.

BobSimons commented 1 year ago

And to be clear: I would support a CF proposal which will allow for the localization of specific attributes (notably, title and summary) by allowing optional language-specific variants of those attribute names (e.g., title_fr and title_fr_CA) provided that the original, default version of the attribute (e.g., title) was also present. It would also probably be fine to allow this for any attribute, not just specific attributes.

If such a proposal were ratified by CF and/or ACDD, then

when a user requested a specific language for a web page (e.g., fr) via the existing system for that, ERDDAP could look in the metadata for (e.g., title_fr or title_fr_CA) and use that information on the web page. Specifically, the methods EDD.title() could be changed to EDD.title(String language) and EDD.summary() could be changed to EDD.summary(String language). This is a relatively safe change because if ERDDAP developers initially miss a situation that could use the new information, it is not a serious problem and is a fixable problem.
when a user requests a file of a type that supports attributes (e.g., .nc) then ERDDAP would include all of the variants of e.g., title in the response file. Those new attribute names would not conflict with any analysis software or computer language where the file might be used, but would convey additional useful information.
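
The fallback described above (look for the most specific variant first, then fall back to the required default attribute) could be sketched roughly as follows in Java. This is only an illustration under assumptions: the class LocalizedAttributeLookup and its methods are hypothetical, the attributes are modelled as a plain Map rather than ERDDAP's own attribute classes, and the underscore suffix convention (title_fr_CA, title_fr) is the one suggested in this comment, not a ratified CF rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of ERDDAP's EDD class. */
class LocalizedAttributeLookup {

    /** Candidate attribute names, most specific first, e.g. title_fr_CA, title_fr, title. */
    static List<String> candidateNames(String baseName, String language) {
        List<String> names = new ArrayList<>();
        if (language != null && !language.isEmpty()) {
            // Accept either "fr-CA" or "fr_CA" from the caller.
            String suffix = language.replace('-', '_');
            while (!suffix.isEmpty()) {
                names.add(baseName + "_" + suffix);
                int cut = suffix.lastIndexOf('_');
                suffix = cut < 0 ? "" : suffix.substring(0, cut);
            }
        }
        names.add(baseName); // the default attribute, assumed to always be present
        return names;
    }

    /** Returns the first candidate that exists in the metadata, or null if none does. */
    static String lookup(Map<String, String> attributes, String baseName, String language) {
        for (String name : candidateNames(baseName, language)) {
            String value = attributes.get(name);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = Map.of(
            "title", "Sea Surface Temperature",
            "title_fr", "Température de surface de la mer");
        System.out.println(lookup(attrs, "title", "fr-CA")); // falls back to title_fr
        System.out.println(lookup(attrs, "title", "de"));    // falls back to title
    }
}
```

Because the default attribute is always the last candidate, a page that misses a localized variant simply shows the existing text, which matches the point above that an overlooked case is not a serious problem.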

Thus, this change to CF and/or ACDD,

would not require any changes to existing data files, but would allow optional changes.
would encourage some useful changes (not hard) to software like ERDDAP that could use the new information
would not require any changes to user behavior (e.g., whether to use a switch to sanitize attribute names)
would not lead to some files having, e.g., title.fr-CA and some files having, e.g., title_fr_CA
would not cause any problems when the files are read into analysis software

Overall, it is a simple solution to a useful and reasonable request for localized versions of metadata.

Dave-Allured commented 1 year ago

@BobSimons, I understand your frustration with this conversation. You are not the only one. Let us just agree to disagree. Thank you for your ideas and detailed replies. From this, starting with nothing, I have learned some essential things about ERDDAP.

I have a couple ideas that I would like to work on. Would you mind pointing me to the sanitizer code for netCDF names? I skimmed the code and saw more than one sanitizer for different purposes. The one for netCDF or attribute names was not obvious to me.

BobSimons commented 1 year ago

Yes. Agree to disagree.

Note: If you try to load a dataset with an invalid attribute name, ERDDAP will throw an exception and not load the dataset.

Currently, sanitization is done by generateDatasetsXml, so the new name is generateDatasetsXml's recommendation, which the administrator can edit before use. All EDD subclasses use EDD.makeReadyToUseAddVariableAttributesForDatasetsXml(), and specifically line 7261 (in the code I have): safeSN = String2.modifyToBeVariableNameSafe(safeSN); That line does the conversion. Note that this method doesn't guarantee that two different input names won't generate the same output name. That is appropriate for generateDatasetsXml but not for on-the-fly sanitizing of names while writing an output file, where you would probably want to ensure (or at least try hard to ensure) that every unique input string leads to a unique output string. For that, see String2.encodeVariableNameSafe(), but that makes ugly names.
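
The difference between the two kinds of sanitizing can be illustrated with a small sketch. The two methods below are hypothetical stand-ins written for this example; they are not the implementations of String2.modifyToBeVariableNameSafe() or String2.encodeVariableNameSafe(), whose exact rules may differ. The sketch also ignores any rule about which characters may start a name.

```java
/** Hypothetical sketch contrasting a lossy sanitizer with a collision-free encoding. */
class NameSanitizerSketch {

    /** Lossy: every character outside [A-Za-z0-9_] becomes '_'. Different inputs can collide. */
    static String lossySanitize(String name) {
        return name.replaceAll("[^A-Za-z0-9_]", "_");
    }

    /**
     * Collision-free: '_' is used as an escape character, so a literal '_' is doubled
     * and any other disallowed character is written as _xHHHH (its UTF-16 code unit in hex).
     */
    static String reversibleEncode(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean alphanumeric =
                (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
            if (alphanumeric) {
                sb.append(c);
            } else if (c == '_') {
                sb.append("__");
            } else {
                sb.append(String.format("_x%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Both inputs map to title_fr_CA under the lossy sanitizer.
        System.out.println(lossySanitize("title.fr-CA"));
        System.out.println(lossySanitize("title_fr_CA"));
        // The encodings stay distinct, at the cost of readability.
        System.out.println(reversibleEncode("title.fr-CA"));
        System.out.println(reversibleEncode("title_fr_CA"));
    }
}
```

The lossy version gives readable names but cannot distinguish title.fr-CA from title_fr_CA, while the encoded version keeps them apart and produces the kind of ugly names mentioned above.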

I hope that helps.

rmendels commented 1 year ago

@Dave-Allured @turnbullerin @BobSimons Is there any chance of having a test netcdf file(s) made with the most extreme case(s) of what is being proposed? I mean think about the uses that would most likely break things and put that in the file, not the easy examples. That would help a lot (not that I really have time for this).

Also, in the CF discussion, there were some comments about restrictions coming from ERDDAP. The restrictions in ERDDAP come from trying to follow certain standards (DAP-2, CF, ACDD), as well as from restrictions that arise from the most likely client software that people use in their workflows; they are not, as has been implied, ones that we have self-imposed.

And finally, I think part of our frustration is that if some of this is made a requirement, rather than merely allowed, then for us to be 100% CF compliant would entail a lot of work on our part (and likely for other programmers as well). Just keeping all libraries up to date and testing those changes is a lot of work, plus we have a long list of new features requested by users.

BobSimons commented 1 year ago

I would phrase some of what Roy said differently: ERDDAP's character limitations are self-imposed by ERDDAP, but only because similar limitations are imposed by most standards, analysis software, and file types. By choosing a limited character set in ERDDAP, we know that there won't be problems in almost any analysis software or file type (an exception is old Matlab files being limited to 32 character identifiers and some others being limited to 255 characters). The proposal to allow e.g., .-[] breaks this system which has worked so well. And the subsequent need for a switch to sometimes sanitize the names puts way too much burden on users ("when do I need to sanitize the names???").

As for the sample file, I don't want to pursue this and Roy isn't actually offering to do the work to pursue this. Besides, the whole point of the current system is: we can be confident ahead of time that it will work in all analysis software, even ones we don't know about or that haven't been created yet.

As for the effort to change ERDDAP: yes, it's an effort to change ERDDAP, but that's not why I'm against this. This would probably not be a big project. I just don't think it's a good idea.

Dave-Allured commented 1 year ago

@turnbullerin said:

... For example, downloading the data in MATLAB basically exports the data using MATLAB variables ...

Erin, variable names are more problematical than attribute names. Is it okay if we confine this discussion to localization of attribute names only? I recall that your original proposal says "metadata". To me this means attributes, not necessarily data variables.

turnbullerin commented 1 year ago

@Dave-Allured Just to clarify, I want to make the separation of topics clear here:

Topic 1 is localizing the content of variables and attributes. This would generally work by introducing a suffix for each locale that is used to distinguish it from the content in the "default" locale. For example, providing an English and French title or both English and French descriptions of current weather conditions. This is the focus of what I've been working on with the CF folks and does not require ERDDAP to expand its allowed character set. It could be storing it in attributes named title_en and title_fr for title or variables named weather_en and weather_fr. It is NOT part of Topic 1 that the names themselves be localized, only the content.

The only ask I have for ERDDAP in terms of Topic 1 is localizing the metadata presented on the HTML web pages (e.g. title, description, licensing, long names, etc.). This only requires changing how a few pages are rendered: selecting the best available locale based on the language selection, and then selecting the proper attribute for title, description, long_name, license and others based on whatever the CF team decides as a convention for localization (this continues the tradition of ERDDAP aligning with CF and ACDD in terms of how people specify attributes in the XML file).

Independently of my request, the CF team is discussing Topic 2 - expanding the allowed character list from the current restricted set to the full Unicode set. There are a few drivers there, such as being able to properly represent chemical species names. When introducing Topic 1 to them, they suggested it might be useful to use something other than the underscore to separate the locale from the actual name to make it clear that it isn't part of the name. However, this requires introducing new characters to the allowed character list with the main proposal being adding the period and hyphen (e.g. .lang-COUNTRY).

Personally, I'm not a fan of entangling Topic 1 and Topic 2. I've been pushing for a solution to Topic 1 that does not require expanding the character set (the leading proposal is letting users fully specify the suffix in the global attribute and so they can match whatever character restrictions are required by their solution and avoid confusion). This would let us focus on implementing it in ERDDAP without worrying about the Unicode extension.

Dave-Allured commented 1 year ago

@turnbullerin, I was careless in phrasing that question. I agree, "full Unicode" in netCDF names is not part of this discussion. I was referring only to the choice of syntax for modified attribute names, such as title_fr vs. title.fr-CA. Is it okay with you if we avoid modified variable names, and stick to attribute names only?

Dave-Allured commented 1 year ago

Is there any chance of having a test netcdf file(s) made with the most extreme case(s) of what is being proposed?

@rmendels, extreme case demo may be premature before a core syntax for language tags is decided. Here is a simple test file for my recommended syntax. There are many other schemes under consideration; see previous discussions. Simple files for other schemes may easily be constructed.

netcdf file6 {
variables:
    int var6 ;
        var6:title.en-US = "English Title" ;
        var6:title.fr-CA = "Titre français" ;
        var6:title.ja = "タイトル" ;
}

This is CDL, of course, for direct display. Make your own local netCDF copy with ncgen -o file6.nc cdl6. BTW, it should not matter if you make netcdf3 or netcdf4.

Dave-Allured commented 1 year ago

I like Bob's idea above, to include localization directly within ERDDAP, such that existing ERDDAP web applications will benefit with few simple changes, or maybe even none. Add localization in a single place, not hundreds. Without having studied the code, I also believe that this internal approach would work well for any one of the syntax options being considered, including my recommended .lang-country.

turnbullerin commented 5 months ago

To follow-up on this, I've put together a draft update to the CF conventions to support localized metadata here:

https://github.com/cf-convention/discuss/issues/244#issuecomment-2211058927

The draft would allow the creator of the NetCDF file to determine the localized attribute names by way of the localizations global attribute that maps attribute suffixes to IETF language tags.

For the NetCDF files output by ERDDAP, I would expect this to need little to no changes as long as the suffixes in the original NetCDF file (or the datasets.xml configuration for a non-NetCDF dataset) follow the conventions for naming attributes in ERDDAP.

For ERDDAP to leverage the localized metadata in the HTML pages associated with a dataset (such as in the tabledap/griddap list page, the dataset page itself, the build-a-graph, etc.), it would then require comparing the user's locale to the available locales for the dataset, selecting the most appropriate ones, then checking if content exists for the attribute being output. I would propose this only for places where it makes sense - the list of global attributes does not, the title of the dataset in the list of datasets makes more sense.
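
A rough sketch of that selection step, using the RFC 4647 lookup that ships with the Java standard library (Locale.LanguageRange.parse and Locale.lookupTag). Everything else here is an assumption for illustration: the class DatasetLocaleMatcher is hypothetical, and the localizations attribute is assumed to have already been parsed into a map from attribute suffix to language tag, since the exact encoding is still a draft.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not ERDDAP code. */
class DatasetLocaleMatcher {

    /**
     * Returns the attribute suffix for the best matching dataset locale,
     * or "" when none matches and the default attributes should be used.
     * userLanguage must be a non-empty, well-formed language range such as "fr-CA".
     */
    static String chooseSuffix(String userLanguage, Map<String, String> suffixToTag) {
        // Invert the map and lowercase the tags so matching is case-insensitive.
        Map<String, String> tagToSuffix = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : suffixToTag.entrySet()) {
            tagToSuffix.put(e.getValue().toLowerCase(Locale.ROOT), e.getKey());
        }
        List<Locale.LanguageRange> wanted = Locale.LanguageRange.parse(userLanguage);
        String best = Locale.lookupTag(wanted, tagToSuffix.keySet());
        return best == null ? "" : tagToSuffix.get(best.toLowerCase(Locale.ROOT));
    }

    /** Uses the localized attribute if content exists for it, otherwise the default one. */
    static String localizedValue(Map<String, String> attributes, String name, String suffix) {
        String value = attributes.get(name + suffix);
        return value != null ? value : attributes.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> localizations = Map.of("_fr", "fr", "_en", "en");
        Map<String, String> attrs = Map.of("title", "Ocean currents", "title_fr", "Courants océaniques");
        String suffix = chooseSuffix("fr-CA", localizations);
        System.out.println(localizedValue(attrs, "title", suffix));
    }
}
```

Falling back to the unsuffixed attribute whenever no locale matches, or the matched locale has no content for that attribute, keeps pages for datasets without localized metadata unchanged.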

I'd note that this would also enable ERDDAP to meet WCAG Success Criterion 3.1.2 Language of Parts (a Level AA requirement, which is therefore part of Section 508 compliance as well as the requirements for the Canadian government), which it currently does not when the dataset content is not in the language of the page. By adding the lang attribute with the appropriate language tag to the content that is used (even if it is just the default content), 3.1.2 is met. Once that change is in place, compliance would be a matter of adding the correct attribute to datasets.
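
As a sketch of what tagging the content could look like when a page is rendered: the helper below wraps dataset text in a span carrying a lang attribute whenever its language differs from the page language. The class LangMarkup and its methods are hypothetical and written only for this example; ERDDAP has its own HTML-writing utilities, which are not shown here.

```java
/** Hypothetical sketch of marking the language of dataset content in HTML output. */
class LangMarkup {

    /** Minimal HTML escaping for text and attribute values. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /**
     * Returns the content ready for output. If the content language differs from
     * the page language, it is wrapped in a span with a lang attribute so that
     * assistive technology reads it with the correct pronunciation rules.
     */
    static String wrap(String content, String contentLanguage, String pageLanguage) {
        String escaped = escapeHtml(content);
        if (contentLanguage.equalsIgnoreCase(pageLanguage)) {
            return escaped; // already covered by the lang attribute of the page itself
        }
        return "<span lang=\"" + escapeHtml(contentLanguage) + "\">" + escaped + "</span>";
    }

    public static void main(String[] args) {
        // English default title shown on a French page.
        System.out.println(wrap("Sea Surface Temperature", "en", "fr"));
    }
}
```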

ChrisJohnNOAA commented 5 months ago

At a quick glance I think that sounds reasonable. There's always the possibility actual implementation will be more complicated than expected though and I haven't done a deep dive into all of the related areas of code to check.

Should we retitle this issue to focus on localized metadata?

rmendels commented 5 months ago

@turnbullerin @ChrisJohnNOAA

Hi Erin:

First, thanks for your work on all of this; I am on the CF list also. I realize how important this is to your work in Canada, so take any of my comments in that light.

I have only skimmed the proposals (both here and at CF), but all along concerns Bob and I have expressed have not only to do with implementation in ERDDAP, but the effect on workflows that people have established in a variety of languages. My gut reaction, though perhaps wrong, is this could break some clients. If implemented, I would certainly want it set up in a test ERDDAP to see before such changes are made general (in other words this is not the sort of thing tested by the ERDDAP development tests). Also, I may be dense (I am old and slow), but I don't see how, if a person is coming in from Canada say, we determine if it should be in French or English based on their location. Also a lot of requests these days come in through centralized things like Cloudflare or Akamai (or AWS, Azure or Google Cloud), which do not show the true location of the client. As I said, these are just initial reactions.

Second, in most cases I leave it up to Chris to allocate his time and decide what projects to work on, and similarly here. But remember Chris is only part time. The surest way to get something like this into ERDDAP is to implement the changes and make a pull request. Chris, working with contributions from several others (and many thanks to those people), has made it even easier to set up a development environment using Jetty, and testing has been much simplified. As an example of how contributing really helps not just us but all users: people wanted improved logging and for the results to be visible in ERDDAP (see https://github.com/ERDDAP/erddap/discussions/162#discussioncomment-9969576). Several of the people who wanted this contributed code to move it along (which again is much appreciated). Something to consider.

turnbullerin commented 5 months ago

@rmendels Thanks for your feedback! As a former Java developer, it's something I or the other people on my team could look at helping implement.

In terms of implementation, I am more thinking it should be aligned with the language picker on ERDDAP's HTML pages. It's actually a significant web accessibility issue here at the moment that might be a showstopper for ERDDAP within the Government of Canada that, when I change the language to French, the metadata for the dataset is all in English still but that content is tagged as being in French by the HTML code (which means a screenreader will try to read it as if it was French... which it is not). That's the AA WCAG violation I mentioned, which I think NOAA would have to follow as well. Fixing that requires knowing the language of the content, since if I provided it in French, then it's actually fine (but it will be wrong on the English page and all the other languages that the ERDDAP web interface supports).

For downloading a file or delivering via the DAP protocol, I would not expect ERDDAP to translate the attributes, only deliver them as is. This is more to support people who are browsing an ERDDAP server for data or using the data access form and would like to do so in their native language (it also lets data catalogues provide multilingual metadata from the ERDDAP record).

That said, if you wanted to support automatically picking the proper language for ERDDAP based on the language of the user, you'd want to look at the Accept-Language header - this is a standard HTTP header sent by browsers with the language choices of the user (it is why BCP 47 was written originally and why we picked it, so that it is compatible with how languages are typically handled in web applications) and it should pass through any proxy or cloud servers (at least it does on my other cloud applications). This is not a necessary feature in my opinion though - I actually just provide a link to the English ERDDAP pages from my English pages and to the French version from my French pages and that works. The necessary part is addressing the WCAG issue and being able to provide French users with French content about the dataset on the HTML pages about that dataset.