mobb opened this issue 1 year ago
Thanks for this report, @mobb. While I agree we don't explicitly name a field for this LOCALE information, it is standard practice (e.g., in database systems, operating systems, and MIME types) to include LOCALE info along with the character encoding for text files. For example, here's a table IBM maintains with these values: https://www.ibm.com/docs/en/aix/7.2?topic=globalization-supported-languages-locales The standard syntax is LOCALE.ENCODING, where the LOCALE values come from the standard vocabulary maintained in the Unicode Common Locale Data Repository and the ENCODING values come from standard ISO encoding names; the specific list of supported locale values is maintained on GitHub: https://github.com/unicode-org/cldr/tree/main/common/main. For example, to indicate British, Canadian, and US locales for UTF-8 encoded files, the character encoding would be set to en_GB.UTF-8, en_CA.UTF-8, and en_US.UTF-8, respectively. The locale rules include defaults for many things, including default currency and decimal separators, data ordering, etc. Here's how I think it would be represented in EML:
...
<physical>
  ...
  <characterEncoding>en_GB.UTF-8</characterEncoding>
</physical>
...
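For anyone assembling such a value by hand, here is a minimal sketch (plain Python, no EML tooling assumed) of a sanity check on the encoding half before it goes into the element; the locale half is not validated against CLDR here.

```python
# Minimal sketch: confirm the part after the dot in "locale.encoding" names a
# codec Python recognizes before writing it into <characterEncoding>.
# The locale half (en_GB) is not checked against the CLDR list here.
import codecs

value = "en_GB.UTF-8"
encoding = value.split(".", 1)[1] if "." in value else value
codecs.lookup(encoding)  # raises LookupError if the encoding name is unknown
print(f"{encoding!r} is a recognized encoding name")
```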
We frequently omit characterEncoding from our metadata, but I wish we did a better job of including it, as it is critical info for interpreting a text file.
LTER IMs and EDI are currently updating the EML best practices. If including <characterEncoding> satisfies the OP, should we recommend always using <characterEncoding> in the EML best practices? Could all U.S. LTER sites reasonably assume their text files use en_US.UTF-8? If not, I think we'd want a way to determine the encoding before we recommend that folks always include it.
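On determining the encoding, here is a minimal sketch of one way to guess it before declaring it, using the third-party chardet package. The file name is hypothetical, and detection is heuristic, so the result is a starting point rather than ground truth.

```python
import chardet  # third-party: pip install chardet

# Read a sample of the file's raw bytes; a prefix is usually enough.
with open("site_data.csv", "rb") as f:   # hypothetical file name
    raw = f.read(100_000)

# Returns a guess plus a confidence score, e.g.
# {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
print(chardet.detect(raw))
```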
I think always using <characterEncoding> would be a great addition to the best practices.
Within the US, I think assuming en_US will usually be correct (except in cases like @mobb's, where the data originate elsewhere), but we certainly encounter quite a few character encodings other than UTF-8. We run into ISO-8859-1 (Latin-1) and Windows CP-1252 a lot, and we often run into trouble with encoding mixes where the base will be UTF-8 but stray characters from other encodings have been cut-and-pasted in incorrectly. This has caused us no end of grief (I just got a long rant about this for one of our datasets a few weeks ago), so we've been pretty diligently trying to move everything we do to UTF-8. It's not always easy, though. I'm not sure UTF-8 is a safe assumption, but it's a decent starting point. @jeanetteclark what do you think of adding a MetaDIG check for this to the FAIR suite?
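As an illustration of the mixed-encoding problem described above, here is a minimal sketch (the file name is hypothetical) that reports where strict UTF-8 decoding first fails and what the offending bytes would mean under Windows CP-1252, a frequent culprit.

```python
def find_non_utf8(path):
    """Report the first byte offset where strict UTF-8 decoding fails."""
    data = open(path, "rb").read()
    try:
        data.decode("utf-8")
        print("decodes cleanly as UTF-8")
    except UnicodeDecodeError as err:
        bad = data[err.start:err.start + 4]
        print(f"UTF-8 fails at byte {err.start}: {bad!r} "
              f"(as cp1252: {bad.decode('cp1252', errors='replace')!r})")

find_non_utf8("suspect_table.csv")  # hypothetical file name
```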
@mbjones I looked at a sample list of encodings. I didn't see en_US on there. ASCII and UTF-8 are on there, and those two are also listed as examples in the EML spec. Would we satisfactorily alleviate headaches if we recommend UTF-8 in the EML best practice document we are authoring?
Here's the best practice text I'm proposing.
The physical tree (/eml:eml/dataset/[entity]/physical) further describes the physical format of the data. Within physical, we recommend populating the characterEncoding element if you can determine the encoding. For most U.S. data, UTF-8 is typically correct, with ASCII being another common encoding. If you do provide an encoding, be sure it is correct; for example, do not choose ASCII if your data include extended Latin characters.
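To go with that recommendation, here is a minimal pre-publication check (plain Python; the file name is hypothetical) that the declared value actually decodes the whole file, so ASCII never gets declared for data containing extended Latin characters.

```python
def encoding_is_plausible(path, declared):
    """Return True if the whole file decodes under the declared encoding."""
    try:
        open(path, "rb").read().decode(declared)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

# hypothetical file name
print(encoding_is_plausible("species_list.csv", "ASCII"))
print(encoding_is_plausible("species_list.csv", "UTF-8"))
```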
Hey @twhiteaker -- yeah, that list appears to be a list of character encodings, not locales. Both are needed to properly interpret a file. Most computing systems assume the LOCALE of the local computer applies unless otherwise specified. On Mac and Linux machines, you can often use the locale command to see what your local settings are (which determine how dates, times, text files, etc. will be interpreted). For example, on my Mac:
❯ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Note that the LANG environment variable uses the "locale.encoding" format that I am proposing be the standard in EML, such as en_US.UTF-8. So, if I open a file that uses en_GB.UTF-8, I will likely misinterpret the decimal separators, dates, and times as if they followed the en_US conventions, unless I tell my software that the file is structured differently (e.g., like telling vim to switch language locales).
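To make the locale half concrete, here is a minimal sketch using Python's standard locale module. It assumes the en_US.UTF-8 and de_DE.UTF-8 locales are installed on the system, which is not guaranteed.

```python
import locale

# The same magnitude is written differently under different locales.
locale.setlocale(locale.LC_NUMERIC, "en_US.UTF-8")
print(locale.atof("1,234.56"))   # 1234.56 (comma is the thousands separator)

locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
print(locale.atof("1.234,56"))   # 1234.56 (comma is the decimal separator)
```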
Nevertheless, I think people interpreting the EML characterEncoding should be prepared to see both values like en_US.UTF-8 and bare character encodings like UTF-8; the latter, I think, should be taken to mean "assume the current LOCALE of the computer."
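Consumer-side, a minimal sketch of that interpretation rule (nothing here is an existing EML library; the fallback uses the machine's current locale):

```python
import locale

def interpret_character_encoding(value):
    """Split 'en_GB.UTF-8' into (locale, encoding); a bare 'UTF-8' falls back
    to the current machine locale, per the convention suggested above."""
    if "." in value:
        loc, enc = value.split(".", 1)
    else:
        loc, enc = locale.getlocale()[0], value
    return loc, enc

print(interpret_character_encoding("en_GB.UTF-8"))  # ('en_GB', 'UTF-8')
print(interpret_character_encoding("UTF-8"))        # (machine locale, 'UTF-8')
```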
We have encountered data files from Europe where a comma is used as the decimal separator (rather than a period, which is common in the US). We have not found a place in the EML schema to record this, so this issue records that request. Commas are common in other parts of the world: https://i.redd.it/omgfapht3qn51.png
A comma decimal separator is not always correctly interpreted automatically by packages (e.g., pandas), although most have a mechanism for specifying it in the import statement (e.g., pd.read_csv(file_name, sep=';', decimal=',')). EML metadata can be used to aid importing data tables, and so could populate that statement. Most likely, an optional field named decimalSeparator would suffice. We agree that it would be almost impossible to interpret a table that used commas as both the field separator and the decimal separator without differentiating them somehow, so a likely best practice is to not construct a table that way. We have not explored the effect of using the literalCharacter field, for example '2021-03-28; 20\,27'.
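A slightly fuller sketch of how a proposed decimalSeparator value could feed that pandas call, together with the field delimiter already described in the physical tree (the EML-derived values below are illustrative, not read from a real document):

```python
import io
import pandas as pd

# Values that would come from the EML physical description:
field_delimiter = ";"     # from the existing field delimiter metadata
decimal_separator = ","   # from the proposed decimalSeparator field

european_csv = io.StringIO("date;temp_c\n2021-03-28;20,27\n2021-03-29;18,5\n")
df = pd.read_csv(european_csv, sep=field_delimiter, decimal=decimal_separator)
print(df)
print(df["temp_c"].dtype)  # float64: the comma decimals were parsed as numbers
```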