Closed: dmk42 closed this issue 5 years ago.
I would prefer that we continue to allow multiple encodings. Although UTF-8 is dominant, the additional flexibility avoids forcing the majority case on everyone, and does not really cost us anything.
I do believe that it is sufficient to select the character set once at program start-up, though. Selecting the character set on a per-string basis adds development cost and substantial complexity for little benefit.
I think we should only support UTF-8 strings. I don't think it should be possible to turn off UTF-8 support or to use C environment variables to select a different character set. I think if other character sets are desired, a Chapel programmer should use libraries (like iconv) to encode differently.
I don't feel strongly about supporting different multibyte string representations. If we do choose to continue to go that route, I would be uneasy with having this representation determined on a per program basis but would be willing to wait until such a feature is requested by an actual user.
8% of web usage seems too large to ignore, especially if it continues not to require much effort on our part. I think we should continue to support multiple encodings. But only supporting one encoding per program execution and selecting that when the program starts seems reasonable.
I always worry when aspects of program behavior change based on things that are not visible within the source text. This makes me nervous about having the string encoding be inferred from the user's environment at program start-up. But I also admit that changing character locales is something that I've barely understood or have had to wrestle with in my lifetime (and when I have had to, it's made me miserable). My intuition is that if I sent you a Chapel program that worked for me in my UTF-8 environment, it might not work for you if your character locale is different. And that this would make one or both of us miserable rather than happy. Is that incorrect? And/or in what use cases / scenarios would we be happy that it was sensitive to the current character locale rather than unhappy?
[Implicit in my response here is an assumption that the main role of string encodings in Chapel is less about console I/O and more about consumption and generation of string data in files, on the web, or wherever it lives as natively as possible]
I'd be comfortable if Chapel were UTF-8-only for the time being (I realize that @dmk42's saying that it's more than that at present), but my ideal language would definitely support the ability to request encodings on a per-string basis. For example, imagine:
var myString: string; // defaults to UTF-8 string
var bytestring: string(encoding.bytes);
var asciistring: string(encoding.ascii);
var utf8string: string(encoding.utf8);
var ucs4string: string(encoding.ucs4);
For that reason, it's more important to me that we take an approach that would not preclude multiple encodings in the future than it is to support multiple encodings at present. (That said, I think the CLBG benchmarks could make good use of byte strings if we had them today—have I got that right @benharsh and @ronawho?).
My reason for wanting multiple encodings is partially for the "access the data wherever it lives as natively as possible" principle above, and in part to give users the ability to make precision vs. efficiency tradeoffs ("I'm only using ascii strings, and want all operations to be fast; I'm using unicode, but want to optimize for space rather than speed; I'm using unicode and am willing to sacrifice space to get speed").
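To make those tradeoffs concrete, here is a minimal sketch using today's UTF-8-only string (method names like numBytes and size are as in recent Chapel releases): byte count and character count diverge as soon as non-ASCII characters appear, which is exactly what makes an ascii-only or fixed-width representation attractive for some workloads.
var s = "naïve";              // the 'ï' occupies two bytes in UTF-8
writeln(s.numBytes);          // 6 -- bytes of storage
writeln(s.size);              // 5 -- codepoints, so indexing can't be a plain byte offset
for c in s do write(c, " ");  // iteration yields one codepoint at a time
writeln();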
Related, I've spoken with users who have expressed interest in Chapel's string type being extensible such that by implementing a set of core routines, additional encodings could be added without significant compiler / language / runtime changes. I think of this as being like a domain map for strings, separating implementation details from high-level operations. Whether or not you believe in the user extensibility theme, it also suggests a desire for multiple encodings.
I always worry when aspects of program behavior change based on things that are not visible within the source text. This makes me nervous about having the string encoding be inferred from the user's environment at program start-up. But I also admit that changing character locales is something that I've barely understood or have had to wrestle with in my lifetime (and when I have had to, it's made me miserable). My intuition is that if I sent you a Chapel program that worked for me in my UTF-8 environment, it might not work for you if your character locale is different. And that this would make one or both of us miserable rather than happy.
+1
In a sense, this issue is a red herring. It is not really a language issue. It is how software in general works on POSIX. Imagine if you could not get your favorite POSIX application to support your character set because it happened to be written in a language that explicitly ignored the character locale environment variables.
I don't see it as a good idea for any language to ignore the POSIX convention for selecting character locales, because it then makes users scramble to learn all the different ways that they can (and can't) select what they need.
Here are some interesting resources:
This one advocates for UTF-8 everywhere: http://utf8everywhere.org/
This one discusses whether bytes and UTF-8 need separate types (among other things) - they are the same in PHP:
http://kunststube.net/encoding/
This one describes unicode support in many languages:
https://unicodebook.readthedocs.io/programming_languages.html
I think one facet of this question is whether or not a string-like type storing non-character data (e.g. bytes) should exist or if such data can just be stored in a string.
Different languages do different things here. The reference I've recently been appreciating, though, would request that we have a different type for non-textual data so that string is always UTF-8. http://utf8everywhere.org/#faq.liberal
This online resource directly addresses this question:
https://www.cl.cam.ac.uk/~mgk25/unicode.html#activate
How should the UTF-8 mode be activated? If your application is soft converted and does not use the standard locale-dependent C multibyte routines ... to convert everything into wchar_t for processing, then it might have to find out in some way, whether it is supposed to assume that the text data it handles is in some 8-bit encoding (like ISO 8859-1, where 1 byte = 1 character) or UTF-8. Once everyone uses only UTF-8, you can just make it the default, but until then both the classical 8-bit sets and UTF-8 may still have to be supported.
I think the question here is, are we close enough to "Once everyone uses UTF-8" that Chapel can just work with UTF-8 and ignore the POSIX LOCALE.
I don't think 8% is low enough for us to claim UTF-8 has taken over. There was a time when roughly that percentage was non-Windows, and we certainly haven't lost all non-Windows OSes.
I deliberately chose the web for my example because it has the lowest percentage of non-UTF-8 usage. Outside of web sites, there is less uniformity. Also, if you happen to be a user who cannot currently move off of ISO 8859-15 Latin characters, or the Shift-JIS Japanese character set, for a couple of common examples, it is 100% for you.
@dmk42 - Chapel is not for creating web services or Windows GUI applications. It only runs on linux or linux-like OSes. I think a more relevant question than the number of web pages that are UTF-8 is - how many OSes targeted by Chapel do not have sufficient UTF-8 support to use a UTF-8 locale?
But, once we know the answer to that question, there is naturally a second question. Do we want Chapel and UTF-8 in Chapel to work on a linux-like OS even if it lacks proper UTF-8 support? Or, do we want to say something like "UTF-8 support in Chapel cannot be relied upon because it depends on the end user environment"?
I'd like Chapel's UTF-8 support to be something that Chapel programmers can rely on, no matter where their programs run. This is the reason that I don't think we should enable/disable UTF-8 support based on the environment. I think it would be reasonable to simply emit an error/warning if the C locale was not compatible with UTF-8. Further, I believe OS support for UTF-8 is sufficiently good that we could simply declare that running Chapel with a non-UTF-8 locale is an "unsupported configuration".
Chapel is not for web services, and it is web services that have the highest UTF-8 usage.
I disagree that the question is about whether or not there is OS support. It is about what character sets users use.
@dmk42 - What does a "user using a character set" mean? Does it mean setting LOCALE environment variables? Or is it what character set the terminal supports? Or does it mean having files that they wish to process in that character set? Or something else?
To say that a given user only "uses" one character set is overly simplistic. Similarly I think that the LOCALE system for setting environment variables is overly simplistic. What if you want to process 10 files and one is in Shift-JIS and the rest are in UTF-16? The LOCALE environment variable is not that helpful in this setting, because you couldn't easily handle the different encodings within one program. However I don't think that the Chapel standard library needs to worry about multiple character set encodings. It simply needs to use a consistent encoding so that other modules or other programs can be created or used to translate between Chapel's chosen encoding and other encodings.
But, in the context of Chapel programs, the closest thing I've written to a program (that wasn't just a test) that genuinely cared about character sets is the program performing graph analysis on Twitter data. That data set happens to use UTF-8. If it were using Big-5 or Shift-JIS, I'd expect that my approach would be to convert the input data to a consistent encoding as a preprocessing step, using existing tools (like iconv) that are tested and documented. I see no reason why that consistent encoding can't be UTF-8 for Chapel.
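For example (the file names here are hypothetical), the standard iconv command-line tool handles that preprocessing step in one line, after which the Chapel program only ever sees UTF-8 data:
iconv -f SHIFT_JIS -t UTF-8 tweets.sjis.txt > tweets.utf8.txt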
Here's another angle on my viewpoint for you. Suppose that we "support" non-UTF-8 settings of the LOCALE environment variables. How do we test that to be sure it actually functions? How many LOCALE environment variable settings do we need to test in order to make sure the I/O code functions correctly?
Lastly, there is the matter of string literals. I think that Chapel programs should always be in UTF-8 (because it makes sense to choose just one encoding for source files). That implies that string literals will be in UTF-8. But then, what would happen if the LOCALE environment variable is followed and it's incompatible with UTF-8? We can even ignore the terminal and just imagine the Chapel program is outputting string literals to a file. It seems to me that either 1) the literal's UTF-8 bytes would be written out as-is, appearing garbled in the user's character set, or 2) the implementation would have to behave like iconv and know how to translate between arbitrary character encodings. I don't like either of these options - 1 would be surprising to users, I think, and 2 seems too much work on the implementation side.
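As a minimal sketch of the scenario just described (assume the program's output is redirected to a file; the code is only meant to pose the question, not answer it):
use IO;
// The source file is UTF-8, so this literal is a particular sequence of
// UTF-8 bytes; the question is which bytes should reach the file when the
// LOCALE names an incompatible character set.
stdout.write("Héllo, monde");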
To try and move this conversation forward, I'm intending to summarize the disagreement as I understand it for those who may not have been following (like myself). I timed out before getting to this today, though.
Answering my own questions to the best of my ability for those who were also wondering about them:
My intuition is that if I sent you a Chapel program that worked for me in my UTF-8 environment, it might not work for you if your character locale is different. And that this would make one or both of us miserable rather than happy. Is that incorrect?
As I understand it, this is accurate. If a Chapel program was written to output a UTF-8 string to the console (say), and a user were to run it from a terminal session whose character locale isn't compatible with UTF-8, the bytes that would be emitted by the program will be interpreted differently by their terminal and my intended "Hello, world!" message would render completely arbitrarily and likely unintelligibly for them.
And/or in what use cases / scenarios would we be happy that it was sensitive to the current character locale rather than unhappy?
I think this is such an example, assuming I'm running it from a terminal whose locale is not UTF-8 compatible:
use IO;

var s: string;
stdin.read(s);
s = s.toUpper();   // toUpper() returns a new string rather than modifying s
stdout.write(s);
If Chapel assumed that all strings were UTF-8, the toUpper() routine would be operating on the string s, interpreting its bytes as UTF-8 characters and doing the proper thing for them. But if the string type and routines all followed the terminal's locale, then the string we read in from the console (in whatever encoding) would be converted to uppercase in that same encoding (because the string operations would all follow the current locale's encoding and "do the right thing"), and then it would be printed out to the console in the same encoding, permitting the terminal to render it properly.
So the summary is that if Chapel were UTF-8 only, it would preclude users whose locales used incompatible encodings from writing programs that interacted with their terminal or other resources they had (files or web resources, say) that were not UTF-8.
Obvious symmetric follow-up Q that I didn't ask (well, I guess my first question almost did): In what cases would we be unhappy if the strings did follow the user's locale?
If a Chapel program was written that did not interact with the console at all, but was interacting with files or web resources that were in UTF-8 (say), then the strings that it reads from/writes to that resource would needlessly and incorrectly be assumed to be in the console's character encoding, not UTF-8. As a result, they could end up garbled, making the program behave incorrectly. Put another way, if I (Brad) wrote such a program in my UTF-8 environment and it operated on global UTF-8 resources and then you ran it in your UTF-8-incompatible character locale, you'd be unhappy. @dmk42 tells me that the typical way to handle such cases is to fire up a new terminal that is UTF-8-compatible for the purposes of running the program (i.e., abandon your default locale for the sake of running the program)
So the summary here is that if Chapel followed the encoding implied by the user's character locale, it would preclude them from interacting with resources in other encodings unless they were to fire up a terminal in that locale.
Assuming I've got all that right (and I'm not confident about that), at this point, my intuition continues to be that this suggests that the ideal language would support multiple encodings within a single program so that it could interact with the user's terminal in a natural way while also accessing external resources in whatever encoding they're in without being forced to convert between them.
It also makes me think that channels should have an encoding associated with them as well, where stdin/stdout would most naturally default to the one implied by the user's character locale; but others could be encoded in the way that made the most sense for the resource they are interacting with. In this way, I could mix "native character encoding" channels for stdin/stdout with UTF-8 (or whatever) encoding channels for those files / web resources that fell in the 92%. Then, presumably if I read/wrote a string in one encoding to a channel with a distinct and incompatible encoding, I'd either get an error and need to explicitly convert it to that channel's encoding to make it legal, or perhaps it would happen automatically in the language as a coercion.
The obvious problem with this proposal is that it's (presumably) a lot of work to implement and challenging to test / feel confident that it's correct. This is why (as stated above) it's more important to me that we take an approach that would admit multiple encodings in the future (i.e., wouldn't preclude them) rather than worry about it now.
So arguably the @mppf - @dmk42 debate could be viewed as "We both agree that Brad's approach is completely intractable at present (maybe ever), so assuming that we can only support a single encoding in a program execution, which single encoding should it be: UTF-8-only (Michael's stance) or 'whatever the current locale implies it should be' (David's)?"
I'll try to summarize those perspectives in the next comment to break this up a bit. But I'm timing out for now.
Here, I'm going to summarize (to the best of my ability) my understanding of the David and Michael positions on this subject for those who haven't kept up with the conversation.
[@mppf / @dmk42: Obviously, please let me know if I've gotten anything wrong or misunderstood (or missed) any of your major points above]
Each of them is arguing that Chapel strings should support just a single encoding. At a high level, Michael thinks that the encoding should be UTF-8 while David believes it should be the one implied by the POSIX locale setting in the environment from which the Chapel program was run.
Points of commonality:
David's position:
Michael's position:
David's concerns with Michael's position:
Michael's concerns with David's position:
Users running a given Chapel program from distinct POSIX locales could see surprising and inexplicable differences in behavior ("This program worked for me, what do you mean it isn't working for you?"). By contrast, if Chapel were UTF-8-only, it could head off surprises by issuing a warning/error when the program was run from an incompatible locale.
Elaborating on this one a little bit, imagine writing in Chapel an ID3 tag parser for reading information from MP3 files. The ID3 format includes textual data and supports 4 character encodings. See https://en.wikipedia.org/wiki/ID3#ID3v2 ... Let's suppose that the ID3 library would have a routine to return a song title. It's not necessarily the right answer but at least on the surface it seems that it would be natural to return the song title as a string. This could proceed in one of 3 ways (a rough sketch of the first two follows the discussion below):
1) return the title in whatever encoding the ID3 tag happens to use and leave any conversion to the caller
2) have the library convert the title to one fixed encoding (UTF-8) and return it as a string in that encoding
3) have the library convert the title to the encoding implied by the current POSIX locale
Option 1 seems problematic because we'd like libraries to reduce the complexity of the programming task rather than increase it... chances are users of the ID3 library are less well equipped than the library author to handle translating between character sets.
Option 2 wouldn't make much sense if Chapel assumes strings match the current POSIX locale and that varies in normal practice. It is reasonable in the "UTF-8 only" design.
Option 3 is problematic because the author of the ID3 library would have to translate from the ID3 format to an arbitrary number of other character encodings. It seems that something like iconv would have to be available within Chapel for this option to make sense.
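To make the first two options concrete, here is a hedged sketch of what the library's signatures might look like (all names here are hypothetical, and it assumes a bytes type for non-textual data):
enum id3Encoding { iso8859_1, utf16, utf16be, utf8 }

// Option 1: return the raw tag data plus its declared encoding and leave
// any conversion to the caller.
proc songTitleRaw(): (bytes, id3Encoding) {
  return (b"\x54\x69\x74\x6c\x65", id3Encoding.iso8859_1);  // placeholder data
}

// Option 2: the library converts whatever the tag stores into the program's
// single encoding (UTF-8) and returns an ordinary string.
proc songTitle(): string {
  return "Title";  // placeholder data
}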
@dmk42 and I had a discussion about this topic and we might find middle ground around declaring that UTF-8 is "the normal way" for Chapel programs to run. But each of us would have a different desire for exactly how something along those lines would need to be worded.
How many C / POSIX LOCALE character set encodings are there, actually, in practice?
It depends on how you count. On my Ubuntu system, ls /usr/share/i18n/charmaps/ | wc shows 233 character maps that the system can operate with. Of these, only 31 show up in /usr/share/i18n/SUPPORTED.
Since my last summary, I've been chewing on this overall issue to figure out how to break the impasse and move it forward. I started from the following question (which perhaps ought to be its own issue): "What encoding should a Chapel string literal use?"
In coming here to post it and my resulting thoughts:
As a result of this last assumption, I will focus on the handling of a string that I, as the author of some Chapel code, intend to be UTF-8. Let's say I wrote the following program in a UTF-8 compatible environment (e.g., terminal and terminal-based editor):
var s = "Héllo, monde"; // French for "Hello world, of course!
handleString(s);
As the author of the program, I believe that I want the string's logical value to be "Héllo, monde" regardless of the character locale that another person uses to compile and run the code. For example, if handleString() sent the string to some online website that accepted international messages of greeting in UTF-8, I wouldn't want the program to send a different interpretation of the message that I put into the source code to the website, even if someone compiled and ran it from within another locale (i.e., if the program doesn't interact with the console, the console's preferred character encoding shouldn't matter to how it behaves, in my opinion).
This suggests to me that some options for ensuring that this happens are:
1) Have Chapel string literals in the source always be interpreted as UTF-8
1a) Same as (1), but also optionally add a means of tagging a given string literal in order to have it be interpreted using another encoding (?)
2) Have some way of encoding the character set in which the code was developed into the code file itself (where I'd argue that the default, when unspecified, should be UTF-8)
2b) Same as (2), and arguably, perhaps good editors would insert the necessary encoding tags into the source file when running in "Chapel mode" and knowing that the hosting character locale was not UTF-8 / not UTF-8-compatible
3) Something else I haven't thought of?
[For what it's worth, I like 1 and maybe 1a the best of the above, reflecting a UTF-8 bias].
In any case, let's assume that Chapel has some way in which I can ensure that "Héllo, monde" is encoded as UTF-8 in my program above. If this is the case then for the sake of compiling the program, it shouldn't matter what host locale I'm working with—no matter what, I want to have some way to assert that the correct string above is "Héllo, monde"(utf8).
Next, let's consider the case when handleString() just does writeln(s) (writes the string to the console). This is where things start to get trickier because the console's character locale may or may not be able to print UTF-8 strings sensibly. Let's consider how this code would work in the various proposals:
Brad's "multiple encoding dreamworld": stdout
would be parameterized by the console's native character locale. The compiler would determine whether or not that locale can accept UTF-8 strings. If it can, the string is printed out and appears as expected. If it can't, the compiler checks to see whether it knows how to convert from utf8 to the console's encoding. If it doesn't, it emits an error. If it does know how to convert between them, it inserts a coercion to do so. If the conversion is successful, the result string in the console's native encoding is printed as governed by the conversion process, and the program works as intended (or close enough... maybe in an ASCII-only terminal, the é
would get printed as merely e
or perhaps e'
. If the conversion can't be done (say the console encoding doesn't have an equivalent of é
or the contributor who implemented the conversion failed to handle this case), an error would occur / get thrown. (Note that in this description, I've been vague about whether these errors are execution-time or compile-time. I touch a bit on this in the next paragraph).
[Serious question: I refer to this as a "dreamworld" because it seems the most logically consistent and sane to me, but nobody's raised any actual objections to it, maybe out of politeness or mercy. There's obviously a nontrivial implementation effort to achieve it, but is the model itself broken in some fundamental way that I'm not picking up on?]
Michael's "UTF-8-only draconian chainworld": If the console couldn't support UTF-8 strings, Chapel would detect this and refuse to either compile or run the program (giving a message like
Chapel's stdout.write() is only supported for UTF-8-compatible character locales
), depending on when the decision is made (i.e., is CHPL_HOST_CHARACTER_LOCALE
something inferred/set/known at compile-time or is it detected and reacted to at execution time? If the former, I'm guessing we'd generate an error at load-time if the execution-time environment didn't match the compile-time assumptions. If the latter then presumably we could only generate errors at execution-time).
David's "POSIX-focused garbleworld" (where I still may not have this right): Continuing to assume that I've somehow bound the string to be UTF-8 in my source, if I print it to a console that is compatible with UTF-8 strings, it will appear as expected. If I print it to a console that isn't, the raw bytes that represent my string in UTF-8 will be sent to the console and interpreted/printed there however that console's encoding renders those bytes. (I believe that this same rendering is how the string would look within the text of my program if one were to open the source in a terminal-based editor from that same locale, right?)
Then I've gone on to think about the opposite direction: Say we read in a string from the console:
var myString = stdin.read(string);
What would the encoding of myString be for a console that doesn't use a UTF-8 encoding?
Brad's dreamworld: One of two things would happen depending on how much we wanted to use UTF-8 as a lingua franca in Chapel:
1) If we wanted to use UTF-8 most of the time, the reverse of the console writing case above would occur: We'd read in the string using its native encoding, then attempt to convert it to UTF-8 (generating an error if that conversion was unsupported, or if it could not handle the particular string it was being asked to convert).
2) If we wanted to have encodings propagate in a natural way, the type of myString would be string(encoding=xyz) where xyz is the encoding used by the terminal.
My personal preference here is option 2. (Note that if I'd typed var myString: string = stdin.read(string);, that might be a way to force the behavior in option 1, assuming that string is a concrete shorthand for string(encoding=utf8) rather than string(?).)
Michael's chainworld: As in the previous example, the program would generate an error upon being compiled or run: "Chapel can't handle reading strings from your non-UTF-8 console... get with the modern times [link to UTF-8 everywhere website provided here]."
David's garbleworld: The program would read in the string and store the bytes that resulted from the read in myString. As a result, myString itself effectively has no encoding, so if utf8 functions are called on it, the bytes will be interpreted as utf8; if ascii functions are called on it, the bytes will be interpreted as ascii; etc. Ultimately, the string will be stored/printed somewhere (stdout, the web, a file), which will result in those bytes being sent to that location and being interpreted however the location naturally interprets them. If the user wanted to convert the bytes into utf8-compatible bytes, they'd presumably have to call the appropriate helper function to do so.
Assuming I haven't made too many terrible mistakes, my summary after this thought process is:
Are other modern languages that we tend to compare with POSIX-compliant, and if not has this hurt them?
@bradcray - Michael and I had worked out a compromise that goes in a different direction. This is a placeholder for a response to reconcile with that. It may take a few days for me to pull all the information together to respond properly.
UTF-8-only draconian chainworld
🤣 what's not to like about dragons?
[Serious question: I refer to this as a "dreamworld" because it seems the most logically consistent and sane to me, but nobody's raised any actual objections to it, maybe out of politeness or mercy. There's obviously a nontrivial implementation effort to achieve it, but is the model itself broken in some fundamental way that I'm not picking up on?]
I think my main concern with it (based on some of the description I haven't quoted here) is that there is always a possibility for loss of accuracy (added garbles) when converting between character sets (e.g. perhaps é -> e from UTF-8 to ASCII, as you pointed out). Since this conversion isn't lossless, it seems it should be explicit rather than implicit.
Also, I wanted to point out another possible design point somewhere between the chainworld and the POSIX-focused world. Instead of being a runtime error if the C LOCALE/character set wasn't UTF-8, the program would just emit a warning (and there would be a flag to disable this warning). After that, the garbleworld behavior would apply. (And another compromise point here is where we only document that the UTF-8 LOCALE/character set works and leave out the warning entirely).
Are other modern languages that we tend to compare with POSIX-compliant, and if not has this hurt them?
@dmk42 - perhaps you could look at answering this question? (Maybe you already know the answer? Try some experiments?) Thanks!
Since this conversion isn't lossless, it seems it should be explicit rather than implicit.
I think it'd be fine to require a cast rather than rely on a coercion.
This is resolved by the decision to allow other multibyte encodings for now. Someday, only UTF-8 will matter.
issue #12726 is a follow-on to this one, arguably
The current situation is that strings are always UTF-8 encoded.
UTF-8 has earned its position as the primary multibyte representation on the Internet. Clearly, that needs to be supported. Should we support other multibyte encodings as well?
Definition
Internationalization uses the term "locale" to mean the character-set conventions (including language and country) that pertain to the program's current idea of the meaning of characters. To avoid confusion with Chapel locales, and because we are primarily concerned with the character set, I will call this a "character locale" instead.
Current status
We do support multiple encodings. The character locale is chosen by setting the environment variables (for example, LANG=en_US.UTF-8). Multibyte string processing handles characters in the specified character set.
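For example (the program name is arbitrary, and both locales must be installed on the system), the same binary behaves differently depending on the environment it is launched from:
LANG=en_US.UTF-8 ./myChapelProgram       # multibyte routines treat data as UTF-8
LANG=en_US.ISO-8859-1 ./myChapelProgram  # same binary, Latin-1 interpretation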
Advantages
Because our multibyte support is based on the standard C runtime library, we get support of multiple encodings essentially for free. This means we can allow lesser-used encodings without incurring a significant burden of support. Although none of the other encodings are as important as UTF-8, they still have pockets of usage where it may be difficult for the affected users to migrate to UTF-8 at this time.
Disadvantages
In addition to using the standard C runtime library, we also have an encoding/decoding fast path that is UTF-8 specific. It detects whether the current character set is UTF-8, and substitutes the fast path for the general path. If we supported only UTF-8, we would only use the encoding/decoding fast path. (However, the fast path provides encoding/decoding functionality only. Even when we are using the fast path, we still currently use the standard C runtime library for the basis of character classification such as isAlpha().)
The use of non-UTF-8 character sets is increasingly rare (reportedly less than 8% of the web).
Secondary question
If we support more than one multibyte encoding, should the encoding be determined on a per-program basis or a per-string basis?
Current status
The multibyte character encoding is determined on a per-program basis by examining the character locale environment variables at program start-up.
Advantages of the current approach
Determining the encoding once at program start-up is faster and simpler because we do not have to keep track of multiple character locales during program execution. Use of non-UTF-8 character sets is increasingly rare, so any use of multiple character sets in the same program would be much more rare. Keeping strings simple is a useful goal. Supporting multiple character locales would require significant development effort.
Disadvantages of the current approach
It would be difficult to write a program that translated from one character set to another if only one of those character sets could be supported by the language at a time.