Closed. migueldeicaza closed this issue 4 years ago.
Do you expect the in-memory representation to be strings of 32-bit objects, or translated on the fly? What about the memory doubling if the former? What's the performance impact if the latter?
Is naming a Unicode-related technology after a particular Unicode-supported script (and a technology to improve astral plane support after a BMP script, at that) a good idea?
I think the proposal (and perhaps it needs to be made more explicit) is that the in-memory representation of strings does not change at all. The `Rune` type merely represents a distinct individual 21-bit code point (stored as a 32-bit int). Methods referring to code points could potentially return a `Rune` instead. Presumably there is some functionality in `string` that would let you enumerate `Rune`s.
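Go, the precedent cited repeatedly in this thread, already works this way: ranging over a string yields code points, not code units. A minimal sketch (Go used purely as a reference model for what rune enumeration looks like):

```go
package main

import "fmt"

func main() {
	// Ranging over a Go string decodes UTF-8 on the fly: i is the byte
	// offset of each rune, and r is the 32-bit code point itself.
	for i, r := range "héllo" {
		fmt.Printf("%d: %c (U+%04X)\n", i, r, r)
	}
	// Note the index jumps from 1 to 3: é occupies two bytes in UTF-8.
}
```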
I think there are a couple of obvious points that we need to get consensus on for something like this:

1. Do we want a distinct `Rune` type rather than using `Int32` as current methods do?
2. What should the type be called?

To answer (1), I think we need a fuller description of how `Rune` would be exposed, what methods would receive and return it, etc., and to determine whether that is better than having those deal with `Int32` instead.

As for (2), I'm a bit hesitant myself. "Rune" is a somewhat esoteric word in English, and has some unusual connotations for its use in this context. There is also the point that others are bringing up: it collides with another Unicode concept. When I search for "Unicode Rune", I get mainly results for the Runic Unicode block, and only a few for Go language documentation.
`char` is both half a word and also a full word, and you have to inspect its surroundings to determine which; likewise, it currently represents either half a letter or a full letter. Perhaps `System.character`, where it's always a full letter... :sunglasses:
`char` is a bit of a terrible representation, even for ASCII/Latin-only languages; the rise of emoji will still permeate. It means `char` is a "check, and maybe check the next `char`" type.

@NickCraver on twitter
While UTF-8 is a variable-width encoding, it's rare (if at all?) that a user wants to deal with half characters, both for UTF-8 and UTF-32. A 32-bit type would work well for enumeration.

More difficult would be IndexOf, Length, etc., from a performance or memory perspective.

However, when you start caring about actual characters (uppercasing, splitting on characters, understanding what a character is), a byte becomes variable-width. `char` doesn't really make that any better; it doubles the size of the smallest characters and includes more characters, but is still variable-width. For this, a 32-bit value might be very useful from a user-code perspective. However, it then has issues with position, length, and secondary items (IndexOf, etc.).
I'm very keen on an ASCII-only string and a UTF-8 string ("Compact String implementation" https://github.com/dotnet/coreclr/issues/7083) for fast processing of ASCII-only strings.

However, going against everything I was arguing there... I wonder what a 32-bit representation of UTF-8 would be like? Position would map to position; seeking chars would be fast, as it is in ASCII; items are in native sizes; etc. How would it stack up against processing every byte or char to determine its size? Conversion to and from would be more expensive, so it would be more of a processing format than a storage format.

@migueldeicaza as I understand it, you are only referring to expanding the single-character format from a 16-bit char to 32 bits, so all representations are contained in the value rather than possibly being a half-value; not necessarily to changing the internal format. However, there are some things to consider (i.e. the relation of position, the cost of seeking, etc.).

Aside: Swift also deals in whole-character formats:
Swift provides several different ways to access Unicode representations of strings. You can iterate over the string with a for-in statement, to access its individual Character values as Unicode extended grapheme clusters. This process is described in Working with Characters.
Alternatively, access a String value in one of three other Unicode-compliant representations:
- A collection of UTF-8 code units (accessed with the string’s utf8 property)
- A collection of UTF-16 code units (accessed with the string’s utf16 property)
- A collection of 21-bit Unicode scalar values, equivalent to the string’s UTF-32 encoding form (accessed with the string’s unicodeScalars property)
I said it in the original issue and will say it again. Abandoning what a standard says because you don't like the phrase will confuse more than it will solve, and, given there is a rune code page in Unicode, that just confuses it more.
The name is wrong.
@mellinoe

The `Rune` would provide many of the operations that today you expect on a `Char`, like `ToLower[Invariant]`, `ToUpper[Invariant]`, `ToTitle`, `IsDigit`, `IsAlpha`, `IsGraphic`, `IsSymbol`, `IsControl`.

Additionally, it would provide things like:

- `EncodeRune` (encodes a rune into a byte buffer)
- `RuneUtf8Len` (returns the number of bytes needed to encode the rune in UTF-8)
- `IsValid` (not all `Int32` values are valid)
- Interop to `string` and `Utf8String` as needed.
I ported/adjusted the Go string support to .NET, and it offers a view of what this world would look like (this is without any runtime help):
https://github.com/migueldeicaza/NStack/tree/master/NStack/unicode
@benaadams said:
I wonder what a 32bit representation of utf8 would be like? Position would map to position; seeking chars would be fast as it is in ascii, items are in native sizes etc how would it stack up against processing every byte or char to determine its size?
UTF8 is an in-memory representation, that would continue to exist and would continue to be the representation (and hopefully, this is the longer term internal encoding for future strings in .NET).
You would decode the existing UTF16 strings (System.String) or the upcoming UTF8 strings (Utf8String) not into Chars (for the reason both you and I agree on), but into Runes.
Some examples. Convert a UTF-8 string into runes:

Does a UTF-8 string contain a rune:
I just noticed I did not implement the indexer ("Get me the n-th rune")
The speed of access to the n-th rune in a string is a function of the storage, not of the `Rune` itself. For example, if your storage is UTF-32, you have direct access to every rune. This is academic, as nobody uses that. Access to the n-th element in UTF-16 and UTF-8 requires proper scanning of the elements making up the string (bytes or 16-bit ints) to determine the right boundary. Not to be confused with `String[int n] { get; }`, which just returns the n-th character, regardless of correctness.
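The point about storage determining access cost can be sketched in Go, where byte indexing is O(1) but may land in the middle of a rune, while reaching the n-th rune requires decoding from the start:

```go
package main

import "fmt"

func main() {
	s := "a𝔞c" // 𝔞 is U+1D51E, encoded as 4 bytes in UTF-8

	// O(1) byte indexing, but it returns a code unit, not a character:
	fmt.Println(s[1]) // 240: the first byte of 𝔞's UTF-8 encoding

	// Correct access to the n-th rune requires scanning/decoding:
	runes := []rune(s)           // an O(n) decode pass over the string
	fmt.Printf("%c\n", runes[1]) // 𝔞
}
```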
@benaadams The Swift `Character` is a level higher up from a rune. Characters in Swift are "extended grapheme clusters", which are made up of one or more runes that, when combined, produce a human-readable character. So the Swift `Character` does not have a fixed 32-bit size; it is variable length (and we should also have that construct, but it belongs in a different data type). Here is the example from that page, but this also extends to setting the tint of an emoji:
Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it’s rendered by a Unicode-aware text-rendering system.
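The composed vs. decomposed forms of é from the quote stay distinct at the code-point (rune) level, which is easy to see in Go since it also works with Unicode scalars:

```go
package main

import "fmt"

func main() {
	composed := "\u00e9"    // é as a single scalar, U+00E9
	decomposed := "e\u0301" // e followed by COMBINING ACUTE ACCENT

	fmt.Println(len([]rune(composed)))   // 1 code point
	fmt.Println(len([]rune(decomposed))) // 2 code points, 1 grapheme
	// The strings render identically but compare unequal unless normalized:
	fmt.Println(composed == decomposed) // false
}
```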
Just for me, the word `grapheme` would be more self-describing.
My two cents on the name, quoting again the Go post on strings with emphasis:
"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.
I 100% agree with @blowdart, calling it rune is just confusing and wrong. The Unicode standard mentions "code point" three times just on the first page of the introduction chapter, but the term "rune" appears nowhere.
If it’s a code point, then it should be named code point, simple as that.
If the term "rune" never appeared in the standard, it could be okay; the problem is that it appears several times in chapter 8, in relation to runes. It's not just wrong; it's actively confusing the matter with another one.
Just for me, the word `grapheme` would be more self-describing.

If this is about 32-bit code points, the term `grapheme` would be confusing, because a grapheme is something else again.
I've often wanted a code-point datatype (not in a good while, as what I've worked on has changed, but a few years ago I wanted this a lot, wrote overlapping partial solutions to parts of that need, and could have done with a well-tested library). I don't see why this shouldn't be called something like `CodePoint`. Most people who realise they need such a type would likely be thinking in terms of code points anyway, not in terms of runes; or else in terms of code points and runes as separate parts of their task. ᚱᚢᚾᚪ ᛒᛇᚦ ᛥᛁᛚᛖ ᛒᚱᚣᚳᛖᚢ/rúna béoþ stille bryceu/runes are still used. I only need to use runes about once a year, and generally with parchment and ink rather than anything digital, but there are certainly people who deal with them digitally too. (Even with 20th-century data, I know of a case where they're in use in archiving WWII-era data.)

Grapheme is trickier still, since one often wants to go octets → chars (nicely handled by .NET already), then chars → code points, and then code points → graphemes.
Flagging this as up-for-grabs for now.

Next steps: what we are looking for is a formal proposal that incorporates the feedback above (the actual naming of the type, and the advantages of using this as opposed to just using an `Int32`).
I have updated the issue, both with the proposed API and an initial implementation:
https://github.com/migueldeicaza/NStack/blob/master/NStack/unicode/Rune.cs
As for the naming of the type, it is both a matter of having a place where you can look for the valid operations on the type, and of having type-specific capabilities (see the implementation for some examples).
@migueldeicaza before flagging it as ready for review, what are your thoughts regarding the concerns about the actual naming of the type? Do you think that perhaps CodePoint might be better in terms of describing what the type is?
I think the argument for using "codepoint" as a name is weak. Using it is a terrible idea: in the long term, this needs to replace every single use of `char` in existing code, if we hope to get proper Unicode support.

I wish we could have used `char` like Rust does, but sadly, we already took it and we have a broken one.

Go having embraced this name is a good precedent.
I agree that "code point" isn't the correct term to use here. At the very least, based on the Unicode standard, it does not include values above 10FFFF (http://unicode.org/glossary/#code_point).

I don't like the term "rune". I think it has an existing use in Unicode and elsewhere that will only cause confusion overall. I also think it has a pretty good chance of conflicting with existing user types (especially for things like Unity, where a 'Rune' might represent a specific game object).

However, I do like the idea of a type that covers the C++11 `char32_t` type, just with a different name.
There's something to be said for `Char32`. It's to the point, and it's analogous to the type names of the integral types. It talks at the character conceptual level rather than the code-point level. It isn't the name of a script.
Since we are looking at having `nint`, how about `nchar`? The precedent would be in databases: `nchar` and `nvarchar`, where `nchar` is national char / national character and `nvarchar` is national char varying / national character varying, which are the field types you can store Unicode in; also some ISO standard, not sure which, maybe SQL?
What is this Unicode use of rune? That is news to me.
U+16A0 to U+16F8
It is used to refer to a specific code page in the Unicode standard. It has been brought up a few times in this thread: http://unicode.org/charts/PDF/U16A0.pdf
Ah runic, not rune.
The backing name (System.Rune or System.Char32) is not as important as the label that will be projected into C#.
Firstly: yes, yes, and more of this please. I love this idea (honestly, I've had a similar idea going for a long time now). In fact, we've been using a custom string class and character struct in our Git compatibility layer in Visual Studio for a while now (Git speaks in UTF-8, and transcoding everything is very slow).
On the topic of static method names, can we avoid arbitrary short-naming please? Given that `Char.IsPunctuation` is the current method, can we please mirror that with `Rune.IsPunctuation` or similar?

Assuming (always dangerous) that this gets accepted, can we have an intrinsic `rune` or `c32`, or just replace `char` completely with the `System.Rune` implementation?
I suggest `unichar` or `uchar`, although `uchar` would look like it's an unsigned char. Whichever is chosen, though, I do hope we get a language-specific alias for it. I personally am a big fan of using the language aliases for primitive types.

Also, I agree with @whoisj: I would definitely prefer full method names over short/abbreviated ones.
Also I agree with @whoisj - Would definitely prefer full method names over short/abbreviations.

IMO a language (and its libraries) needs to choose either full names or go whole hog on the abbreviations (like C with strcmp, memcpy, etc.).
or just replace `char` completely with the `System.Rune` implementation?

That would be a breaking change for fairly obvious reasons.
That would be a breaking change for fairly obvious reasons.

My comment was mostly tongue-in-cheek, and hopeful. A 16-bit type for character was a mistake from the start.
Good catch on the naming, will fix.
There are other small inconsistencies in the provided API, will take a look at fixing those as well.
@migueldeicaza
Ah runic, not rune.
Runic is the adjective, rune the noun. All the runic characters are runes.
Runic is the adjective, rune the noun. All the runic characters are runes.
Fair, as it seems "Cortana: define 'rune'" comes up with:
a letter of an ancient Germanic alphabet, related to the Roman alphabet.
Ah yes, whenever I see the word "rune", I immediately think of this obscure chapter on a spec nobody has read that talks about "The Runic Unicode Block".
😆 I think of childhood memories of reading Tolkien.
ᛁ᛫ᚦᛁᛜᚲ᛫ᛟᚠ᛫ᚱᚢᚾᛖᛋ
Yeah, I don't specifically think of the spec, but I do think of the type of characters that the spec refers to.
You say "rune" and I think of magic, fantasy, cryptic puzzles, ancient languages, etc.
I am glad that you do not see the word "rune" and immediately think "Ah this clearly refers to the Unicode 7.0 runic block whose value will be limited to those unique values in the range 16A0..16F8".
I know that Tanner is a single voice here, and some of you are still thinking "But Miguel, I see the word 'rune' and I immediately think of a data type that could only ever hold 88 possible values". If this is a problem you are struggling with, my brother/sister, I have news for you: you have bigger fish to fry.
I've been following this thread for a while with a mixture of excitement and hesitancy for a little over a month. I attended the Internationalization and Unicode Conference last month, and none of the presentations dealt with .NET. There is a perception problem with the .NET Framework; one that isn't necessarily unearned given the history of its globalization features. That being said, I love programming in C# and absolutely want to see new features that reinforce .NET's place in a truly global community. I think this proposal is a good step in that direction of embracing the standards that the internationalization community expects of software.
My hesitancy has mostly been over the bickering about the type name. While it is true that the designers of Go chose the name "rune", that's problematic for the reason listed above repeatedly: there are code points that are properly called runes. It is hard for me to agree with a proposal that tries to hew closely to a respected standard, and then redefines terminology that is part of the specification. Furthermore, the argument that most developers are ignorant of the term is specious given that the developers most interested in using this type correctly are more likely to understand the Unicode specification and have a good idea what a "rune" actually is. Imagine the oddity that could exist if you mixed the terminology:
Rune.IsRune(new Rune('ᛁ')); // evaluates to true
Rune.IsRune(new Rune('I')); // evaluates to false
Of course, I've taken the easy path here, critiquing without providing a new name. I think the previous suggestion of `CodePoint` is the most self-descriptive option (and it appears in the original issue description), but `char32` would have more parity with the existing primitive types (although I would hesitate there, since not every code point is a character). If the goal is building better Unicode support into .NET, I'm absolutely supportive of that path, but the best way to do that is to follow the spec.
Three suggestions:

1. The Rune class is missing the critical "IsCombining". Without that, we can't convert from a series of runes (code points) into a series of graphemes.

2. I'd love to also have a corresponding Grapheme class. A grapheme in this context is really just a list of one or more Runes (code points) such that the first rune isn't combining and the rest of the runes are combining. The use case is for when a developer needs to deal with chunks of "visible characters". For example, a + GRAVE is two runes that form one grapheme.

3. In networking we often get a hunk of bytes which we need to turn into a "string"-like object, where the bytes might not be complete (e.g., we get told of some bytes, but the last byte in a multi-byte sequence hasn't quite arrived yet). I don't see any obvious way of converting a stream of bytes into a stream of runes such that missing the last byte of a multi-byte sequence is considered a normal situation that will be rectified when we get the next set of bytes.
And lastly, please use Unicode names and call this a CodePoint. Yes, the Unicode consortium does a terrible job of explaining the difference. But the solution is to add clear and usable documentation; anything else confuses the issue instead of helping to clarify it.
I do not know where to start on the combining request; neither Go, Rust, nor Swift surface such an API on their rune, Character, or Unicode Scalar (their names for `System.Rune`). Please provide a proposed implementation.

On grapheme clusters: it is a good idea, but it should be tracked independently of `System.Rune`. For what it's worth, Swift uses `Character` for this, but also Swift is not a great model for handling strings.
Turning streams of bytes into a proper rune is a problem that belongs to a higher-level API. That said, you can look at my `ustring` implementation, which uses the same substrate as my `System.Rune` implementation, to see how these buffers are mapped into UTF-8 strings:
https://github.com/migueldeicaza/NStack/blob/master/NStack/strings/ustring.cs
Documentation, which I have not yet updated since I introduced `System.Rune` into the API, but which covers it:
https://migueldeicaza.github.io/NStack/api/NStack/NStack.ustring.html
As for naming, clearly Rust has the best one with `char`, but we messed that one up. The second best is Go with `rune`. Anything longer than four characters will just be a nuisance for people trying to do the right thing.
I'm sorry; I think `CodePoint` is an outstandingly good name. It's self-explanatory, memorable, and autocompletes with "cp".

`IsCombining` would definitely be necessary, but so too is knowing the combining class, and once we have that, `IsCombining` is largely sugar, as it's just `IsCombining => CombiningClass != 0` or `IsCombining => CombiningClass != CombiningClass.None`. Grapheme clusters would indeed be outside of it again, but the starting point would be knowing the combining class for default clustering, reordering, etc.
`CodePoint` is a great name for a type about code points, and four characters is hardly a limit we have to deal with for other heavily used types; `string` is 50% longer and that doesn't prevent us from using it regularly. Four randomly picked letters would be a better name than repeating Go's mistake.
Since `uint` isn't CLS-compliant, there's no CLS-compliant ctor that covers the astral planes. `int` would be necessary too.

Two-way implicit conversions can lead to bad things happening with overloads, so one direction should perhaps be explicit. It's not clear which. On the one hand, `uint`/`int` is wider than code points, as values below 0 or above 10FFFF₁₆ aren't meaningful, and having that conversion implicit allows for quicker use of more existing APIs for numbers. On the other hand, I can see wanting to cast from a number to a code point more often than the other way around.
Since uint isn't CLS-compliant, there's no CLS-compliant ctor that covers the astral planes. int would be necessary too.
That is unless a new intrinsic type were introduced into the common language. <hint, wink, hint>
@JonHanna -- do you mean that these three constructors:

public static implicit operator uint (Rune rune);
public static implicit operator Rune (char ch);
public static implicit operator Rune (uint value);

should use `int` instead of `uint`? AFAICT, `int` easily covers the entire set of astral (non-BMP) planes.
@PeterSmithRedmond I mean that, as well as the two constructors, one taking `char` and one taking `uint`, there should be one taking `int`; but yes, there should also be an `int` conversion operator (just what should be `implicit` and what `explicit` is another question). There's no harm having `uint` too, for those languages that can use it; it's quite a natural match, after all.
If this should replace System.Char, it should be possible to do "arithmetic" on it (that is, ==, !=, >, <; unsure about +, -, *, /), and more importantly there should be support for literals of this type. For example, I should be able to write:

rune r = '𐍈'; // Ostrogothic character chosen on purpose, as in UTF-16 it will be a "surrogate pair"
If not `rune`, the only other synonym of "character" that could work is perhaps `letter`?

noun
- a written or printed communication addressed to a person or organization and usually transmitted by mail.
- a symbol or character that is conventionally used in writing and printing to represent a speech sound and that is part of an alphabet.
- a piece of printing type bearing such a symbol or character.

Though that would conflict with "letter vs. number".

Letter has an even more precise meaning in Unicode (and .NET in general) than rune.
I think, if we're going to make this a Unicode character type, we need to follow Unicode's naming conventions, which means "code point":

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Or maybe we just give up, call a duck a "duck", and refer to them as Unicode characters (aka `uchar`).
Inspired by the discussion here:
https://github.com/dotnet/corefxlab/issues/1751
One of the challenges that .NET faces with its Unicode support is that it is rooted in a design that is nowadays obsolete. The way that we represent characters in .NET is with `System.Char`, which is a 16-bit value, one that is insufficient to represent Unicode values.

.NET developers need to learn about the arcane surrogate pairs:

https://msdn.microsoft.com/en-us/library/xcwwfbb8(v=vs.110).aspx

Developers rarely use this support, mostly because they are not familiar enough with Unicode, let alone with what .NET has to offer for them.
I propose that we introduce a `System.Rune` that is backed by a 32-bit integer, which corresponds to a code point, and that we surface in C# an equivalent `rune` type as an alias to it. `rune` would become the preferred replacement for `char` and serve as the foundation for proper Unicode and string handling in .NET.

As for why the name rune, the inspiration comes from Go:

https://blog.golang.org/strings

The section "Code points, characters, and runes" provides the explanation; a short version is:
Update: I now have an implementation of `System.Rune` here:

https://github.com/migueldeicaza/NStack/blob/master/NStack/unicode/Rune.cs
With the following API:
Update: Known issues