dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Introducing System.Rune #23578

Closed migueldeicaza closed 4 years ago

migueldeicaza commented 7 years ago

Inspired by the discussion here:

https://github.com/dotnet/corefxlab/issues/1751

One of the challenges that .NET faces with its Unicode support is that it is rooted in a design that is nowadays obsolete. The way we represent characters in .NET is with System.Char, a 16-bit value that is insufficient to represent every Unicode code point.

.NET developers need to learn about the arcane Surrogate Pairs:

https://msdn.microsoft.com/en-us/library/xcwwfbb8(v=vs.110).aspx

Developers rarely use this support, mostly because they are not familiar enough with Unicode, let alone with what .NET has to offer for dealing with it.
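To make the problem concrete, here is a minimal sketch, using only existing .NET APIs, of what a single astral-plane character looks like through System.Char:

using System;

string s = "𐍈"; // U+10348 GOTHIC LETTER HWAIR, a single code point outside the BMP

Console.WriteLine(s.Length);                  // 2: two UTF-16 code units, not one "character"
Console.WriteLine(char.IsSurrogate(s[0]));    // True: s[0] alone is meaningless
Console.WriteLine(char.ConvertToUtf32(s, 0)); // 66376 (0x10348), the actual code point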

I propose that we introduce a System.Rune that is backed by a 32-bit integer and corresponds to a code point, and that we surface in C# an equivalent rune type as an alias to this type.

rune would become the preferred replacement for char and serve as the foundation for proper Unicode and string handling in .NET.

As for why the name rune, the inspiration comes from Go:

https://blog.golang.org/strings

The section "Code points, characters, and runes" provides the explanation, a short version is:

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.

Update: I now have an implementation of System.Rune here:

https://github.com/migueldeicaza/NStack/blob/master/NStack/unicode/Rune.cs

With the following API:

public struct Rune {

    public Rune (uint rune);
    public Rune (char ch);

    public static ValueTuple<Rune,int> DecodeLastRune (byte [] buffer, int end);
    public static ValueTuple<Rune,int> DecodeLastRune (NStack.ustring str, int end);
    public static ValueTuple<Rune,int> DecodeRune (byte [] buffer, int start, int n);
    public static ValueTuple<Rune,int> DecodeRune (NStack.ustring str, int start, int n);
    public static int EncodeRune (Rune rune, byte [] dest, int offset);
    public static bool FullRune (byte [] p);
    public static bool FullRune (NStack.ustring str);
    public static int InvalidIndex (byte [] buffer);
    public static int InvalidIndex (NStack.ustring str);
    public static bool IsControl (Rune rune);
    public static bool IsDigit (Rune rune);
    public static bool IsGraphic (Rune rune);
    public static bool IsLetter (Rune rune);
    public static bool IsLower (Rune rune);
    public static bool IsMark (Rune rune);
    public static bool IsNumber (Rune rune);
    public static bool IsPrint (Rune rune);
    public static bool IsPunctuation (Rune rune);
    public static bool IsSpace (Rune rune);
    public static bool IsSymbol (Rune rune);
    public static bool IsTitle (Rune rune);
    public static bool IsUpper (Rune rune);
    public static int RuneCount (byte [] buffer, int offset, int count);
    public static int RuneCount (NStack.ustring str);
    public static int RuneLen (Rune rune);
    public static Rune SimpleFold (Rune rune);
    public static Rune To (Case toCase, Rune rune);
    public static Rune ToLower (Rune rune);
    public static Rune ToTitle (Rune rune);
    public static Rune ToUpper (Rune rune);
    public static bool Valid (byte [] buffer);
    public static bool Valid (NStack.ustring str);
    public static bool ValidRune (Rune rune);
    public override bool Equals (object obj);

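    // System.Object members and overrides surfaced by the reflection dump: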
    [System.Runtime.ConstrainedExecution.ReliabilityContractAttribute((System.Runtime.ConstrainedExecution.Consistency)3, (System.Runtime.ConstrainedExecution.Cer)2)]
    protected virtual void Finalize ();
    public override int GetHashCode ();
    public Type GetType ();
    protected object MemberwiseClone ();
    public override string ToString ();

    public static implicit operator uint (Rune rune);
    public static implicit operator Rune (char ch);
    public static implicit operator Rune (uint value);

    public bool IsValid {
        get;
    }

    public static Rune Error;
    public static Rune MaxRune;
    public const byte RuneSelf = 128;
    public static Rune ReplacementChar;
    public const int Utf8Max = 4;

    public enum Case {
        Upper,
        Lower,
        Title
    }
}

Update: Known Issues

JonHanna commented 7 years ago

Do you expect the in-memory representation to be strings of 32-bit objects, or translated on the fly? What about the memory doubling if the former? What's the performance impact if the latter?

Is naming a Unicode-related technology after a particular Unicode-supported script (and a technology to improve astral plane support after a BMP script, at that) a good idea?

mellinoe commented 7 years ago

I think the proposal (and perhaps it needs to be made more explicit) is that the in-memory representation of strings does not change at all. The Rune type merely represents a distinct individual 21-bit code point (stored as a 32-bit int). Methods referring to code points could potentially return a Rune instead. Presumably there is some functionality in string that would let you enumerate Runes.
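As a concrete sketch of what enumerating Runes could mean, here is how code points can already be walked over a UTF-16 string with existing APIs (EnumerateCodePoints is a hypothetical helper; a Rune-returning enumerator would wrap the same logic):

using System;
using System.Collections.Generic;

foreach (int cp in EnumerateCodePoints("a𐍈b"))
    Console.WriteLine(cp.ToString("X")); // 61, 10348, 62

static IEnumerable<int> EnumerateCodePoints(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        yield return char.ConvertToUtf32(s, i); // decodes a surrogate pair if one starts here
        if (char.IsHighSurrogate(s[i]))
            i++; // the pair consumed two chars
    }
}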

I think there are a couple of obvious points we need to reach consensus on for something like this:

  1. Is there significant value in creating a Rune type rather than using Int32 as current methods do?
  2. Is the word "rune" actually a good choice?

To answer (1), I think we need a fuller description of how Rune would be exposed, what methods would receive and return it, etc., and to determine whether that is better than having those methods deal with Int32 instead.

As for (2), I'm a bit hesitant myself. "Rune" is sort of an esoteric word in English, and has some unusual connotations for its use in this context. There is also the point that others are bringing up: it collides with another Unicode concept. When I search for "Unicode Rune", I get mainly results for the Runic Unicode block, and only a few for the Go language documentation.

benaadams commented 7 years ago

char is both half a word and also a full word, and you have to inspect its surroundings to determine which; it currently represents either half a letter or a full letter.

Perhaps System.character, where it's always a full letter... :sunglasses:

benaadams commented 7 years ago

char is a bit of a terrible representation, even for ascii/latin-only languages; the rise of emoji will still permeate them; it means char is a "check, and maybe check the next char" type

@NickCraver on twitter

While utf8 is a variable-width encoding, it's rare (if it happens at all?) that a user wants to deal with half characters; that applies to both utf8 and utf32.

A 32-bit type would work well for enumeration.

More difficult would be indexOf, Length, etc., from a performance or memory perspective.

  1. byte array is best representation for an opaque format; e.g. keeping the format in its original format or a final format (file transfer, putting on wire etc)
  2. byte array is best representation for memory bandwidth and memory size
  3. byte array is consistent with Position and indexOf, Length etc in terms of bytes

However, when you start caring about actual characters, uppercasing, splitting on characters, and understanding what a character is, byte becomes variable width. Char doesn't really make that any better; it doubles the size of the smallest characters and includes more characters, but is still variable width.

For this, a 32-bit value might be very useful from a user-code perspective. However, it then has issues with position, length, and secondary items (indexOf, etc.).

I'm very keen on an ascii-only string and a utf8 string ("Compact String implementation", https://github.com/dotnet/coreclr/issues/7083) for fast processing of ascii-only strings.

However, going against everything I was arguing there... I wonder what a 32-bit representation of utf8 would be like? Position would map to position; seeking chars would be fast as it is in ascii; items are in native sizes, etc. How would it stack up against processing every byte or char to determine its size?

Conversion to and from would be more expensive, so it would be more of a processing format than a storage format.
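To put rough numbers on that tradeoff, a quick comparison using the existing encoders:

using System;
using System.Text;

string s = "Hello, 世界";

Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 13: compact, variable width
Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 18: UTF-16, still variable width
Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 36: 4 bytes per code point, O(1) position math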

@migueldeicaza as I understand it, you are only referring to expanding the single-character format from 16-bit char to 32-bit so that all representations are contained in the value, rather than a possible half-value; not necessarily changing the internal string format.

However, some things to consider (i.e. the relation of position, the cost of seeking, etc.)

Aside: Swift also deals in whole character formats

Swift provides several different ways to access Unicode representations of strings. You can iterate over the string with a for-in statement, to access its individual Character values as Unicode extended grapheme clusters. This process is described in Working with Characters.

Alternatively, access a String value in one of three other Unicode-compliant representations:

  • A collection of UTF-8 code units (accessed with the string’s utf8 property)
  • A collection of UTF-16 code units (accessed with the string’s utf16 property)
  • A collection of 21-bit Unicode scalar values, equivalent to the string’s UTF-32 encoding form (accessed with the string’s unicodeScalars property)
blowdart commented 7 years ago

I said it in the original issue and will say it again. Abandoning what a standard says because you don't like the phrase will confuse more than it will solve, and, given there is a rune code page in Unicode, that just confuses it more.

The name is wrong.

migueldeicaza commented 7 years ago

@mellinoe

The Rune would provide many of the operations that today you expect on a Char, like ToLower[Invariant], ToUpper[Invariant], ToTitle, IsDigit, IsAlpha, IsGraphic, IsSymbol, IsControl.

Additionally, it would provide things like:

And interop to string, and Utf8string as needed.

I ported/adjusted the Go string support to .NET, and it offers a view of what this world would look like (this is without any runtime help):

https://github.com/migueldeicaza/NStack/tree/master/NStack/unicode

@benaadams said:

I wonder what a 32-bit representation of utf8 would be like? Position would map to position; seeking chars would be fast as it is in ascii; items are in native sizes, etc. How would it stack up against processing every byte or char to determine its size?

UTF8 is an in-memory representation that would continue to exist and would continue to be the representation (and hopefully, this is the longer-term internal encoding for future strings in .NET).

You would decode the existing UTF16 strings (System.String) or the upcoming UTF8 strings (Utf8String) not into Chars (for the reason both you and I agree on), but into Runes.

Some examples, convert a Utf8 string into runes:

https://github.com/migueldeicaza/NStack/blob/6a071ca5c026ca71c10ead4f7232e2fa0673baf9/NStack/strings/ustring.cs#L756

Does a utf8 string contain a rune:

https://github.com/migueldeicaza/NStack/blob/6a071ca5c026ca71c10ead4f7232e2fa0673baf9/NStack/strings/ustring.cs#L855

I just noticed I did not implement the indexer ("Get me the n-th rune")

The speed of access to the Nth rune in a string is a function of the storage, not of the Rune itself. For example, if your storage is UTF32, you have direct access to every rune. This is academic, as nobody uses that. Access to the Nth element in UTF16 and UTF8 requires proper scanning of the elements making up the string (bytes or 16-bit ints) to determine the right boundary. Not to be confused with String[int n] { get; }, which just returns the n-th character, regardless of correctness.
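For example, over today's UTF-16 System.String, getting the n-th code point is necessarily a linear scan (GetNthCodePoint is a hypothetical helper using only existing APIs):

using System;

Console.WriteLine(GetNthCodePoint("a𐍈b", 1).ToString("X")); // 10348

static int GetNthCodePoint(string s, int n)
{
    for (int i = 0; i < s.Length; i++)
    {
        if (n-- == 0)
            return char.ConvertToUtf32(s, i); // decodes a full pair if needed
        if (char.IsHighSurrogate(s[i]))
            i++; // one code point, two chars
    }
    throw new ArgumentOutOfRangeException(nameof(n));
}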

migueldeicaza commented 7 years ago

@benaadams The Swift Character is a level above a rune. Characters in Swift are "extended grapheme clusters", which are made up of one or more runes that, when combined, produce a human-readable character.

So the Swift Character does not have a fixed 32-bit size; it is variable length (and we should also have that construct, but it belongs in a different data type). Here is the example from that page, but this also extends to setting the tint of an emoji:

Here’s an example. The letter é can be represented as the single Unicode scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same letter can also be represented as a pair of scalars—a standard letter e (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to the scalar that precedes it, turning an e into an é when it’s rendered by a Unicode-aware text-rendering system.
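That example can be reproduced in .NET today with string.Normalize:

using System;
using System.Text;

string composed   = "\u00E9";  // é as a single code point
string decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

Console.WriteLine(composed == decomposed); // False: different code point sequences
Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC)); // True after NFC composition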

xplicit commented 7 years ago

Just for me, the word grapheme would be more self-describing.

0xced commented 7 years ago

My two cents on the name, quoting again the Go post on strings with emphasis:

"Code point" is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as "code point", with one interesting addition.

I 100% agree with @blowdart, calling it rune is just confusing and wrong. The Unicode standard mentions code points three times just on the first page of the introduction chapter, but the term rune appears nowhere.

If it’s a code point, then it should be named code point, simple as that.

JonHanna commented 7 years ago

If the term rune never appeared in the standard, it could be okay; the problem is that it appears several times in chapter 8, in relation to runes. It's not just wrong, it's actively confusing the matter with another.

JonHanna commented 7 years ago

Just for me, the word grapheme would be more self-describing.

If this is about 32-bit code-points the term grapheme would be confusing because a grapheme is something else again.

I've often wanted a code-point datatype (not in a good while, as what I've worked on has changed, but a few years ago I wanted this a lot and wrote overlapping partial solutions to parts of that need, and could have done with a well-tested library). I don't see why this shouldn't be called something like CodePoint. Most people who realise they need such a type would likely be thinking in terms of code-points anyway, not in terms of runes; or else in terms of code-points and runes as separate parts of their task. ᚱᚢᚾᚪ ᛒᛇᚦ ᛥᛁᛚᛖ ᛒᚱᚣᚳᛖᚢ/rúna béoþ stille bryceu/runes are still used. I only need to use runes about once a year, and generally with parchment and ink rather than anything digital, but there are certainly people who deal with them digitally too. (Even with 20th century data, I know of a case where they're in use in archiving WWII-era data.)

Grapheme is trickier still, since one often wants to go octets → chars (nicely handled by .NET already), then chars → code-points, and then code-points → graphemes.
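For that last step, .NET already exposes grapheme iteration as "text elements"; a small sketch:

using System;
using System.Globalization;

string s = "e\u0301x"; // three code points: e, COMBINING ACUTE ACCENT, x

var elements = StringInfo.GetTextElementEnumerator(s);
while (elements.MoveNext())
    Console.WriteLine((string)elements.Current); // prints "é" then "x": two graphemes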

joperezr commented 7 years ago

flagging this as up-for-grabs for now.

Next steps: what we are looking for is a formal proposal that includes the feedback from above (the actual naming of the type, and the advantages of using this as opposed to just an Int32).

migueldeicaza commented 7 years ago

I have updated the issue, both with the proposed API and an initial implementation:

https://github.com/migueldeicaza/NStack/blob/master/NStack/unicode/Rune.cs

As for the naming of the type, it is both a matter of having a place where you can look for the valid operations on the type, as well as having type-specific capabilities (see the implementation for some examples).

joperezr commented 7 years ago

@migueldeicaza before flagging it as ready for review, what are your thoughts regarding the concerns on the actual naming of the type? Do you think that perhaps CodePoint might be better in terms of describing what the type is?

migueldeicaza commented 7 years ago

I think the argument for using codepoint as a name is weak.

Using it is a terrible idea: in the long term, this needs to replace every single use of "char" in existing code, if we hope to get proper Unicode support.

I wish we could have used "char" like Rust does, but sadly, we already took it and we have a broken one.

Go having embraced this name is a good precedent.

tannergooding commented 7 years ago

I agree that code point isn't the correct term to use here. At the very least, based on the Unicode standard it does not include values above 10FFFF (http://unicode.org/glossary/#code_point).

I don't like the term rune. I think it has an existing use in Unicode and elsewhere that will only cause confusion overall. I also think it has a pretty good chance of conflicting with existing user types (especially for things like Unity, where a 'Rune' might represent a specific game object).

However, I do like the idea of a type that covers the C++ 11 char32_t type, just with a different name.

JonHanna commented 7 years ago

There's something to be said for Char32. It's to the point, it's analogous to the type names of the integral types. It talks at the character conceptual level, rather than the code-point level. It isn't the name of a script.

benaadams commented 7 years ago

Since we are looking at having nint, how about nchar?

The precedent would be nchar and nvarchar in databases,

where nchar is national char / national character and nvarchar is national char varying / national character varying; these are the field types you can store Unicode in. They are also in some ISO standard (not sure which, maybe SQL?).

migueldeicaza commented 7 years ago

What is this Unicode use of rune? That is news to me.

JonHanna commented 7 years ago

U+16A0 to U+16F8

tannergooding commented 7 years ago

It is used to refer to a specific code page in the Unicode standard. It has been brought up a few times in this thread: http://unicode.org/charts/PDF/U16A0.pdf

migueldeicaza commented 7 years ago

Ah runic, not rune.

migueldeicaza commented 7 years ago

The backing name (System.Rune or System.Char32) is not as important as the label that will be projected into C#.

whoisj commented 7 years ago

Firstly: yes, yes, and more of this please. I love this idea (honestly, I've had a similar idea going for a long time now). In fact, we've been using a custom string class and character struct in our Git compatibility layer in Visual Studio for a while now (Git speaks in Utf-8 and transcoding everything is very slow).

On the topic of static method names, can we avoid arbitrary short-naming please? Given that Char.IsPunctuation is the current method can we please mirror that with Rune.IsPunctuation or similar?

Assuming (always dangerous) that this gets accepted, can we have an intrinsic rune or c32, or just replace char completely with the System.Rune implementation?

jakesays-old commented 7 years ago

I suggest unichar or uchar, although uchar would look like it's an unsigned char. Whichever is chosen, though, I do hope we get a language-specific alias for it. I personally am a big fan of using the language aliases for primitive types.

Also I agree with @whoisj - Would definitely prefer full method names over short/abbreviations.

whoisj commented 7 years ago

Also I agree with @whoisj - Would definitely prefer full method names over short/abbreviations.

IMO a language (and its libraries) needs to either choose full names or go whole hog on abbreviations (like C with strcmp, memcpy, etc.).

Joe4evr commented 7 years ago

or just replace char completely with the System.Rune implementation?

That would be a breaking change for fairly obvious reasons.

whoisj commented 7 years ago

That would be a breaking change for fairly obvious reasons.

My comment was mostly tongue in cheek, and hopeful. A 16-bit type for character was a mistake from the start.

migueldeicaza commented 7 years ago

Good catch on the naming, will fix.

There are other small inconsistencies in the provided API, will take a look at fixing those as well.

JonHanna commented 7 years ago

@migueldeicaza

Ah runic, not rune.

Runic is the adjective, rune the noun. All the runic characters are runes.

whoisj commented 7 years ago

Runic is the adjective, rune the noun. All the runic characters are runes.

Fair enough, as "Cortana: define 'rune'" comes up with:

a letter of an ancient Germanic alphabet, related to the Roman alphabet.

migueldeicaza commented 7 years ago

Ah yes, whenever I see the word "rune", I immediately think of this obscure chapter on a spec nobody has read that talks about "The Runic Unicode Block".

jnm2 commented 7 years ago

😆 I think of childhood memories of reading Tolkien.

JonHanna commented 7 years ago

ᛁ᛫ᚦᛁᛜᚲ᛫ᛟᚠ᛫ᚱᚢᚾᛖᛋ

tannergooding commented 7 years ago

Yeah, I don't specifically think of the spec, but I do think of the type of characters that the spec refers to.

You say rune and I think of magic, fantasy, cryptic puzzles, ancient languages, etc.

migueldeicaza commented 6 years ago

I am glad that you do not see the word "rune" and immediately think "Ah this clearly refers to the Unicode 7.0 runic block whose value will be limited to those unique values in the range 16A0..16F8".

I know that Tanner is a single voice here, and some of you are still thinking "But Miguel, I see the word 'rune' and I immediately think of a data type that could only ever hold 88 possible values". If this is a problem you are struggling with, my brother/sister, I have news for you: you have bigger fish to fry.

mgw854 commented 6 years ago

I've been following this thread with a mixture of excitement and hesitancy for a little over a month. I attended the Internationalization and Unicode Conference last month, and none of the presentations dealt with .NET. There is a perception problem with the .NET Framework, one that isn't necessarily unearned given the history of its globalization features. That being said, I love programming in C# and absolutely want to see new features that reinforce .NET's place in a truly global community. I think this proposal is a good step in that direction, embracing the standards that the internationalization community expects of software.

My hesitancy has mostly been over the bickering about the type name. While it is true that the designers of Go chose the name "rune", that's problematic for the reason listed above repeatedly: there are code points that are properly called runes. It is hard for me to agree with a proposal that tries to hew closely to a respected standard, and then redefines terminology that is part of the specification. Furthermore, the argument that most developers are ignorant of the term is specious given that the developers most interested in using this type correctly are more likely to understand the Unicode specification and have a good idea what a "rune" actually is. Imagine the oddity that could exist if you mixed the terminology:

Rune.IsRune(new Rune('ᛁ')); // evaluates to true
Rune.IsRune(new Rune('I')); // evaluates to false

Of course, I've taken the easy path here, critiquing without providing a new name. I think the previous suggestion of CodePoint is the most self-descriptive option (and it appears in the original issue description), but char32 would have more parity with the existing primitive types (although I hesitate there, since not every code point is a character). If the goal is building better Unicode support into .NET, I'm absolutely supportive of that path, but the best way to do that is to follow the spec.

PeterSmithRedmond commented 6 years ago

Three suggestions:

  1. The Rune class is missing the critical "IsCombining". Without that, we can't convert from a series of runes (code points) into a series of graphemes.

  2. I'd love to also have a corresponding Grapheme class. A grapheme in this context is really just a list of one or more Runes (Code Points) such that the first rune isn't combining and the rest of the runes are combining. The use case is for when a developer needs to deal with chunks of "visible characters". For example, a + GRAVE is two runes that form one grapheme.

  3. In networking we often get a hunk of bytes which we need to turn into a "string" like object where the bytes might not be complete (e.g., we get told of some bytes, but the last byte in a multi-byte sequence hasn't quite arrived yet). I don't see any obvious way of converting a stream of bytes into a stream of runes such that missing the last byte of a multi-byte sequence is considered a normal situation that will be rectified when we get the next set of bytes in.
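For what it's worth, point 3 is roughly what System.Text.Decoder already does at the bytes-to-chars level: it carries incomplete multi-byte sequences across calls. A minimal sketch (the bytes shown are the UTF-8 encoding of é):

using System;
using System.Text;

Decoder decoder = Encoding.UTF8.GetDecoder(); // stateful: buffers incomplete sequences
char[] chars = new char[8];

int n1 = decoder.GetChars(new byte[] { 0xC3 }, 0, 1, chars, 0); // 0 chars: sequence incomplete
int n2 = decoder.GetChars(new byte[] { 0xA9 }, 0, 1, chars, 0); // 1 char: 'é' completed

Console.WriteLine($"{n1}, {n2}"); // 0, 1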

And lastly, please use Unicode names and call this a CodePoint. Yes, the Unicode consortium does a terrible job of explaining the difference. But the solution is to add clear and usable documentation; anything else confuses the issue instead of helping to clarify it.

migueldeicaza commented 6 years ago

I do not know where to start on the combining request; neither Go, Rust, nor Swift surfaces such an API on rune, Character, or Unicode Scalar (their names for System.Rune). Please provide a proposed implementation.

On grapheme clusters, it is a good idea; it should be tracked independently of System.Rune. For what it's worth, Swift uses Character for this, but Swift is also not a great model for handling strings.

Turning streams of bytes into proper runes is a problem that belongs to a higher-level API. That said, you can look at my ustring implementation, which uses the same substrate as my System.Rune implementation, to see how these buffers are mapped into utf8 strings:

https://github.com/migueldeicaza/NStack/blob/master/NStack/strings/ustring.cs

Documentation, which I have not updated yet since I introduced System.Rune into the API, but covers it:

https://migueldeicaza.github.io/NStack/api/NStack/NStack.ustring.html

As for naming, clearly Rust has the best one with char, but we messed that one up. The second best is Go's rune. Anything longer than four characters will just be a nuisance that keeps people from doing the right thing.

jnm2 commented 6 years ago

I'm sorry; I think CodePoint is an outstandingly good name. It's self-explanatory, memorable, and autocompletes with cp.

JonHanna commented 6 years ago

IsCombining would definitely be necessary, but so too is knowing the combining class, and once we have that, IsCombining is largely sugar: it's just IsCombining => CombiningClass != 0 or IsCombining => CombiningClass != CombiningClass.None. Grapheme clusters would indeed be outside of it again, but the starting point would be knowing the combining class for default clustering, reordering, etc.
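.NET does not currently expose the canonical combining class, but a rough IsCombining can be approximated from the general category (an approximation only; real reordering needs the combining class itself):

using System;
using System.Globalization;

Console.WriteLine(IsCombining(0x0301)); // True: COMBINING ACUTE ACCENT
Console.WriteLine(IsCombining(0x0065)); // False: LATIN SMALL LETTER E

static bool IsCombining(int codePoint)
{
    // The three Mark general categories cover combining marks.
    var category = CharUnicodeInfo.GetUnicodeCategory(char.ConvertFromUtf32(codePoint), 0);
    return category == UnicodeCategory.NonSpacingMark
        || category == UnicodeCategory.SpacingCombiningMark
        || category == UnicodeCategory.EnclosingMark;
}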

CodePoint is a great name for a type about code points, and four characters is hardly a limit we hold other heavily used types to; string is 50% longer and that doesn't prevent us using it regularly. Four randomly picked letters would be a better name than repeating Go's mistake.

jnm2 commented 6 years ago

https://www.random.org/strings/?num=10&len=4&loweralpha=on&unique=on&format=html&rnd=new

JonHanna commented 6 years ago

Since uint isn't CLS-compliant, there's no CLS-compliant ctor that covers the astral planes. int would be necessary too.

Two-way implicit conversions can lead to bad things happening with overloads, so one direction should perhaps be explicit. It's not clear which. On the one hand, uint/int is wider than code points, as values below 0 or above 10FFFF₁₆ aren't meaningful, and having that conversion implicit allows quicker use of more existing APIs for numbers. On the other hand, I can see wanting to cast from a number to a code point more often than the other way around.

whoisj commented 6 years ago

Since uint isn't CLS-compliant, there's no CLS-compliant ctor that covers the astral planes. int would be necessary too.

That is unless a new intrinsic type were introduced into the common language. <hint, wink, hint>

PeterSmithRedmond commented 6 years ago

JonHanna -- do you mean that these three constructors:

public static implicit operator uint (Rune rune);
public static implicit operator Rune (char ch);
public static implicit operator Rune (uint value);

should be "int" instead of "uint"? AFAICT, int easily covers the entire set of astral (non-BMP) planes.

JonHanna commented 6 years ago

@PeterSmithRedmond I mean that as well as the two constructors, one taking char and one taking uint, there should be one taking int, but yes there should also be an int conversion operator (just what should be implicit and what explicit is another question). There's no harm having uint too for those languages that can use it; it's quite a natural match after all.

fanoI commented 6 years ago

If this should replace System.Char, it should be possible to do "arithmetic" on it (that is, ==, !=, >, <; unsure about +, -, *, /) and, more importantly, there should be support for literals of this type. For example, I should be able to write:

rune r = '𐍈'; // Ostrogothic character chosen on purpose, as in UTF-16 it will be a surrogate pair
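A sketch of the comparison operators such a type could support (a hypothetical standalone struct, not the API proposed above):

using System;

public struct Rune : IEquatable<Rune>, IComparable<Rune>
{
    readonly uint value;
    public Rune(uint value) { this.value = value; }

    // Ordering and equality compare the underlying code point values.
    public static bool operator ==(Rune a, Rune b) => a.value == b.value;
    public static bool operator !=(Rune a, Rune b) => a.value != b.value;
    public static bool operator <(Rune a, Rune b) => a.value < b.value;
    public static bool operator >(Rune a, Rune b) => a.value > b.value;

    public bool Equals(Rune other) => value == other.value;
    public int CompareTo(Rune other) => value.CompareTo(other.value);
    public override bool Equals(object obj) => obj is Rune r && Equals(r);
    public override int GetHashCode() => (int)value;
}
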
benaadams commented 6 years ago


If not rune, the only other synonym of character that could work is perhaps letter?

noun

  1. a written or printed communication addressed to a person or organization and usually transmitted by mail.
  2. a symbol or character that is conventionally used in writing and printing to represent a speech sound and that is part of an alphabet.
  3. a piece of printing type bearing such a symbol or character.

Though that would conflict with letter vs number

tannergooding commented 6 years ago

Letter has an even more precise meaning in Unicode (and .NET in general) than rune.

whoisj commented 6 years ago

I think, if we're going to make this a Unicode character type, we need to follow Unicode's naming conventions, which means "code point".

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Or maybe we just give up and call a duck a "duck" and refer to them as Unicode Characters (aka uchar).