Ada-Rapporteur-Group / User-Community-Input

Ada User Community Input Working Group - Github Mirror Prototype

Some kind of Univ_String to provide better approach to support full Unicode? #40

Open sttaft opened 1 year ago

sttaft commented 1 year ago

Many languages are struggling with the transition to a world where Unicode is used more widely. One impetus behind Unicode these days is the ever-growing number of emojis and miscellaneous symbols that have been assigned Unicode character positions, often of more than 16 bits, thus requiring full 21-bit Unicode support.

Ada has Wide_Characters, Wide_Strings, Wide_Wide_Characters, and Wide_Wide_Strings, but these require a lot of advance planning, and a decision between Wide and Wide_Wide, both of which are annoying if the need for more Unicode support comes during maintenance of an existing program. Ada also has the UTF_Encoding package, but this results in "hiding" a UTF-8 string inside a Standard.String, which can be easier to introduce "after the fact," but then muddies the waters in terms of whether any given String is a sequence of Latin-1 characters or a sequence of multi-byte UTF-8 encodings.
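
To make the ambiguity concrete, here is a minimal sketch using only the standard Ada.Strings.UTF_Encoding packages: the same one-character text "é" can live in a Standard.String either as one Latin-1 byte or as two UTF-8 bytes, and nothing in the type of the object says which.

   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   procedure Show_Ambiguity is
      package UTF renames Ada.Strings.UTF_Encoding;

      --  "é" as a Latin-1 String: one Character at position 16#E9#.
      As_Latin_1 : constant String := (1 => Character'Val (16#E9#));

      --  The same text UTF-8-encoded: the two bytes 16#C3# 16#A9#.  Note
      --  that UTF.UTF_8_String is only a subtype of String, so both
      --  objects have the same type.
      As_UTF_8 : constant UTF.UTF_8_String :=
        UTF.Wide_Wide_Strings.Encode
          ((1 => Wide_Wide_Character'Val (16#E9#)));
   begin
      pragma Assert (As_Latin_1'Length = 1 and As_UTF_8'Length = 2);
      null;
   end Show_Ambiguity;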

One alternative worth considering is to define a new String type, say "Univ_String", which could provide a combination of the features of say, Unbounded_String, Wide_Wide_String, UTF_Encoding, and Universal Text Buffers (Ada 2022 RM A.4.12 -- http://www.ada-auth.org/standards/22aarm/html/AA-A-4-12.html). It would be a private type with the ability to generate a UTF-8 stream, a UTF-16 stream, a Standard.String with some substitution for characters outside Latin-1, a Wide_String with some substitution for characters outside of Wide_Character, and a Wide_Wide_String, plus the other capabilities of Unbounded_String.
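
A rough sketch of what such a private type's interface could look like (every name below is illustrative only, and the private part is just a placeholder):

   with Ada.Strings.UTF_Encoding;
   with Ada.Strings.Wide_Wide_Unbounded;

   package Univ_Strings is

      type Univ_String is private;

      --  Construction from each existing string type:
      function To_Univ (Item : String)           return Univ_String;
      function To_Univ (Item : Wide_String)      return Univ_String;
      function To_Univ (Item : Wide_Wide_String) return Univ_String;

      --  Generation of the various encodings; the narrowing conversions
      --  take a substitution character for code points that do not fit:
      function To_UTF_8
        (Item : Univ_String) return Ada.Strings.UTF_Encoding.UTF_8_String;
      function To_UTF_16
        (Item : Univ_String)
         return Ada.Strings.UTF_Encoding.UTF_16_Wide_String;
      function To_String
        (Item : Univ_String; Substitute : Character := '?') return String;
      function To_Wide_String
        (Item : Univ_String; Substitute : Wide_Character := '?')
         return Wide_String;
      function To_Wide_Wide_String
        (Item : Univ_String) return Wide_Wide_String;

      --  ... plus the editing operations of Unbounded_String (Append,
      --  Index, Slice, Replace_Slice, and so on).

   private
      --  Placeholder representation only; a real implementation would
      --  likely store UTF-8 with a short-string optimization.
      type Univ_String is record
         Data : Ada.Strings.Wide_Wide_Unbounded.Unbounded_Wide_Wide_String;
      end record;
   end Univ_Strings;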

AdaCore has a facility called the "Virtual String System" which probably has all of this and more, and could be a good starting point. Universal Text Buffers of Ada 2022 already comes pretty close, but you would want a version of Text_IO that uses Univ_String, and Univ_String versions of operations in other packages that currently use String, or have the full String/Wide_String/Wide_Wide_String complement of operations.

mosteo commented 1 year ago

If a new type were to be introduced (to rule them all?) I'd advocate to call it just plain Text.

Richard-Wai commented 1 year ago

If a new type were to be introduced (to rule them all?) I'd advocate to call it just plain Text.

This is brilliant! I also felt uncomfortable with Univ_, but couldn't think of anything better. I love this!

ARG-Editor commented 1 year ago

This problem has been on my radar for a long time. See, for instance, the barely-baked ideas given in the !discussion of AI12-0021-1 (http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai12s/ai12-0021-1.txt?rev=1.10&raw=N).

The problem with a single universal type is that it necessarily has to include some form of memory management. And whatever is chosen is likely to be inappropriate in some cases. Bounded forms are clunky (since Ada raises Constraint_Error when the capacities are different when using predefined :=), and unbounded forms cause unwanted use of heap memory (which causes issues with safety-critical applications). Supporting only one or the other has never been considered enough for Ada (we have usual and bounded containers, fixed, unbounded, and bounded strings, and so on).
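
The clunkiness of predefined ":=" on the bounded forms can be seen with any of them; a minimal example with Ada.Containers.Bounded_Vectors (the bounded text buffers behave analogously):

   with Ada.Containers.Bounded_Vectors;

   procedure Clunky_Assignment is
      package Char_Vectors is new Ada.Containers.Bounded_Vectors
        (Index_Type => Positive, Element_Type => Character);
      use Char_Vectors;

      A : Vector (Capacity => 10);
      B : Vector (Capacity => 20);
   begin
      A.Append ('x');
      B.Assign (A);  --  OK: copies the elements; B keeps its own capacity
      B := A;        --  raises Constraint_Error: the capacities differ,
                     --  even though the single element would fit
   end Clunky_Assignment;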

One starts to reintroduce (worsen, really, since none of the existing string types are going anywhere) the combinatorial explosion if one ends up with a number of types, each with different memory management. Thus it seems likely that some form of class-wide operations will ultimately be needed.

                  Randy.
Richard-Wai commented 1 year ago

The problem with a single universal type is that it necessarily has to include some form of memory management. And whatever is chosen is likely to be inappropriate in some cases. Bounded forms are clunky (since Ada raises Constraint_Error when the capacities are different when using predefined :=), and unbounded forms cause unwanted use of heap memory (which causes issues with safety-critical applications). Supporting only one or the other has never been considered enough for Ada (we have usual and bounded containers, fixed, unbounded, and bounded strings, and so on).

I'm not sure we necessarily need dynamic memory for such a thing. If memory efficiency is the goal, then that might be true, but I'm not sure that is always an important goal for modern applications. In places where memory efficiency is a priority, such as deeply embedded systems, I'm not sure being able to output emojis or "i18n" are priorities either. Generally speaking, we are not going to be talking about extreme amounts of data here.

All things considered, I don't think it would be crazy to allow the compiler to allocate a worst-case stack allocation of a Wide_Wide_String when the actual encoded size is unknown. Considering that UTF-8 will never take more space than the equivalent Wide_Wide_String, this would generally work, and allow Univ_String/Text to be a drop-in replacement for ((Wide_)Wide_)String.

Where the compiler knows ahead of time how much is needed (such as for an initialization from a string), it can just allocate enough for the UTF-8 encoding. Maybe there could be additional pragmas there for cases where determinism is important.
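
A sketch of the worst-case bound being relied on here: UTF-8 spends at most 4 octets per code point, which is exactly what a Wide_Wide_String already spends, so a stack buffer of 4 * Length octets always suffices (illustrative code only; Put_As_UTF_8 is an invented name):

   with Ada.Text_IO;
   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   procedure Put_As_UTF_8 (Item : Wide_Wide_String) is
      package UTF renames Ada.Strings.UTF_Encoding;

      --  4 octets per code point is the UTF-8 maximum, so this stack
      --  buffer can never overflow, whatever Item contains.
      Buffer : String (1 .. 4 * Item'Length);
      Last   : Natural := 0;
   begin
      for WWC of Item loop
         declare
            Unit : constant UTF.UTF_8_String :=
              UTF.Wide_Wide_Strings.Encode ((1 => WWC));
         begin
            Buffer (Last + 1 .. Last + Unit'Length) := Unit;
            Last := Last + Unit'Length;
         end;
      end loop;
      Ada.Text_IO.Put (Buffer (1 .. Last));
   end Put_As_UTF_8;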

Ergo I don't think it would be totally crazy to imagine Univ_String/Text being a glorified Wide_Wide_String with better flexibility vis-à-vis the more "narrow" variants or UTF8_String. Optimizations such as storing the content UTF-8-encoded, and trying to limit the allocated size of such an object, could be implementation considerations.

jprosen commented 1 year ago

Hmm.... This proposal reminds me of https://xkcd.com/927/ ;-)

ARG-Editor commented 1 year ago

Richard is thinking of a different design than I was. Tucker suggested a private type partially based on universal text buffers, which come in bounded and unbounded versions. Moreover, because this is a private type, one can only set a size via a discriminant (or a global setting) - no sign of bounds here.

While one could, I suppose, specify a number of octets and use some form of fixed representation, that is rather user-hostile. It's hard to know the number of octets in a given string, and the number of characters does not correspond to a number of octets in any easily calculable way. Moreover, Richard's suggestion doesn't help anything, as you would still have to keep track of how many octets are in use at a given time (you can't just use the whole memory allocation like you might with an Ada string object).

The text buffers side-stepped this problem by talking about characters and keeping a runtime number of characters. Thus, these work like bounded and unbounded containers (there is a capacity and an active length, and they can be different). Since you have to have some sort of runtime length anyway, it might as well be presented in user-friendly terms.

Bounded containers are annoying as in general assignment (:=) raises Constraint_Error when the capacities differ, even if the values fit. One has to use an Assign procedure, or give all of the objects the same capacity. The latter can be very wasteful in space. (Consider a program that puts the text of War-and-Peace into a text value, and then has various annotations on the book in other text values. One certainly doesn't want those values to all use the same amount of memory!) On most compilers, using mutable discriminants effectively turns the objects into ones that all hold the same amount (unless the objects are constrained upon creation, but then the assignment problem has been reintroduced, so there is no gain). (On the other compilers, some form of dynamic allocation is taking place under the covers, so any intent to avoid dynamic allocation has failed in that case.)

Fully unbounded containers (text values) avoid these problems but use dynamic allocation. For most uses, this is good enough. But I have no trouble imagining safety-critical systems that want to manage text in various natural languages, such as Chinese, Japanese, or Russian. Forcing such users to use the painful existing support isn't going to benefit that important Ada customer base.

The universal text buffers tried to avoid making this choice by having an abstract root type that one would normally use in operations (rather than a specific type that is either bounded or unbounded). We would certainly have to do that here, because adding two more of everything is unappealing. We'd also have to somehow make the more advanced operations of Unbounded_Strings available (probably not in the root type). So this is a rather big job (but probably not much worse than Big_Numbers were).
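
For illustration, a spec-only sketch of that class-wide approach, modeled loosely on the Ada 2022 text buffers (every name here is invented):

   package Texts is

      type Root_Text_Type is abstract tagged limited private;

      --  Primitive operations, dispatching on the concrete form:
      procedure Append
        (T : in out Root_Text_Type; Item : Wide_Wide_String) is abstract;
      function Length (T : Root_Text_Type) return Natural is abstract;

      --  Most subprograms would take Root_Text_Type'Class, so only the
      --  declaration of an object chooses bounded versus unbounded:
      procedure Put (T : Root_Text_Type'Class);

   private
      type Root_Text_Type is abstract tagged limited null record;
   end Texts;

   --  Child packages Texts.Bounded (with a capacity discriminant) and
   --  Texts.Unbounded would supply the two concrete types, as the
   --  Ada 2022 text buffers do.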

                                Randy.

Blady-Com commented 1 year ago

I can only encourage such an initiative. I tried such an experiment with UXStrings. My first motivation was to spare the user of the Ada language from having to make a choice in the representation of character strings. With the current standard, the choice must be made according to the nature of the characters handled (Character, Wide_Character or Wide_Wide_Character) and the adaptation of the string size according to the operations carried out. Moreover, depending on the libraries used, making a single choice is generally not possible, which leads to continuous conversions. A single Univ_String type (or any other name) is desirable for all uses. This would simplify all the current redundant libraries. This need is essential for native GUI applications.

My conviction: the user should not have to make a choice, and therefore the compiler must adapt to the uses. Is it possible? For me, the essential question concerns rather the proposed programming interface. An appealing interface will satisfy users. Do we care about the efficiency of Ada.Containers.Vectors? Probably sometimes, but we are looking first and foremost for completeness and ease of use. If the efficiency is less than what the user requires, he can still fall back on the currently existing *String types. Regarding intrinsic efficiency, some examples can help, such as the implementations of GNATCOLL "XString", VSS "Virtual_String" and, modestly, UXStrings1 or UXStrings2. I'm considering a UXStrings3 with a vector of Wide_Wide_Character whose efficiency I'll test with Gnoga applications.

sttaft commented 1 year ago

Your UXStrings looks like a nice start. And I agree, numerous examples are needed to decide on the best API from a user perspective.

Fabien-Chouteau commented 1 year ago

The way I see it:

  • Have two complementary types, e.g. String and Text.

  • String is bounded, UTF-8-encoded string data based on standard arrays. More or less what we have right now, but UTF-8-encoded.

  • Text is growable/shrinkable, also UTF-8-encoded, and provides advanced operations for manipulating its content (starts/ends with, find, split, iterate, capitalize, etc.).

  • Both types should be in the Standard package. No need to with/use a special package.

  • Provide encoding to/from a "standard" raw data array (byte array), i.e. System.Storage_Elements.Storage_Array.

  • All Text_IO facilities are able to output both String and Text.

  • All the Wide_*, Unbounded, etc. types, packages and sub-programs are deprecated.

sttaft commented 1 year ago

Sounds nice, but uncomfortably incompatible. Most facilities in Ada require "with"ing packages, so I don't see sufficient benefit in avoiding that to justify stuffing more into package Standard. Also, I can imagine several "child" packages that might be associated with a Text type, and that wouldn't work as well if Text is in package Standard. However, I could imagine having a UTF_8_String type (as opposed to the subtype now used in UTF_Encoding) in the Text package, along with the other items you mention.

One thing I could imagine is specifying that certain of the Ada standard library packages are defined to be implicitly "with"ed and "use"d. This would help to make a simple program that much simpler, without overstuffing package Standard. Of course, deciding which ones get that treatment itself could be a battle. A better approach might be to define a configuration pragma that would allow the specification of a list of library packages that should be implicitly with'ed and use'd by all library units in the associated library, and there might be a default setting for that configuration pragma.
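
A sketch of what that configuration pragma might look like (the pragma name and its semantics are pure invention here):

   --  In a configuration file such as gnat.adc (hypothetical pragma):
   pragma Implicit_Context ("Ada.Text_IO", "Ada.Strings.Text");
   --  Every library unit in the library would then be compiled as if it
   --  began with "with Ada.Text_IO; use Ada.Text_IO;", and likewise for
   --  the (equally hypothetical) Ada.Strings.Text.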

jprosen commented 1 year ago

Don't confuse the container and the content. On the container side: we have String for fixed containers, and Unbounded_String for variable containers. I don't see why we should add anything else, especially in a very incompatible manner (even "Text" is likely to be an identifier already used by many programs).

As for the content: you can put anything in a String, encoded or not. The only thing that's missing is search/compare etc. operations that assume that the content is UTF encoded.

Note that in my original proposal (I was the author of the AI that led to UTF_Encoding), I had UTF_String as a full type rather than a subtype of String. This was discussed, and it was felt that it would lead to many conversions that would make its use more complicated. Is there anything new here?

sttaft commented 1 year ago

Note that in my original proposal (I was the author of the AI that led to UTF_Encoding), I had UTF_String as a full type rather than a subtype of String. This was discussed, and it was felt that it would lead to many conversions that would make its use more complicated. Is there anything new here?

What is new is that we now have more experience using the UTF_Encoding packages, and how they interact with Text_IO, which might be doing its own UTF encoding implicitly. Clearly, if Text_IO interprets String as a sequence of Latin-1 characters, and "knows" it is producing UTF-8-encoded output, then it will have to re-express each Latin-1 character with the high bit on as a two-byte UTF-8 sequence. If in fact the original String passed to Text_IO.Put was already encoded in UTF-8, this implicit encoding, presuming a Latin-1 starting point, is going to produce a complete mess.
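
That "complete mess" is easy to reproduce with the standard packages alone; a minimal sketch (the procedure name is invented, the library calls are standard):

   with Ada.Strings.UTF_Encoding.Strings;

   procedure Double_Encoding is
      package UTF renames Ada.Strings.UTF_Encoding;

      --  "é" already UTF-8-encoded, but held in a Standard.String:
      Already_UTF_8 : constant String :=
        Character'Val (16#C3#) & Character'Val (16#A9#);

      --  A layer that presumes Latin-1 re-encodes each byte separately,
      --  yielding the four bytes C3 83 C2 A9 ("Ã©") instead of C3 A9:
      Mangled : constant UTF.UTF_8_String :=
        UTF.Strings.Encode (Already_UTF_8);
   begin
      pragma Assert (Mangled'Length = 4);
      null;
   end Double_Encoding;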

If, on the other hand, we had a separate type for UTF-8-encoded sequences of bytes, as well as a more abstract type that allowed efficient representation of unbounded, extensible sequences of Unicode characters ("Text"), then that would seem to provide better, more robust support for the Unicode character set.

I agree we don't want to add either of these new types to package Standard for compatibility reasons. But I do believe we should do something to make Ada friendlier to the use of extended characters, and get out of the crazy XX/Wide_XX/Wide_Wide_XX triplet of packages, types, operations, etc.

Extended characters are not just for Asian languages these days -- lots of software has started using extended characters for emojis, check boxes, etc. And even in my own limited experience with using extended characters, I have found that the existing facilities are awkward, unintuitive, and error prone.

jprosen commented 1 year ago

What is new is that we now have more experience using the UTF_Encoding packages, and how they interact with Text_IO, which might be doing its own UTF encoding implicitly. Clearly, if Text_IO interprets String as a sequence of Latin-1 characters, and "knows" it is producing UTF-8-encoded output, then it will have to re-express each Latin-1 character with the high bit on as a two-byte UTF-8 sequence. If in fact the original String passed to Text_IO.Put was already encoded in UTF-8, this implicit encoding, presuming a Latin-1 starting point, is going to produce a complete mess.

True, but this is more a matter of using the Form parameter - as GNAT does.

If, on the other hand, we had a separate type for UTF-8-encoded sequences of bytes, as well as a more abstract type that allowed efficient representation of unbounded, extensible sequences of Unicode characters ("Text"), then that would seem to provide better, more robust support for the Unicode character set.

Like many people, you are confusing Unicode and UTF-8. Of course, Unicode is necessary nowadays, and it is provided with Wide_Character (if you only need the BMP) and Wide_Wide_Character otherwise. I agree that the names are a bit of a nuisance, but that's a minor point. UTF-8 is nothing more than a compression algorithm, a way to use fewer bytes to represent ISO 10646 (preferred to Unicode, we are an ISO committee, aren't we ;-) ). For I/O, it is necessary to be able to produce and receive UTF-8 encoding - this can be done with the Form parameter or the UTF_Encoding packages. Internally, I don't see much benefit in keeping strings as UTF-8. It makes indexing or simply getting the length of a string more difficult, and there is plenty of memory. I understand the case of big buffers where it makes sense, and packages working directly with UTF strings can be useful in that case. But I still think that the request comes from people who equate Unicode with UTF-8. The fact that many people are confused is not sufficient to justify a whole bunch of new packages, especially if they introduce incompatibilities.

briot commented 1 year ago

On 2023-03-08 09:05, Jean-Pierre Rosen wrote:

UTF-8 is nothing more than a compression algorithm, a way to use fewer bytes to represent ISO 10646 (preferred to Unicode, we are an ISO committee, aren't we ;-) ). For I/O, it is necessary to be able to produce and receive UTF-8 encoding - this can be done with the Form parameter or the UTF_Encoding packages. Internally, I don't see much benefit in keeping strings as UTF-8. It makes indexing or simply getting the length of a string more difficult, and there is plenty of memory.

I agree with Jean-Pierre here. Storing strings in memory as UTF-8 doesn't make too much sense in practice, except for backward compatibility and reusing the String type to store bytes (which we all agree creates its own ambiguities).

I made that choice when I developed GNATCOLL.Strings several years ago: it is a generic package, but the idea is that the XString stores code points, not bytes.

This trades memory for performance (since we do not have to spend time checking that we are on the boundary of a multi-byte encoding, for instance), makes it possible to use substrings more conveniently, and so on.

Using UTF-8 internally might be compatible with other languages on Unix systems, which is nice. However, I believe Windows prefers UTF-16 for its in-memory representation, for instance. So code that needs to work on both Linux and Windows must remember to convert the internal bytes to what the system call expects. At least if we store code points internally, this conversion is always mandated and cannot be forgotten.

Emmanuel

mosteo commented 1 year ago

the request comes from people who equate Unicode with UTF-8. The fact that many people are confused is not sufficient to justify a whole bunch of new packages

One can know the difference and still find the support for them in Ada worthy of improvement (and I didn't interpret @sttaft as conflating them?). If anything, I'd say the problem is not the confusion of Unicode with UTF-8, but the assumption that String is encoding-agnostic instead of Latin-1, with the implications that has for Text_IO.

I see value in a separate type that is explicitly UTF-8, standard, and visible by default, to complement Wide_Wide_String. I'm also wary of having the same package panoply as the other three Character and String types. There's a difference in that a new character type wouldn't be introduced, but I've not thought about where the cascade of new packages would stop, or what would be truly necessary.

One orthogonal minimalistic possibility is to have a new String_IO (or Raw_IO, or ...) that indeed treats strings as agnostic byte sequences, and you are on your own about what their encoding is. A souped-up GNAT.IO. Yes, there are other ways of achieving the same, but none of them are one "with" away, portable, and beginner-friendly.
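
A minimal sketch of such a Raw_IO, built on the standard stream facilities (the procedure name is invented, and error handling is omitted):

   with Ada.Streams.Stream_IO;

   --  Write a String's bytes to a file verbatim; the caller alone
   --  decides what encoding those bytes are in.
   procedure Raw_Put (File_Name : String; Item : String) is
      use Ada.Streams.Stream_IO;
      F : File_Type;
   begin
      Create (F, Out_File, File_Name);
      String'Write (Stream (F), Item);  --  no code-page or UTF translation
      Close (F);
   end Raw_Put;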

(Btw, I realize now that Text is too easy to relate to Text_IO...)

beemirang commented 1 year ago

I've got several comments to add, but I'll start with this: I've been working on a project to work through tutorials for different languages (e.g. C, C++, Java, Rust, Kotlin, Python, C#, and Go so far), but writing these in Ada. As I have gone along, I've created abstractions for capabilities that were missing from the Ada standard. One such area is string processing. I ended up creating string packages (Strings, Wide_Strings, and Wide_Wide_Strings) that are based on the Ada.Containers.Vectors package. What I have now, I find, is easier to use than packages such as Unbounded_String.

To summarize what I did: first I created my own Vectors package, derived from Ada.Containers.Vectors, and added some missing functionality, particularly in the area of slices. With my package, you can obtain a slice of a vector as a String, delete a slice from a string, insert a slice into a string, replace a slice of a string with another slice that is not necessarily of the same length, etc. The idea is that you should be able to do with vectors the same sorts of operations that you can do with a regular Ada String, while taking advantage of the Vectors' resizing capabilities.

From my Vectors, Wide_Vectors, and Wide_Wide_Vectors, I created Strings, Wide_Strings, and Wide_Wide_Strings, which derive their types from instantiations of the generic packages using Character, Wide_Character, and Wide_Wide_Character.

The String types of these packages are similar to the String types in Ada, in that they are arrays of Unicode code points.

But within these packages I also derive UTF-encoded types. The Strings package has a UTF_8_String type, the Wide_Strings package has a UTF_16_String type, and the Wide_Wide_Strings package has a UTF_32_String type.

For example, my Strings package has the following declarations:


-- UTF-8 Encoded String Support

type UTF_String is new Unbounded_String with null record;

subtype UTF_8_String is UTF_String
  with Dynamic_Predicate => UTF_8_Encoded (UTF_8_String);

function UTF_8_Iterate (Container : UTF_String)
  return Text_Strings.Element_Vectors.Vector_Iterator_Interfaces.Reversible_Iterator'Class;

type Unicode_Code_Point is range 0 .. 16#10FFFF#;

function Code_Point (Position : Cursor) return Unicode_Code_Point;
function Code_Point (Position : Cursor) return String;

subtype Code_Point_Length_Value is Natural range 0 .. 4;
subtype Code_Point_Octet_Length is Code_Point_Length_Value range 1 .. 4;

function Code_Point_Length (Position : Cursor) return Code_Point_Length_Value;

function UTF_8_Length  (Item : UTF_String) return Count_Type;
function UTF_8_Encoded (Item : UTF_String) return Boolean;

I can iterate through the code points of a UTF-8 encoded string, for example, and obtain each code point's value, either as a number or in "U+0000" format, etc.
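
A hypothetical usage sketch against the declarations above (To_UTF_String is an assumed constructor, not shown, and Ada.Text_IO is assumed to be withed):

   declare
      S : constant UTF_8_String := To_UTF_String ("héllo");
   begin
      for Position in UTF_8_Iterate (S) loop
         --  The String-returning Code_Point overload yields "U+0068" etc.
         Ada.Text_IO.Put_Line (Code_Point (Position));
      end loop;
   end;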

With these, I seem to have all the capabilities I might need for processing Unicode Strings.

I made the UTF_String types derived from the String type, rather than a subtype of the String type, because I feel that they should be different types, just as Fahrenheit and Celsius would be different floating point types. It isn't safe to assign a String to the UTF_String type, because while it might be OK, it might also be invalid, depending on the range of characters in the string.

If we were to look at standardizing some of these ideas, I'd say we could look at adding splicing capabilities to the existing Ada.Containers.Vectors package, and then similarly creating String packages as instantiations of that Vectors package.

beemirang commented 1 year ago

On the topic of Univ_String, one thing that holds some appeal to me is the idea of having a Universal_String notion in Ada. Similar to Universal_Integer and Universal_Real, whose literals are constant values with unlimited precision and no hardware representation, there might be value in having a Universal_String that similarly has unlimited size and, while it might have a known encoding such as UTF-8 in source code, can be assigned to objects of types with other encodings, such as UTF-16 or UTF-32.

Say something like:

S : constant := "tสวัสดี"; -- Note: the declaration names no type, therefore S is a Universal_String

S8  : constant UTF_8_String  := S;
S16 : constant UTF_16_String := S;
S32 : constant UTF_32_String := S;

Implicit conversions to the three different string types occur, similar to how implicit conversions work for Universal_Real and Universal_Integer constants.

I haven't thought through if such an idea would be possible in Ada, but seems like it might be worth considering at least.

Brad

beemirang commented 1 year ago

On the topic of being able to slice strings based on Containers.Vectors,

In my package, I can write:

S : MyStrings.String := "Hello world";

S.Replace_Slice (From => 7, To => 7, Slice => "My Great Big W"); -- Gives "Hello My Great Big World"

It would be nice if we could somehow add syntactic sugar to be able to write this similar to a regular Ada String.

For example:

S (7 .. 7) := "My Great Big W";

Here we wouldn't be limited to having the replacement slice be the same size as the target slice.

Brad

beemirang commented 1 year ago

I think some work is needed towards improving the ability to write string literals:

  1. It should be possible to express strings naturally using their symbolic representation, rather than having to specify a sequence of hex values.

  2. A string literal should be expressible consistently regardless of whether it is within the Latin-1 range or outside that range.

  3. The manner of expressing a string literal should be consistent between the UTF-8, UTF-16, and UTF-32 (String, Wide_String, and Wide_Wide_String) types.

  4. Special compiler switches ideally should not be needed for this all to work.

Currently, for example with the GNAT compiler, you need to use either the -gnatWb or -gnatW8 compile switch to declare string literals with non-Latin characters. The -gnatW8 switch perhaps most closely matches what I describe above, except that you end up having to declare strings with non-Latin characters as Wide_Wide_Strings, and then call Encode functions from Ada.Strings.UTF_Encoding.Wide_Wide_Strings to convert these to UTF-8 (String) or UTF-16 (Wide_String) types. My understanding of how this works is effectively that the UTF-8-encoded literal in the source code is converted to a set of Unicode code points, and those code point values are then converted to Character, Wide_Character, or Wide_Wide_Character, depending on the type of the enclosing expression. If the code point values are outside the range of the Character type, then you get a compile-time error. So Latin-1 characters never have an issue, but code points outside that range cannot be assigned to a Character type. I think what is missing is the ability to have an implicit conversion from the literal to the corresponding encoded type: UTF-8 or UTF-16 when the string type is String or Wide_String, respectively. Then you could initialize any string type with a string literal containing any Unicode characters. Perhaps the Universal_String idea would be helpful here.
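
For reference, the workaround described above looks like this today; a small sketch assuming the source is compiled with -gnatW8 (the unit name is illustrative):

   with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   procedure Literal_Workaround is
      package UTF renames Ada.Strings.UTF_Encoding;

      --  The literal must first be typed as Wide_Wide_String ...
      WWS : constant Wide_Wide_String := "héllo";

      --  ... and then explicitly converted to the encoded forms:
      S8  : constant UTF.UTF_8_String :=
        UTF.Wide_Wide_Strings.Encode (WWS);
      S16 : constant UTF.UTF_16_Wide_String :=
        UTF.Wide_Wide_Strings.Encode (WWS);
   begin
      null;
   end Literal_Workaround;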

sttaft commented 1 year ago

I think it is quite important to distinguish issues relating to source-code representation, lexical semantics, and run-time input/output.

Standard.Character is semantically a representative of a character from the Latin-1 subset of Unicode. No doubt there are programs that treat it as a representative of other Latin-n character sets, or as bytes drawn from a multi-byte representation (such as Shift-JIS and UTF-8). This is not ideal, and the language should provide means to avoid having to do that if possible.

Similarly, Wide_Character represents a character from the BMP (Basic Multilingual Plane) of Unicode, but almost certainly other interpretations are used in certain programs (such as Shift-JIS or UTF-16).

Wide_Wide_Character is probably the only character type that seems to be unambiguously used to represent a single Unicode character, though it is possible that some of the "high bits" of the 32-bit word are hijacked for other purposes.

String and Wide_String are also used for multiple distinct semantic purposes, while again Wide_Wide_String seems to be used, generally, as an array of Unicode characters.

I like the suggestion of adopting the term "Text" for an extensible vector of Unicode characters (though not in package Standard!), and Brad's suggestions about slicing make a lot of sense as well (I believe VSS supports this sort of thing, but the language itself could give more capability here). Ideally "Text" would be usable with Character, Wide_Character, and Wide_Wide_Character, but always interpreting them as Latin-1, Unicode BMP, and full Unicode.

So the above is all about semantics.

A distinct issue is source-code representation. Ultimately, from a semantic point of view, source code should be seen as a sequence of Unicode characters. Whether this sequence is represented using UTF-8, UTF-16, UTF-32, or something like Shift-JIS, Latin-1, etc., is an appropriate thing to be determined by compiler switches, such as -gnatW8.

Finally, there is run-time input/output, as represented by Text_IO. This should not be tied to source-code representation in my view, so the fact that -gnatW8 affects Text_IO in GNAT seems like a source of confusion (especially for standard input/output, where there is rarely a connection between source-code representation and what the terminal is expecting).

Using something like GNAT's "WCEM" form parameter makes sense here, as different files might very well have different encodings. As mentioned, tying this to the source-code representation is not a great idea in my view, and has certainly created confusion in my experience.

So ... I would encourage any discussion to clearly separate these three aspects of "universal" string support (and there very well might be more than three distinct issues).

eggrobin commented 1 year ago

Elaborating on some comments I made at the ARG meeting in response to Tucker’s comment above:

Standard.Character is semantically a representative of a character from the Latin-1 subset of Unicode. No doubt there are programs that treat it as a representative of other Latin-n character sets, or as bytes drawn from a multi-byte representation (such as Shift-JIS and UTF-8). This is not ideal, and the language should provide means to avoid having to do that if possible.

Similarly, Wide_Character represents a character from the BMP (Basic Multilingual Plane) of Unicode, but almost certainly other interpretations are used in certain programs (such as Shift-JIS or UTF-16).

Wide_String as UTF-16 (and Wide_Character as UTF-16 code unit) is very different from String as UTF-8 (and Character as UTF-8 code unit), let alone anything involving Shift-JIS.

If String is used both for UTF-8 and for Latin-1, the String Character'Val(16#C3#) & Character'Val(16#A9#) is genuinely ambiguous between "é" (read as UTF-8) and "Ã©" (read as Latin-1).

If Wide_String is used both for sequences of BMP characters and for UTF-16 strings, there is no such ambiguity: the Wide_String Wide_Character'Val(16#D808#) & Wide_Character'Val(16#DE6D#) is ill-formed under the first interpretation, because it is a sequence of things that are not characters (see Table 2-3 in the Unicode Standard, Version 15.0) and indeed are not interchangeable in any UTF (Unicode encoding forms only encode Unicode scalar values). That Wide_String is "𒉭" under the second interpretation.

Tucker noted that since the indexing of a Wide_String is in terms of 16-bit units, a Wide_String holding UTF-16 can be sliced into something that is not valid UTF-16.

This is true. However, neither Wide_String nor Wide_Wide_String really provides guarantees of well-formedness: Wide_Wide_Character allows the use of all 32 bits (thus including 32-bit integers that are not code points), and both Wide_Character and Wide_Wide_Character allow for surrogates.

The subtypes of Wide_String and Wide_Wide_String for valid UTF-16 and UTF-32, respectively, would look something like this:

   subtype High_Surrogate is Wide_Character
      range Wide_Character'Val (16#D800#) .. Wide_Character'Val (16#DB7F#);
   subtype Low_Surrogate is Wide_Character
      range Wide_Character'Val (16#DC00#) .. Wide_Character'Val (16#DFFF#);
   subtype UTF_16_String is Wide_String
      with Dynamic_Predicate =>
        (for all I in UTF_16_String'Range =>
            (if UTF_16_String (I) in High_Surrogate then
               I + 1 in UTF_16_String'Range and then
                  UTF_16_String (I + 1) in Low_Surrogate));
   subtype Code_Point is Wide_Wide_Character range
      Wide_Wide_Character'Val (0) .. Wide_Wide_Character'Val (16#10FFFF#);
   subtype Scalar_Value is Code_Point with
      Static_Predicate => Scalar_Value not in
         Code_Point'Val (16#D800#) .. Code_Point'Val (16#DFFF#);
   subtype UTF_32_String is Wide_Wide_String
      with Dynamic_Predicate =>
         (for all Code_Unit of UTF_32_String => Code_Unit in Scalar_Value);

The code point is the basic unit in Unicode algorithms, so when handling UTF-16 strings, iteration over code points is useful; indeed the languages whose string type is UTF-16 tend to provide that: see Java's String.codePoints (https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/lang/String.html#codePoints()) and C#'s String.EnumerateRunes.
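
For concreteness, decoding one surrogate pair into the scalar value it represents is a small computation; a minimal sketch reusing the subtypes above (Decode_Pair is an invented name):

   --  Sketch only: High_Surrogate, Low_Surrogate, and Scalar_Value are
   --  the subtypes declared earlier in this comment.
   function Decode_Pair
     (High : High_Surrogate; Low : Low_Surrogate) return Scalar_Value
   is
     (Scalar_Value'Val
        (16#1_0000#
         + (Wide_Character'Pos (High) - 16#D800#) * 16#400#
         + (Wide_Character'Pos (Low) - 16#DC00#)));
   --  Decode_Pair applied to 16#D808# and 16#DE6D# yields
   --  Wide_Wide_Character'Val (16#1226D#), the "𒉭" mentioned above.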

It was mentioned in the meeting that there are larger units that are often useful, most notably Grapheme Clusters (“user-perceived characters”).