Handling of encodings of Dao strings

daokoder commented 10 years ago

I have been considering to change a few things about Dao strings. You know, Dao can store a string either as Multi-Byte String (MBS) or Wide Character String (WCS), and support automatic conversions between them. Currently, when converting MBS to and from WCS, a MBS string is assumed to be encoded with system encoding.

I am considering to change the assumed encoding to UTF-8, this will not only make conversion between MBS and WCS potentially a lot more efficient, probably also make Dao programs more portable.

Basically the following will be done:

Program source code will be converted to UTF-8 upon parsing if it is not in UTF-8 already;
When a MBS is converted to WCS, the bytes in UTF-8 will be converted, and the other bytes will be simply copied; When converting WCS to MBS, similar rules will apply;
When chars are appended to MBS or wide chars are appended to WCS, they are simply copied;
When chars are appended to WCS, or wide chars to MBS, they are first converted in the way as mentioned above;
When a string is printed with %s, it will be converted to system encoding MBS first; When a string is printed with %S (planned), MBS will be printed as it is, and WCS will be converted to UTF-8 MBS for printing.
New methods will be added for checking and converting UTF-8 strings;

These should have covered most scenarios that require handling of string encoding. If I missed something, please add.

@Night-walker I may need your UTF-8 encoder for this:)

dumblob commented 10 years ago

I think the proposed change is a must - I've been thinking about "UTF-8 rules them all" while having built-in support for WCS handling already more than once, but wasn't convinced it's right time to open up this issue :)

Night-walker commented 10 years ago

String representation is exactly what I have been thinking over recently. You started the issue, so now you'll have to bear with me trying to push in my grand ideas -- you brought it on yourself :)

I have a simple idea. Kick out wide strings, entirely.

The problems of wchar_t in C/C++ inherited by Dao:

it's not portable, you cannot rely on it to have same size and encoding
it's not guaranteed to be able to represent any Unicode code point
it's inherently inefficient becase it always requires conversion from/to external encoding

Ruby and Go have byte strings only. Rust even gurantees that any properly created string is a valid UTF-8 sequence. Not counting C and C++, I know of only Python to have a mess with Unicode support similar to (and probably because of) wchar_t. Virtually all other modern languages have a single string representation.

It seems uncomfortable to not being able to treat single string element as a character. But in practice there are very few cases where it can be a problem. And it is quite easy to provide some means to work with a byte string on character-wise basis:

for loop utilizing mbtowc() to traverse strings
same for code section methods
a function to get the array of Unicode code points from a string, and vice versa
various other minor stuff to work with characters, like getting Nth char, appending a char, etc.

Finally, the internal handling of strings in the core and modules will be much simpler and easier to maintain. And the strings will be simpler for the user to reason about as well.

P.S.

@Night-walker I may need your UTF-8 encoder for this:)

Just don't forget to park it back in my garage when you're done :)

daokoder commented 10 years ago

OK, it seems everyone has an issue with string. So let's do something about it:)

Removing WCS completely? This option had been on my radar before, you just remind me to reconsider it again:) I had thought wchar_t works the same on different platforms, apparently I was wrong. Also, WCS turns out to be only occasionally useful so far, removing WCS sounds more reasonable now. With WCS, the code size should reduce considerably (I have always wanted to cut down something:) ).

for loop utilizing mbtowc() to traverse strings

mbstowcs() won't be necessary, I still prefer to the assumed/default encoding of strings to be UTF-8 and handle it accordingly and automatically. Other encodings have to be handled explicitly. This should be a more portable and consistent approach.

Night-walker commented 10 years ago

for loop utilizing mbtowc() to traverse strings mbstowcs() won't be necessary

Not mbstowcs(). mbtowc(), to step one MBS character forward. However, if you want UTF-8 to be the presumed encoding of any string, then it all becomes simpler. UTF-8 can be traversed in any direction starting from any byte.

It may also make sense to convert local encoding to UTF-8 automatically when reading from a file. Ideally, encoding should be bound to the file stream upon its creation, so that all the strings you obtain from it are already in UTF-8.

By the way, any local encoding conversion in pure C will require critical sections with setlocale() to be implemented properly.

Night-walker commented 10 years ago

The original encoder:

inline char FormU8Trail( uint_t cp, int shift )
{
    return ( ( cp >> 6*shift ) & 0x3F ) + ( 0x2 << 6 );
}

static void DaoEnc_EncodeUTF8( DaoProcess *proc, DaoValue *p[], int N )
{
    DString *str = p[0]->xString.data;
    DString *out = DaoProcess_PutMBString( proc, "" );
    if ( str->mbs ){
        DaoProcess_RaiseException( proc, DAO_ERROR, "String already encoded" );
        return;
    }
    for ( daoint i = 0; i < str->size; ){
        wchar_t ch = str->wcs[i];
        uint_t cp;
        if ( sizeof(wchar_t) == 4 ) // utf-32
            cp = (wchar_t)ch;
        else { // utf-16
            if ( ch >= 0xD800 && ch <= 0xDBFF ){ // lead surrogate
                if ( i < str->size - 1 && str->wcs[i + 1] >= 0xDC00 && str->wcs[i + 1] <= 0xDFFF ) // trail surrogate
                    cp = ( ( (uint_t)ch - 0xD800 ) << 10 ) + ( (uint_t)str->wcs[i + 1] - 0xDC00 );
                else
                    goto Error;
                i++;
            }
            else // bmp
                cp = (wchar_t)ch;
        }
        if ( cp < 0x80 )            // 0xxxxxxx
            DString_AppendChar( out, (char)cp );
        else if ( cp < 0x800 ){ // 110xxxxx 10xxxxxx
            DString_AppendChar( out, (char)( ( cp >> 6 ) + ( 0x6 << 5 ) ) );
            DString_AppendChar( out, FormU8Trail( cp, 0 ) );
        }
        else if ( cp < 0x10000 ){   // 1110xxxx 10xxxxxx 10xxxxxx
            DString_AppendChar( out, (char)( ( cp >> 12 ) + ( 0xE << 4 ) ) );
            DString_AppendChar( out, FormU8Trail( cp, 1 ) );
            DString_AppendChar( out, FormU8Trail( cp, 0 ) );
        }
        else if ( cp < 0x200000 ){  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            DString_AppendChar( out, (char)( ( cp >> 18 ) + ( 0x1E << 3 ) ) );
            DString_AppendChar( out, FormU8Trail( cp, 2 ) );
            DString_AppendChar( out, FormU8Trail( cp, 1 ) );
            DString_AppendChar( out, FormU8Trail( cp, 0 ) );
        }
        else
            goto Error;
    }
    return;
Error:
    DaoProcess_RaiseException( proc, DAO_ERROR, "Invalid code unit found" );
}

Night-walker commented 10 years ago

A remark: since double-quoted "" string literals become vacant, they may be utilized as a short form of verbatim strings or so.

dumblob commented 10 years ago

What I thought about was separation of string from byte array completely, but providing means to convert and cast them seamlessly (just like byte array being another encoding). I want to have a possibility to work with strings as easily a it's an array of bytes.

Night-walker commented 10 years ago

MBS string, which is going to be the only string representation, is actually just a byte array. Nothing will have to be changed in the existing stuff like indexing a string or getting its size, so it will continue to be just an array of char.

There is nothing to separate then. Means can be added to convert a string to array of Unicode characters and vice versa, but it is not exactly about encoding.

daokoder commented 10 years ago

@Night-walker, thanks for the encoder!

since double-quoted "" string literals become vacant, they may be utilized as a short form of verbatim strings or so.

I was actually considering to use '' for char, and "" for string, like in C/C++.

@dumblob

What I thought about was separation of string from byte array completely, but providing means to convert and cast them seamlessly (just like byte array being another encoding). I want to have a possibility to work with strings as easily a it's an array of bytes.

Of course, you can work with strings as easily as with arrays of bytes. A string is essentially an array of bytes, and supports operations that allow it to be used like an array. I don't see why you believe it is not convenient enough.

dumblob commented 10 years ago

So for example if all strings will be UTF-8 by default, I suppose 'čřžď'[3] will return the last codepoint, but what if I want to treat it like byte-array for my own parsing (e.g. because I know the string was read from some file with corrupted UTF-8 and I want to handle errors myself)? How would I access individual bytes?

Night-walker commented 10 years ago

So for example if all strings will be UTF-8 by default, I suppose 'čřžď'[3] will return the last codepoint, but what if I want to treat it like byte-array for my own parsing (e.g. because I know the string was read from some file with corrupted UTF-8 and I want to handle errors myself)? How would I access individual bytes?

It will be byte indexing, and slicing will still work byte-wise, and string size will be returned in bytes, and etc. For working with characters, there will be additional means.

However, it's questionable how for-in should treat strings -- as byte sequences or as character sequences? Go, for instance, iterates strings character-wise in this case. But not Rust, which has a special iterator for that, while defaulting to byte-wise handling.

dumblob commented 10 years ago

Well, if almost everything will stay byte-wise, then the for-in loop should be definitely also byte-wise. I'm though curious how the means for treating characters (codepoints) will look like :)

I really like the simplicity of indexing, slicing, for-in etc. and would like to use it also for characters, but it would imply introduction of another type "codepoint string" or switching the default string manipulation from byte-wise to character-wise and introduce type for byte array. Assignment between the "codepoint string" and "byte array" wouldn't be allowed, but casting would behave like reinterpret_cast from C++ and each of these two types would have the same interface methods for conversion (to_byte_array, to_codepoint_string).

Night-walker commented 10 years ago

Look at Ruby or Rust strings. Something like str.char(i) and str.iterate_chars {...} is trivial to support. In most cases, however, you need searching, regex-matching and slicing with byte indexes obtained from those searches and matches. Also take into account that string parsing is virtually always done on ASCII basis, i.e. only ASCII characters are relevant. All this makes it quite safe to assume that any character equals byte.

There are really very few cases when you have to index multi-byte characters or iterate them one by one.

In fact, I also had doubts about how safe and convenient it is to work with UTF-8 strings. But after inspecting some use cases I concluded that it's a rather exceptional case when a multi-byte character can compromise code correctness or require special handling.

dumblob commented 10 years ago

I must disagree as a non-ASCII-country based citizen. I usually use regexps (and any other searching and string-handling) with character strings and I'm really disappointed if it's not available (e.g. GNU coreutils, grep, sed, awk...). In these days more and more data gets internationalized and the need for out-of-box support for such data enormously grows.

Therefore I'm not fully satisfied with the OOP-like approach (i.e. having methods for character-wise handling) for treating majority of our input data (for example take random data from web - you'll get most of them in UTF-8 or other internationalized encoding, not in ASCII any more :( ).

daokoder commented 10 years ago

I am just considering another alternative: instead of removing wide character string, why not use unsigned int in place of wchar_t? Since the encoding and decoding of UTF-8 can be done without using any library functions. This will solve the portability issue of wchar_t, isn't it?

Just tested, removing WCS does not cut down the binary size much, just about 20K (2.4%). Considering that there is still need to implement character-wise methods, the reduction will be even less. So removing WCS, is not a big win as I hoped. But keeping the option of WCS will certainly make it more convenient to handle non-ascii texts. (Currently, the Dao help module also relies on WCS, by the way).

Night-walker commented 10 years ago

I must disagree as a non-ASCII-country based citizen. I usually use regexps (and any other searching and string-handling) with character strings and I'm really disappointed if it's not available (e.g. GNU coreutils, grep, sed, awk...). In these days more and more data gets internationalized and the need for out-of-box support for such data enormously grows.

When I said about ASCII-based parsing, I didn't meant that it is only suitable for ASCII input. I pointed out that in most cases only ASCII characters are essential as "anchors" in string handling. Other characters, regardless of their size, are simply passed over. For instance, XML parser doesn't care about anything which is not <, &, DOCTYPE, etc. -- it thus doesn't have to care about character size at all.

Overall, I can hardly imagine a typical case in which one can make an erroneous assumption about character size. Only something like using hard-coded non-ASCII string literals in the source together with their hard-coded sizes, etc.

dumblob commented 10 years ago

@daokoder

This will solve the portability issue of wchar_t, isn't it?

Maybe I miss something, but how does it solve the fact, that wchar_t is compiler-specific in-memory representation of all existing codepoints (i.e. how will the platform-specific w functions work with differently defined wchar_t)?

@Night-walker

For instance, XML parser doesn't care about anything which is not <, &, DOCTYPE, etc. -- it thus doesn't have to care about character size at all.

If the input for such XML parser is in UTF-8, then this statement holds. In other encodings it depends (i.e. it's false).

Overall, I can hardly imagine a typical case in which one can make an erroneous assumption about character size.

Slicing? Indexing?

I'm starting to be more and more convinced the need for string type being always an array of UTF-8 characters (codepoints) and bytearray type being always an array of bytes is inevitable. Both with the same slicing, indexing... operators and both having convenient conversion methods (str2ba(), ba2str()) as mentioned in one of my comments above. The source code will be always automatically converted to UTF-8, the "" will represent string and '' will represent bytearray.

Night-walker commented 10 years ago

But keeping the option of WCS will certainly make it more convenient to handle non-ascii texts

I doubt it will. It is almost always possible to process text in MBS in the same way as in WCS. If you have a realistic counter-examples, then just dispel my delusion, for I didn't find any such cases which I would call typical.

I am just considering another alternative: instead of removing wide character string, why not use unsigned int in place of wchar_t? Since the encoding and decoding of UTF-8 can be done without using any library functions. This will solve the portability issue of wchar_t, isn't it?

I thought about that too -- UTF-16 strings, as in Qt, Java and .Net. But that would only add extra complexity for the internal use of such non-C/C++-compatible strings, while they may appear not as useful as they may seem to.

Note that having wide strings doesn't prevent character-related errors. One can easily make a false assumption about the form of some string at some point, e.g. simply forget to convert it to WCS. Having only Unicode strings would solve this, but that is doubtfully an option for Dao.

Relatively new independent languages seem to avoid wide strings. Including Ruby, Go and Rust which are all aimed to be used for various web stuff. There is even an apparent tendency towards UTF-8, which can't exist without a reason.

dumblob commented 10 years ago

I thought about that too -- UTF-16 strings, as in Qt, Java and .Net.

.Net supports wchar_t in all variants (namely 8b, 16b and 32b) IIRC.

There is even an apparent tendency towards UTF-8, which can't exist without a reason.

Yep and I think my proposal with string and bytearray distinct types meets this need.

Night-walker commented 10 years ago

For instance, XML parser doesn't care about anything which is not <, &, DOCTYPE, etc. -- it thus doesn't have to care about character size at all. If the input for such XML parser is in UTF-8, then this statement holds. In other encodings it depends (i.e. it's false).

It always holds regardless of the encoding, because only ASCII characters constitute XML markup -- and they have identical codes in any sane encoding including UTFs and local 8-bit encodings.

Overall, I can hardly imagine a typical case in which one can make an erroneous assumption about character size. Slicing? Indexing?

These are not use cases. That's operations based on one or two indexes which are obtained from somewhere else -- usually from searching or matching. Only hard-coded values could lead to errors, but such case is rather unlikely even if you search/match a multi-byte string.

Night-walker commented 10 years ago

.Net supports wchar_t in all variants (namely 8b, 16b and 32b) IIRC.

But internally all strings there are always 16-bit.

Yep and I think my proposal with string and bytearray distinct types meets this need.

UTF-8 string is a byte array. It is not only redundant, but also quite inefficient to make index-based string operations character-wise, if that is what you imply. It is simply not a viable solution, as all indexing, slicing, index-based searching and matching would have O(n) worst performance.

daokoder commented 10 years ago

It is almost always possible to process text in MBS in the same way as in WCS. If you have a realistic counter-examples, then just dispel my delusion, for I didn't find any such cases which I would call typical.

Character access by index cannot be done efficiently with MBS. But this is probably not a very typical use case.

But that would only add extra complexity for the internal use of such non-C/C++-compatible strings,

Actually not much extra complexity.

One can easily make a false assumption about the form of some string at some point, e.g. simply forget to convert it to WCS.

This may be an issue. Having two forms of strings demands extra attention when dealing with strings passed from somewhere else. At certain point, a programer may simply forget to check and handle them properly.

After more consideration, it seems the advantage of removing WCS should outweigh the advantage of keeping it.

dumblob commented 10 years ago

and they have identical codes in any sane encoding including UTFs and local 8-bit encodings.

Exactly, considering e.g. 10 most used encodings, only first few (ASCII, UTFs and maybe one or two others) are sane, which is what I wanted to emphasize - i.e. I have no idea how many encodings are used in China (daokoder, any specifics?), Japan, Arabic countries etc. which play inherent role in IT technology. So, the sanity shouldn't be considered as it's not a factual measure, but rather a hope :)

These are not use cases.

What are those use-cases? I couldn't come up with anything else then operations with characters or methods (search, match...).

It is simply not a viable solution, as all indexing, slicing, index-based searching and matching would have O(n) worst performance.

Compared to the solution with methods for each such operation it's even worse - _O(n + overhead_of_callingmethods). If we openly say to programmer "hey, there is bytearray with all it's efficiency" and "there is also a UTF-8 string, but with O(n) operations", he'll decide what to use where and when to convert between them.

If UTF-8 string is treated like byte-array we provide him with "half-integrated" support for character strings (part of the usage would be with operators and the rest with methods). We'll allow him to modify UTF-8 anywhere on a byte-wise basis, which'll lead to flawed UTF-8 string etc.

Night-walker commented 10 years ago

and they have identical codes in any sane encoding including UTFs and local 8-bit encodings. Exactly, considering e.g. 10 most used encodings, only first few (ASCII, UTFs and maybe one or two others) are sane, which is what I wanted to emphasize - i.e. I have no idea how many encodings are used in China (daokoder, any specifics?), Japan, Arabic countries etc. which play inherent role in IT technology. So, the sanity shouldn't be considered as it's not a factual measure, but rather a hope :)

I can clarify it for you :) There is ASCII, there are local 8-bit encodings, there is UTF-7/8/16(BE/LE)/32, there are few local variable-width encoding. But I don't know of anything which is not backward-compatible to ASCII. By saying "sane" I mainly wanted to exclude some 50+ years old standards which could predate or compete with ASCII.

What are those use-cases? I couldn't come up with anything else then operations with characters or methods (search, match...).

"Operations with characters" is not a task. What's the goal? What should be accomplished and why this particular way? That's what I would call a use-case.

Compared to the solution with methods for each such operation it's even worse - O(n + overhead_of_calling_methods). If we openly say to programmer "hey, there is bytearray with all it's efficiency" and "there is also a UTF-8 string, but with O(n) operations", he'll decide what to use where and when to convert between them.

There is no practical reason why a string should use character-aware indexes for basic operations. It can just use byte indexes, retaining both the simplicity and efficiency.

If UTF-8 string is treated like byte-array we provide him with "half-integrated" support for character strings (part of the usage would be with operators and the rest with methods). We'll allow him to modify UTF-8 anywhere on a byte-wise basis, which'll lead to flawed UTF-8 string etc.

You can put garbage into a string In virtually any language. And here UTF-8 is actually beneficial in that such an act can be detected in time. So it's another point added to UTF-8 score :)

dumblob commented 10 years ago

I can clarify it for you :)

Ok, if you're so sure, I'll call you when I get into trouble with encodings in Dao sometime in the future :) Btw looking at http://en.wikipedia.org/wiki/GB_18030 (and the corresponding "See also" paragraph) makes me sure I'll contact you very soon :)

What's the goal?

Find all words containing ď starting from the fiftieth character. Count characters in a word. Find all names starting with Žď. Print all non-ASCII characters from string. Etc.

Basically one really needs character-wise handling nearly everywhere. Conversely, I can't think of any use-case where I'm interested in the underlying representation rather than it's meaning.

It can just use byte indexes, retaining both the simplicity and efficiency.

For efficiency there is bytearray or any other similar vector-structure with fixed-size elements (e.g. DataFrame). I really don't want to use UTF-8 string neither like str::get_nearest_meaningful_character_direction_right(str[50]) nor like str.get_character_on_index(50).

And here UTF-8 is actually beneficial in that such an act can be detected in time.

Such detection is inherently O(n) => not feasible doing it before (or during) each string operation.

Night-walker commented 10 years ago

Ok, if you're so sure, I'll call you when I get into trouble with encodings in Dao sometime in the future :) Btw looking at http://en.wikipedia.org/wiki/GB_18030 (and the corresponding "See also" paragraph) makes me sure I'll contact you very soon :)

Like UTF-8, GB18030 is a superset of ASCII. Just as I said.

Find all words containing ď starting from the fiftieth character.

Starting from 50th character? Not realistic. Maybe starting from some position, but not from Nth character.

Count characters in a word

OK, that's fair. But it's a matter of calling something like char_count() or iterating via iterate_char {...} -- no need to take character size into account.

Find all names starting with Žď.

If you don't make assumption that 'Žď' occupies 2 bytes, i.e. you don't use hard-coded literals with their hard-coded sizes, no problem. Even with a low-level approach:

pat = 'Žď'
str = '<some text>'
pos = 0

while (pos = str.find(pat, pos), pos > 0){
    io.writeln(str[pos : pos + %pat - 1])
    pos += %pat
}

Print all non-ASCII characters from string.

Virtually the same as above:

pat = '[^ABCDEF...]'
str = '<some text>'
pos = 0

while (match = str.match(pat, pos), match != none){
    io.writeln(str[match.start : match.end])
    pos = match.end + 1
}

Basically one really needs character-wise handling nearly everywhere.

But it doesn't mean that one has to use some special, explicitly character-wise handling everywhere. It works fine without it.

For efficiency there is bytearray or any other similar vector-structure with fixed-size elements (e.g. DataFrame). I really don't want to use UTF-8 string neither like str::get_nearest_meaningful_character_direction_right(str[50]) nor like str.get_character_on_index(50).

You won't have to do that because it's absolutely meaningless.

And here UTF-8 is actually beneficial in that such an act can be detected in time. Such detection is inherently O(n) => not feasible doing it before (or during) each string operation.

But there is no need to make it before any string operation. Character-wise operation will inevitably detect it, obviously with zero overhead. For byte-wise operation it doesn't make sense to care about characters.

Let me be a smart-ass a bit, if I may.

I spent hours in considering all possible variants of revising strings in Dao. I inspected other languages with regard to string handling (particularly UTF-8), read various discussions, manifestos, cries of pain, documentation and historical notes regarding encodings and text representations and multilingual support and whatever else.

UTF-8 is not my favorite string representation -- I tend to like UTF-16 more. But having only UTF-8/byte strings in Dao is:

generally simpler and safer than having two kinds of strings, given the implicit dualism of string
a lot more practical than having only wchar_t strings, given their inherent flaws
much simpler then having only UTF-16 strings, given high reliance of Dao on C and C++

If you think one can't live without wide strings, look at Ruby. It has only byte strings. Default operations work byte-wise. There are additional methods to (rarely) work with characters in an explicit way. Ruby exist for two decades, it works, and a lot of people are using it, and it's used extensively on the Web.

There is no real problem in handling non-ASCII text via byte strings. At least I don't see anything proving the opposite.

daokoder commented 10 years ago

I have been thinking about an approach to support fast access to characters by index, it is O(1) in the best cases and O(n) in the worst. This approach attaches an auxiliary array to the string when it is accessed by char index, and stays there as long as the string is not modified. This array store pairs of numbers (may be in short ints), where the first indicates the width (in bytes) of a char, and the second stores the number of continuous occurrences of the char. In the best cases, there are only a few pairs of such numbers, so the byte location of a char of certain index can be computed efficiently. In typical cases, there may be more, but shouldn't be many, computing the indices should still be a lot faster than O(n).

The down side is that, even if a string is never accessed by char indices, each string will still require at least 12 bytes (for 64 bits machine, 8 bytes for 32 bits) more space (two short int fields and one pointer field). Another downside is, in order to use short int fields, it may only be able to support string of a couple tens of thousand characters in the worst cases. But worst case scenarios are usually not a big concern.

Of course, another obvious approach is to pre-index all the chars before char access by index. This way a single access will be guaranteed to be O(1). But it clearly take a lot more space than the above approach, and is not feasible to store it along with the string. And if the pre-indexing is to be done each time, it would be just too expensive. Though when accessing all the chars in a single loop, this is the preferred approach, but the common scenario for this is to access all chars from the first to the last sequentially, then there is no need for any kind auxiliary array.

So my approach should be much more preferable for general cases. My only concern is the extra 12 bytes space for each string, actually only 4 bytes more with respect to the current string data structure, maybe not a big deal.

Night-walker commented 10 years ago

I wouldn't worry that much about accessing individual characters. It just doesn't happen as "I want Nth char", which I have been trying to show.

I specifically spent time on trying to find a case when one may actually need to jump over N characters forward or backward, in order to prove that byte strings are inconvenient and dangerous, which was what I believed in. I didn't find any realistic, typical case.

It is surprising, but almost all real-world tasks on byte strings cannot be compromised by multi-byte characters. You still should better not be careless about such strings, but that applies to dual (MBS/WCS) strings as well. Maybe even more so, as the only way to ensure that you have a wide string in Dao is to check it manually for any routine parameter and returned value -- which is considerably more cumbersome and error-prone than simply knowing that all strings are always MBS and don't making any assumptions about character size.

daokoder commented 10 years ago

I wouldn't worry that much about accessing individual characters. It just doesn't happen as "I want Nth char", which I have been trying to show.

Worry or not is not an issue, I am just trying to consider possible options.

BTW, the base overhead in my approach is actually 8 bytes, so the string structure would have the same size as now.

Night-walker commented 10 years ago

Well, if you intend to leave all basic string operations byte-wise, then there is nothing to argue. It is of course feasible to provide character cache, as far as it is used only when the user explicitly uses something like str.char(i), that is issues an inherently sequential-access character-wise operation.

dumblob commented 10 years ago

I'm not so much concerned about the overhead in the string structure, but rather the operations itself. I like UTF-16 the most for inner-representation (as it's fixed-size => extremely fast and simple to handle, but requires more conversions from/to the real world). The downside is slightly bigger memory footprint which could in theory make some problems on embedded systems which Dao is targeting as well. Anyway, I'd like to see a really consistent approach - If I know that the representation of string holds characters, I want to work with characters. If I know the representation of data are bytes, I want to work with bytes. Nothing more, nothing less.

If you do a study whether bytes or characters are used more, you should conside only sane attitudes of programmers and not the existing implementations, because they apparently encage programmers minds. So the question would rather be "Would you be in favour of accessing strings character-wise or byte-wise while having the possibility to convert between the character-wise and byte-wise representation whenever needed?".

I'm a big fan of UTF-8 for interfacing, but the memory-representation is something different. We've made an exploration on strings when designing memory-representation for Fog framework library (multiplatform, extremely fast vector&bitmap graphics library). Also other big frameworks (Qt etc.) use UTF-16 for inner representation.

Also keep in mind if we need to change the inner representation some time in the future, I'd rather make it character-wise strings and byte-wise byte arrays distinct from the user point of view.

daokoder commented 10 years ago

Well, if you intend to leave all basic string operations byte-wise, then there is nothing to argue. It is of course feasible to provide character cache, as far as it is used only when the user explicitly uses something like str.char(i), that is issues an inherently sequential-access character-wise operation.

Sure, that's my intention.

I like UTF-16 the most for inner-representation (as it's fixed-size => extremely fast and simple to handle, but requires more conversions from/to the real world).

UTF-16 is not fixed size. It just happens that most text has characters encodable with single code units.

Night-walker commented 10 years ago

If I know the representation of data are bytes, I want to work with bytes.

You don't "work with bytes" in case of any string which is actually used as a textual data rather then a low-level buffer. With UTF-8, you still think about characters just as for UTF-16 and any other text encoding, and that doesn't mean you require some special means to handle characters just because bytes and characters are not 1:1.

Special character-aware indexing is useless 99.99% of the time. For all that 99.99%, you can use byte indexes to work with character data. There is nothing terrible in it because those indexes don't appear from nowhere.

"Would you be in favour of accessing strings character-wise or byte-wise while having the possibility to convert between the character-wise and byte-wise representation whenever needed?".

I guess I just used the wrong words, which caused a confusion. There is simply no such question as how you want to access a string. A string is a collection of characters, not bytes. If you use a string to work with bytes, it's an array of integers which has nothing to do with the textual meaning of string.

Now, to access character data within a string, you may use indexes. But they do not have to be character indexes, because a task like "get Nth character" is simply not realistic. They can just be byte indexes. It will work just fine because those indexes are obtained from functions which work with character data. That's why you don't have to care much about what that indexes are. You can see that in my examples of string processing -- I did not have to take into account character size or its representation or encoding or whatnot.

Also other big frameworks (Qt etc.) use UTF-16 for inner representation.

"Big frameworks" uses UTF-16 becase:

they mostly appeared in the time when the sky was higher, the sun was brigther and Unicode could all be put into 16 bit
they are big and don't rely much on petty C standard library, so extra character conversion when interfacing OS API is a tiny issue, not even an issue when you have automatic destructors and other high-level stuff

dumblob commented 10 years ago

UTF-16 is not fixed size. It just happens that most text has characters encodable with single code units.

You're right. The only difference is much higher probability of less jumping in memory.

@Night-walker Well, ok, let's uncover the underlying internal representation to programmer. I'm though sure that I'll write myself a String class ASAP with all the operators etc. working codepoint-wise and I'll use it extensively instead of the default string.

Basically what you're proposing is the former gAWK attitude - the regexps were locale-aware, but everything else was just plain C-byte-array handling. And it was a really bad decision. Fortunately in cca 2008, the original POSIX behavior was resurrected and everything is byte-wise, i.e. without support for any locales.

Night-walker commented 10 years ago

Well, ok, let's uncover the underlying internal representation to programmer. I'm though sure that I'll write myself a String class ASAP with all the operators etc. working codepoint-wise and I'll use it extensively instead of the default string.

Why? What's the point? I don't know of any language which has such strings. Neither wchar_t, nor UTF-16 is simply not a code point string. Yes, you can assume that in typical UTF-16 each element is a character, but that doesn't give much in practice.

I just don't see any real ground for byte strings to be inconvenient for Unicode handling, even though they seem inconvenient. I don't know of any realistic task where the fact that a character may occupy more then one string element can be a problem. Even though it feels uncomfortable to not being able to treat elements as characters, the code does not change in 99.99% of cases.

dumblob commented 10 years ago

the code does not change in 99.99% of cases.

As I said earlier, I'm working with non-ASCII all the time and it feels really unconfortable to always have it mind and write algorithms everywhere to mangle with bytes instead of characters. I have one really huge project in my mind and it'll be written in Dao, so we can measure after few years if I rather use string or String.

daokoder commented 10 years ago

I found this wikipedia page is quite relevant to our discussion here: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings, which mentions (without citation source):

A common misconception is that there is a need to "find the nth character" and that this requires a fixed-length encoding;

which is what @Night-walker has been arguing about.

It also mentions:

Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to combining characters.

and,

That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.

Given our discussion above and taken into consideration of the points listed here, I believe choosing the UTF-8 is the right choice. Of course, current methods and operators could be generalized or new methods and operators could be added to operate on per character basis.

dumblob commented 10 years ago

Fixed-size characters can be helpful, but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character due to combining characters.

Sure, but this has nothing to do with inner representation and programmer`s interface to such representation.

I believe choosing the UTF-8 is the right choice.

I also believe UTF-8 is the right choice for inner representation for Dao, but what I argue is the programmers interface. I'm absolutely OK with saying "ourstringis just a low-level byte array, each operation works with bytes and there are no exceptions to this rule; we provide built-in explicit means to convert UTF-8 input data into our byte array (i.e.string) usingstring::conv_from_utf8(any_vector_data)method; to work with characters you have to do everything yourself, but there is a built-in typeutf8_twhichs operators % [] etc. work on character-basis and you can use casting (utf8_t)some_string with O(1) to convert back and forth"

But I'm not OK with saying "our string is UTF-8, but all operators work like it's not UTF-8 string, but rather a low-level byte-array like in C; we provide you though with funny auxiliary methods like Ruby to pollute your code with method calls to substitute the string-interface forcing you to work with characters messily (e.g. regexps will not work on byte-array basis, but those funny methods will; so you'll have to write your own regexp engine to work with ASCII to use byte-wise matching & counting etc.)".

In go, the situation is the same as I described in the first paragraph, but there is no character-wise type though. The second approach I described in the second paragraph is a Ruby-like one. Dao is high-level (like ruby), but not so verbose while being small & efficient (like go). That's the reason I want to distinguish between two types - a byte-array and a high-level character-wise string.

Night-walker commented 10 years ago

There is no need for a separate string type just to have some 3 additional operations which are very rare to use: get Nth char, get char count, iterate through chars. I can hardly imagine a case when one would have to use these operations. Any additional container would simply cause confusion.

Character-based indexes, size in characters, etc. are inefficient to work with (O(n) in the worst case for any encoding including UTF-16/32) and are virtually meaningless in the context of normal string handling.

daokoder commented 10 years ago

Any additional container would simply cause confusion.

I agree on this, having a bytearray type would confusing a number of things, for example, should it be a string or bytearray when reading a file?

However, I wonder if it would be helpful to add a subtype of string for UTF-8, so that the string stay as it is for everything, except that operations such as size %, for iteration, slicing []' and '[]=, will become character based for UTF-8 strings by default. It would also be helpful to enforce the sanity of UTF-8 encoding.

The question is, is it really necessary? Having examples of critical use for such operations would be helpful.

Night-walker commented 10 years ago

I think such subtype would not be very helpful for a simple reason. It is neither simple nor convenient to switch from one type to another when handling string. Moreover, keeping in mind string processing flow, such subtype would not solve the problem of occasional need to access a multi-byte character.

A mind experiment. You're processing a string. At some point, you need to check what the next character is related to some position in the string. Now, for some reason you are required to interpret that character as a Unicode character, which is rarely the case but can happen nevertheless.

For instance, XML specification allows certain Unicode chars to be used in names. In order to check whether you have a valid name char after <, you will have to get the whole code point. That is a case I consider more realistic then the others (albeit even here you don't actully have to descend onto the level of individual characters). And in this case a string subtype won't help you, because you don't know the character index. Even if you knew, accessing it like that would be nonsensical -- it's right here at your grasp, but you traverse the whole string up to the current position to get to it.

The most trivial solution is str.match('%w', pos) with pos being your current position. That certainly seems not as simple as str[pos], so the first thought is "UTF-8 just lost one score point". In fact, you don't need to check characters one by one in a language like Dao, I would have always used str.match('[<name start chars>][<name chars>]*', pos) to check all the name in one go.

What was that all about? Even if you have (or you think you have) to dig out individual characters, it won't be like "get Nth char" or "get char count", so such string subtype would not help much.

Moreover, I wouldn't rely on that the programmer knows what's what about encodings and characters. Thus we should not give him the tools he might misuse because of lack of knowledge in this domain. What would the user think upon encountering some special "UTF-8-proof" string? That he should always use it for UTF-8? For Unicode? For any text? It leaves vast room for misinterpretation.

When the user asks for a tool to assemble a chair, don't give him a tool box. Just hand him a screwdriver.

Night-walker commented 10 years ago

I see you added 5- and 6-byte sequences to UTF-8 encoder. You must have been misled by that scheme in the Wikipedia article.

The original specification covered numbers up to 31 bits (the original limit of the Universal Character Set). In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and about half of the 4-byte sequences.

daokoder commented 10 years ago

A mind experiment. You're processing a string. At some point, you need to check what the next character is related to some position in the string. Now, for some reason you are required to interpret that character as a Unicode character, which is rarely the case but can happen nevertheless.

I see you point, a subtype wouldn't help indeed.

I see you added 5- and 6-byte sequences to UTF-8 encoder. You must have been misled by that scheme in the Wikipedia article.

Right, thanks for pointing it out.

dumblob commented 10 years ago

Generally I'm not satisfied with two things:

we assume Dao users will almost never need to parse/handle non-ASCII text on their own (and if they do, they should write another module for Dao in other language than Dao) while also assuming that these users, who want to parse something non-ASCII (i.e. pretty much everything today), need Phd. in "encoding hell" (i.e. they're able to write their parsers using only byte-wise handling).

I personally don't have such Phd. and therefore I'd like to read input data, convert them to some properly defined type (e.g. string) and don't care any more what type of data I'm working with (i.e. don't care any more about the underlying structure), but care only about the interface (operators/methods).
making convenient operators useful only for handling packet headers, but not for any non-English text (I want to be really super-sure my colleagues and me won't write an error only because we don't have a Phd. from "encoding hell" or we've slept badly and forgot to use my_string.get_char_at(50) /a tool box/ instead of familiar my_string[50] /screwdriver/.

Btw I've looked at my small Dao script (775 lines of code) and tried to find places where character-wise and byte-wise matters (i.e. indexing and working with file-names etc.) and I was surprised, because I really found these: path[-%icon-%suffix:-%suffix-1], if (i[-1] == '/'[0]) {, get_icon_full_path(icon, fpath[0:-%fname-1]), prefix + 'submenu =', i[0][:30], prefix + 'submenu =', i[0][:30] (and maybe some other places I didn't have time to investigate right now) where the indexes/numbers are meant as character-count and not byte-count.

Yes, this example is slightly biased, because a third of this script process strings. I wrote this script few months ago while keeping in mind that the input must be in ASCII and waiting for a right time to raise this question about strings in Dao. This is a real-world example.

I just seek consistency and a high-level approach (with a strict possibility to check programmer`s errors - if I cast some variable, I see it explicitly that from that on I'm working with another type with different meaning) and hiding all the underlying implementation - therefore the proposal of two types.

Night-walker commented 10 years ago

we assume Dao users will almost never need to parse/handle non-ASCII text on their own (and if they do, they should write another module for Dao in other language than Dao) while also assuming that these users, who want to parse something non-ASCII (i.e. pretty much everything today), need Phd. in "encoding hell" (i.e. they're able to write their parsers using only byte-wise handling).

Again, you won't have to use some special means for handling strings. There are very few cases when doing str[i] for non-ASCII text is wrong, and all those cases should always be clearly visible because they are explicitly required by something. For everything else, you can work with strings as if nothing happened.

my_string.get_char_at(50)

I already said it about dozen times. Such operation is totally useless, it's never really needed, I think I wouldn't even provide such a method at all.

Btw I've looked at my small Dao script (775 lines of code) and tried to find places where character-wise and byte-wise matters (i.e. indexing and working with file-names etc.) and I was surprised, because I really found these: path[-%icon-%suffix:-%suffix-1], if (i[-1] == '/'[0]) {, get_icon_full_path(icon, fpath[0:-%fname-1]), prefix + 'submenu =', i[0][:30], prefix + 'submenu =', i[0][:30](and maybe some other places I didn't have time to investigate right now) where the indexes/numbers are meant as character-count and not byte-count.

I don't see anything suspicious here. All this code will work correctly with multi-byte characters. The only thing worth a remark is [:30] by which you apparently meant "30" printed symbols. You can't do it that simply even in UTF-16 and UTF-32, simply because there are, for instance, accute letters which consist of two code points. And I don't think they are that rare to naivly ignore them. By the way, the word "naive" can use accute 'i' even in English, did you know? And I've seen that more then once. Thus the approach with hard-coded size won't work with any string in any language, with Dao not being an exception here.

I have a code in Dao which does heavy parsing of code (possibly in UTF-8) with all kinds of operation on strings. It was written on multi-byte strings without caring about character size at all. I specifically inspected it some time ago to find cases which could lead to errors with multi-byte character handling. There wasn't even a slightest possibility for such mistakes to appear.

I understand you worries, I had them too. I still have occasional thoughts about what if I missed some specific case of string handling. But in all typical situations which I've considered so far it "just works".

dumblob commented 10 years ago

All this code will work correctly with multi-byte characters.

How should it work if I don't know how many bytes occupies the last codepoint (-1)?

You can't do it that simply even in UTF-16 and UTF-32

We're not discussing underlying encoding, but the programmer`s interface. Btw you can do that, because acute letters are always postfix and therefore if I cut it, I'll get a valid character (which is not true if I cut some byte from the codepoint-encoding itself) which'll be, and now the surprise comes, very similar to the one with acute :).

By the way, the word "naive" can use accute 'i' even in English, did you know?

No I didn't. But it looks really cool - I'll start using it :)

I have a code in Dao which does heavy parsing of code (possibly in UTF-8) with all kinds of operation on strings.

I'm aware of that and I'm really pleased you've done it and how you've done it.

It was written on multi-byte strings without caring about character size at all.

I know it, but I didn't run any tests against it yet, so I can't talk about any edge cases from the specification.

I understand you worries, I had them too. I still have occasional thoughts about what if I missed some specific case of string handling.

These sentences describe precisely my feelings :( - I should do the Phd. in encodings hell :)

daokoder commented 10 years ago

How about supporting the following:

string[i,:] or string[i,]: to get the i-th character (as string);
string[i,j]: to get the j-th byte of the i-th character (as int);
string[i,j:] or string[i,:j]: to get some byte(s) of the i-th character (as string);

It's fully compatible with the current syntax (string[i] for single byte access and string[i:j] for byte based slicing) without any potential problem (as far as I can see).

dumblob commented 10 years ago

Well, the idea looks feasible. How about just providing some prefix or postfix character for codepoint-wise handling?

byte-wise: str[i]; str[i:j]; %str; for (x in str);
codepoint-wise: $str[i]; $str[i:j]; %$str; for(x in $str)

daokoder commented 10 years ago

$str conflicts with the current syntax for enum/symbols.

dumblob commented 10 years ago

My bad. Which free ASCII characters do we have? &?

daokoder / dao

Handling of encodings of Dao strings #174