
Handling of encodings of Dao strings #174

Closed: daokoder closed this issue 10 years ago

daokoder commented 10 years ago

I have been considering changing a few things about Dao strings. As you know, Dao can store a string either as a Multi-Byte String (MBS) or a Wide Character String (WCS), and supports automatic conversion between them. Currently, when converting an MBS to and from a WCS, the MBS string is assumed to be encoded in the system encoding.

I am considering changing the assumed encoding to UTF-8. This will not only make conversion between MBS and WCS potentially much more efficient, but will probably also make Dao programs more portable.

Basically the following will be done:

These should cover most scenarios that require handling of string encodings. If I have missed something, please add it.

@Night-walker I may need your UTF-8 encoder for this :)

daokoder commented 10 years ago

& and * are available, but they just look bad.

dumblob commented 10 years ago

Still much better than nothing :(

Night-walker commented 10 years ago

All this code will work correctly with multi-byte characters. How should it work if I don't know how many bytes the last codepoint (-1) occupies?

if (i[-1] == '/'[0]) will work regardless of the character at i[-1] simply because of the ASCII-compatibility I was talking about.

We're not discussing the underlying encoding, but the programmer's interface. Btw, you can do that because acute letters are always postfix, so if I cut one off, I'll get a valid character (which is not true if I cut some byte from the codepoint encoding itself) which will be, and now the surprise comes, very similar to the one with the acute :).

You're right that you'll get a valid character instead of a row of rhombuses with question marks. But such a division of letter parts will not be valid from the point of view of the language used.

I have code in Dao which does heavy parsing of code (possibly in UTF-8) with all kinds of operations on strings. I'm aware of that, and I'm really pleased you've done it and how you've done it.

How do you know about it? I've never disclosed it; nobody but me has seen that code :) You must have assumed a C-written module for Dao like the XML parser, right? I was talking about an experimental Dao script for handling internal documentation in code.

Night-walker commented 10 years ago

string[i,:] or string[i,]: to get the i-th character (as string);

A good idea, if you meant "character at position i", which would be very convenient. I'm already tired of arguing about the uselessness of getting the Nth character.

string[i,j]: to get the j-th byte of the i-th character (as int);

Same remark as above.

string[i,j:] or string[i,:j]: to get some byte(s) of the i-th character (as string);

This is probably not needed (and too tricky). The first two operators should be enough.

How about just providing some prefix or postfix character for codepoint-wise handling?

Too obtrusive and confusing for such a seldom-needed feature. I like bi-indexing more; it's very clear about what it does.

dumblob commented 10 years ago

if (i[-1] == '/'[0]) will work regardless of the character at i[-1] simply because of the ASCII-compatibility I was talking about.

Sure, but as I said - I wrote it while keeping in mind that I'm not working with codepoints - in this case I really needed the last codepoint and not an ASCII byte.

But such division of letter parts will not be valid from the point of the language used.

The language doesn't matter - I just need valid output regardless of the language. Handling languages has nothing to do with the programmer's interface to strings (but rather with words/sentences - i.e. a much higher level) or its underlying representation.

You must have assumed a C-written module for Dao like XML parser, right?

Yes, I have :)

How about the & character as prefix? Any better ideas?

Night-walker commented 10 years ago

if (i[-1] == '/'[0]) will work regardless of the character at i[-1] simply because of the ASCII-compatibility I was talking about. Sure, but as I said - I wrote it while keeping in mind that I'm not working with codepoints - in this case I really needed the last codepoint and not an ASCII byte.

It doesn't really matter here, as the result will be the same :)

The language doesn't matter - I just need valid output regardless of the language. Handling languages has nothing to do with the programmer's interface to strings (but rather with words/sentences - i.e. a much higher level) or its underlying representation.

Handling of languages matters when you make a user interface, which is virtually the only case where you may need N characters from a string. This issue can only be completely resolved by providing means to classify characters, though it should still be possible to just get N code points when you don't care that much about correctness.

How about the & character as prefix? Any better ideas?

As I said, I like bi-indexing as it's semantically very befitting for this case. Any magical character prefixes just create visual clutter.

Night-walker commented 10 years ago

By the way, I see that DString_AppendWChar() now presumes that the encoding of the target string is UTF-8. Does this mean that all strings are supposed to be in UTF-8 then?

dumblob commented 10 years ago

It doesn't really matter here, as the result will be the same :)

How about other codepoints? How would a programmer without a PhD in "encoding hell" know that a particular byte '/' (or any codepoint - i.e. tuple of bytes) will never conflict with any byte subtuple of any valid codepoint (including those not existing yet), such that he could write something like if (str1[-1] == my_codepoint)?

Handling of languages matters when you make a user interface

I'm absolutely sure that the abstraction in programs with a user interface always provides some wrapper; in that case one works with the output of such a wrapper for further processing and only then outputs the result to the user interface. Therefore I'd presume the usage of Dao's string interface will be for writing such wrappers and using them, i.e. for example for writing those classifiers you've mentioned or any other output stuff.

As I said, I like bi-indexing as it's semantically very befitting for this case.

This approach doesn't conflict with arbitrary prefixing of strings with e.g. & - Dao could provide both means (actually, it would be really helpful to see which of these approaches is used more, to correct our estimations about the programmer-friendliness of these solutions).

Any magical character prefixes just create visual clutter.

Nobody forces you to use it :)

Night-walker commented 10 years ago

How about other codepoints? How would a programmer without a PhD in "encoding hell" know that a particular byte '/' (or any codepoint - i.e. tuple of bytes) will never conflict with any byte subtuple of any valid codepoint (including those not existing yet), such that he could write something like if (str1[-1] == my_codepoint)?

An ASCII character cannot conflict with any byte of any multibyte character. Non-ASCII characters are very unlikely to appear in the form of hard-coded literals, and it's hard to miss such a case.
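For illustration, here is a minimal C sketch of that property (a demonstration of UTF-8's design, not Dao code): every byte of a multi-byte UTF-8 sequence has the high bit set, so an ASCII byte like '/' can never occur inside one.

#include <stdio.h>

/* Minimal demonstration of UTF-8's ASCII-compatibility: lead bytes are
 * 11xxxxxx and continuation bytes are 10xxxxxx, so every byte of a
 * multi-byte character has the high bit set, while ASCII bytes are
 * 0xxxxxxx and can never be confused with them. */
int main(void)
{
    const unsigned char s[] = "na\xC3\xAFve/";  /* "naïve/"; 'ï' is 0xC3 0xAF */
    for (const unsigned char *p = s; *p; ++p)
        printf("0x%02X %s\n", *p,
               *p < 0x80 ? "ASCII byte" : "part of a multi-byte character");
    return 0;
}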

As I said, I like bi-indexing as it's semantically very befitting for this case. This approach doesn't conflict with arbitrary prefixing of strings with e.g. & - Dao could provide both means (actually, it would be really helpful to see which of these approaches is used more, to correct our estimations about the programmer-friendliness of these solutions).

Prefixing does not have any advantages over bi-indexing, so I don't see a reason for supporting it. I have seen neither of these approaches used anywhere.

Any magical character prefixes just create visual clutter. Nobody forces you to use it :)

And what about someone else's code, ah? I don't have magical glasses which would allow me to read it the way I want.

dumblob commented 10 years ago

Non-ASCII characters are very unlikely to appear in the form of hard-coded literals, and it's hard to miss such a case.

I know I'm tedious :), but where did you get the idea that str1 and my_codepoint (from if (str1[-1] == my_codepoint)) hold hard-coded literals, and that it's simultaneously easy to catch this case?

Prefixing does not have any advantages over bi-indexing

Well, we have at least 3 different places where we work with codepoints/bytes (str[i]; str[i:j]; %str; for (x in str)). Bi-indexing addresses only one of them. Second, I understood bi-indexing to mean that i is really the Nth character and not the position of a byte in s, whereas s[i,:] would return the byte sequence of the codepoint containing the byte at index i. But let's wait for @daokoder to clarify it.

I have seen neither of these approaches used anywhere.

Dao is new, with many new ideas. Besides, this is the basic principle of evolution - you do something you've never seen. On the other hand, nobody has seen everything - maybe you've missed this somewhere (there are plenty of languages for working with strings => I think this approach must already exist).

Night-walker commented 10 years ago

I know I'm tedious :), but where did you get the idea that str1 and my_codepoint (from if (str1[-1] == my_codepoint)) hold hard-coded literals, and that it's simultaneously easy to catch this case?

Because originally it was str[-1] == '/'. Adding const my_codepoint = '/' does not change anything. Declaring my_codepoint a constant non-ASCII character is clearly visible in the code; such an act can hardly be missed, and then you will just need to use string comparison for it. The same goes for when you get that string (not a code point, which would be quite weird) at run time. The content of str is irrelevant in this case.

Well, we have at least 3 different places where we work with codepoints/bytes (str[i]; str[i:j]; %str; for (x in str)). Bi-indexing addresses only one of them. Second, I understood bi-indexing to mean that i is really the Nth character and not the position of a byte in s, whereas s[i,:] would return the byte sequence of the codepoint containing the byte at index i. But let's wait for @daokoder to clarify it.

Emm, I don't really get what you wrote here, though maybe I didn't make it clear myself. str[i,] only makes sense when it gets the character at byte index i as a string, which is essentially the only real use case I see. By the way, I would use this operation for ASCII characters as well in certain cases, for simplicity.

daokoder commented 10 years ago

By the way, I see that DString_AppendWChar() now presumes that the encoding of the target string is UTF-8. Does this mean that all strings are supposed to be in UTF-8 then?

No, it does not even assume that the string being appended to is UTF-8-encoded. It simply assumes that the integer parameter represents a Unicode codepoint. Obviously, an integer cannot be represented by a single byte, so it has to be assumed to be something, and a Unicode codepoint seems a reasonable choice.
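For reference, a minimal sketch of how an integer codepoint can be serialized into UTF-8 bytes (hypothetical code illustrating that assumption, not the actual DString_AppendWChar() source):

#include <stdint.h>

/* Encode a Unicode codepoint as UTF-8; returns the number of bytes written
 * to buf (1-4), or 0 for an invalid codepoint (surrogates, > U+10FFFF). */
static int utf8_encode(uint32_t cp, char buf[4])
{
    if (cp < 0x80) {                     /* 1 byte: 0xxxxxxx */
        buf[0] = (char) cp;
        return 1;
    } else if (cp < 0x800) {             /* 2 bytes: 110xxxxx 10xxxxxx */
        buf[0] = (char) (0xC0 | (cp >> 6));
        buf[1] = (char) (0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {           /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates invalid */
        buf[0] = (char) (0xE0 | (cp >> 12));
        buf[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char) (0x80 | (cp & 0x3F));
        return 3;
    } else if (cp < 0x110000) {          /* 4 bytes: 11110xxx 10xxxxxx ... */
        buf[0] = (char) (0xF0 | (cp >> 18));
        buf[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}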

Second, I understood bi-indexing to mean that i is really the Nth character and not the position of a byte in s, whereas s[i,:] would return the byte sequence of the codepoint containing the byte at index i. But let's wait for @daokoder to clarify it.

I originally intended it to be the index of a character. But @Night-walker is probably right: it is better for it to mean the character at (around) a byte index. Iterating through characters can still be convenient; something like this would be sufficient:

for(i=0,n=%str,ch=''; i<n; i+=%ch){
    ch = str[i,:]
}

This would also be more consistent with the current syntax. Though we can still support character access by character index with a method, just in case it may really be useful for cases we are not aware of.

dumblob commented 10 years ago

The same goes for when you get that string (not a code point, which would be quite weird) at run time.

This was my point - I can imagine many ways to get a run-time value into a variable :). Starting from getting a codepoint (i.e. a sequence of bytes representing one codepoint) from the fetch(), capture(), or extract() methods, and ending with our favourite Nth character (where N is filled in at run time). Originally it was str[-1] == '/', as a real example of the problem which is, described in natural language: getting the Nth codepoint (i.e. the sequence of bytes representing one codepoint) from the end and comparing it to some other codepoint (i.e. seq...).

Anyway, do I really need to use the loop from https://github.com/daokoder/dao/issues/174#issuecomment-42183587 to get the Nth character (from the end/beginning) and the number of codepoints? I'm feeling like I'm in AWK 20 years ago :( (i.e. do it yourself).

Night-walker commented 10 years ago

By the way, I see that DString_AppendWChar() now presumes that the encoding of the target string is UTF-8. Does this mean that all strings are supposed to be in UTF-8 then? No, it does not even assume that the string being appended to is UTF-8-encoded. It simply assumes that the integer parameter represents a Unicode codepoint. Obviously, an integer cannot be represented by a single byte, so it has to be assumed to be something, and a Unicode codepoint seems a reasonable choice.

Wait, it obviously writes a UTF-8 sequence to the destination string. If the string is not required to be in UTF-8, then something like wctomb() should be used instead, no?

Night-walker commented 10 years ago

This was my point - I can imagine many ways to get a run-time value into a variable :). Starting from getting a codepoint (i.e. a sequence of bytes representing one codepoint) from the fetch(), capture(), or extract() methods, and ending with our favourite Nth character (where N is filled in at run time). Originally it was str[-1] == '/', as a real example of the problem which is, described in natural language: getting the Nth codepoint (i.e. the sequence of bytes representing one codepoint) from the end and comparing it to some other codepoint (i.e. seq...).

You're still clinging to abstract examples. There isn't even a method to get a code point from a string as an integer (and there is no need for that at all). When you're required to operate on a multibyte character, it should naturally be a string.

Anyway, do I really need to use the loop from #174 (comment) to get the Nth character (from the end/beginning) and the number of codepoints? I'm feeling like I'm in AWK 20 years ago :( (i.e. do it yourself).

You know, I'm starting to want to break something }:> Getting the Nth code point is completely, totally, utterly, ultimately useless. That code example simply illustrated that it's nevertheless doable without extra stuff.

daokoder commented 10 years ago

If the string is not required to be in UTF-8, then something like wctomb() should be used instead, no?

Then you would be assuming it to be encoded in the local encoding. Why would that be preferable to the more portable UTF-8?

Night-walker commented 10 years ago

Then you would be assuming it to be encoded in the local encoding. Why would that be preferable to the more portable UTF-8?

Because wctomb() will work for any encoding, not just UTF-8. Provided the encoding of the string matches the current locale, of course, which can be changed trivially.

Note that I will not be able to make my XML parser work with an encoding other than UTF-8 because of this issue.

daokoder commented 10 years ago

Note that I will not be able to make my XML parser work with an encoding other than UTF-8 because of this issue.

How would it not work? You can always convert the source first.

dumblob commented 10 years ago

There isn't even a method to get a code point from a string as an integer (and there is no need for that at all). When you're required to operate on a multibyte character, it should naturally be a string.

This is a big misunderstanding. I should have emphasized more in https://github.com/daokoder/dao/issues/174#issuecomment-41784150 that those real examples were written while keeping in mind that the code would work only with ASCII characters. What that code should do is what I explained in natural language in https://github.com/daokoder/dao/issues/174#issuecomment-42185842 :)

That code example simply illustrated that it's nevertheless doable without extra stuff.

Exactly, and therefore I want a consistent (in terms of using e.g. the [] operator for slicing everywhere you need slicing, regardless of the type or content of the variable) and KISS (i.e. without extra stuff) interface for the programmer :)

Anyway, I'm curious how many people will write their own wrappers for string (and I mean really directly for string, not for user-interface output), how many people will complain because of string, and how many people will make errors (and how many) in their code due to always having to keep in mind where to use the usual operators/control structures ([], for, in, %), where methods like codepoint_at(int index #{negative from the end, positive from the beginning #}), and where special hand-written loops like for(i=0,n=%str,ch=''; i<n; i+=%ch).

To me my argumentation seems logical, but a bit unrealistic, as I know we three won't make mistakes (because of this discussion and precise knowledge of the implementation details), but most programmers around me will (and they're quite good programmers).

Now it's not that important what the string interface will do; what is important is how it will look, because if we want to change it in 3 years, there will already be existing code with a combination of all three approaches ([], codepoint_at() and for(i=0,n=%...)), and nobody likes changes :) (this will also be one of the main reasons for everybody who reads this discussion and gets afraid of future changes to immediately wrap the string type itself - then we could call it readable code :)).

To be clear, I agree with you (i.e. temporarily provide codepoint interface using methods).

Night-walker commented 10 years ago

Note that I will not be able to make my XML parser work with an encoding other than UTF-8 because of this issue. How would it not work? You can always convert the source first.

It's not about the source. It's about character references, which are code points in their numeric form. They need to be added to the string as characters, for which DString_AppendWChar() is the apparent choice. But it obviously won't work for anything but UTF-8.

The same issue exists with JSON. I already made a commit which fixes it through wcrtomb().
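A minimal sketch of that kind of fix (a hypothetical helper appending into a caller-provided byte buffer, not the actual commit):

#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Append one character to a byte buffer in the current locale's encoding
 * via wcrtomb(). MB_LEN_MAX (from <limits.h>) is a compile-time bound on
 * bytes per character across all locales. Returns the number of bytes
 * appended, or 0 if the character is not representable in the locale's
 * encoding. */
static size_t append_wchar_local(char *dest, size_t pos, wchar_t ch)
{
    char buf[MB_LEN_MAX];
    mbstate_t state;
    size_t len;

    memset(&state, 0, sizeof state);
    len = wcrtomb(buf, ch, &state);
    if (len == (size_t) -1)
        return 0;               /* character not representable */
    memcpy(dest + pos, buf, len);
    return len;
}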

Night-walker commented 10 years ago

To me my argumentation seems logical, but a bit unrealistic, as I know we three won't make mistakes (because of this discussion and precise knowledge of the implementation details), but most programmers around me will (and they're quite good programmers).

Generally, the rules are simple. All characters can be treated as strings, but ASCII characters may also be used as scalar values. You can always just handle character-related stuff via strings if you don't want to remember the difference.

dumblob commented 10 years ago

You can always just handle character-related stuff via strings if you don't want to remember the difference.

I would do this even if character-wise handling were implemented, as I have absolutely no need to use scalar values, only to reference them from the string interface :)

Hey, I'm curious how my current and future Dao code will change - if it's too much, I'll definitely write some wrapper, use it (as if I had no knowledge of the underlying encoding) and compare it to not using it, to get real data to publish and discuss again new ways to handle the disclosed issues (yes, I'm so naíve /notice the acute/).

Night-walker commented 10 years ago

Hey, I'm curious how my current and future Dao code will change - if it's too much, I'll definitely write some wrapper, use it (as if I had no knowledge of the underlying encoding) and compare it to not using it, to get real data to publish and discuss again new ways to handle the disclosed issues (yes, I'm so naíve /notice the acute/).

I already assured you that code like the examples you brought up would not need to be changed. Dao makes it simpler to work with string patterns most of the time; and when you descend to the level of individual characters, it's normally something like that if (str[-1] == '/'[0]) of yours, which cannot be a problem regardless of the content of str. When you handle a multibyte character, it's normally a string, as there is simply no ordinary way to get it as an integer code point.

For example, Dao doesn't even have character functions like isdigit() and isalpha(), so you can't mistakenly write if (isalpha(str[i])) -- you will most likely write if (str.match('^%w', i) != none) or if (str[i,].match('%w') != none), or use explicit character ranges like A..z, which makes it hard to accidentally mishandle Unicode.

By the way, I was mistaken about "naive". I've seen it with a diacritic on the 'i', "naïve", which appears to be a single Unicode code point. There are still quite a lot of languages which can use acute letters, including English.

Night-walker commented 10 years ago

What's the purpose of those DString_LocateCurrentChar() calls in DString_Trim(), DString_Chop(), etc.? The loops look quite strange.

daokoder commented 10 years ago

Given the location of a byte, it returns the location of the first byte of the character, provided the given byte is part of a valid UTF-8-encoded character.
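In UTF-8 that amounts to skipping backwards over continuation bytes; a minimal sketch of the idea (hypothetical code, not the actual Dao source):

/* Return the index of the lead byte of the UTF-8 character containing the
 * byte at index i: continuation bytes always match the bit pattern 10xxxxxx. */
static int utf8_char_start(const char *chars, int i)
{
    while (i > 0 && ((unsigned char) chars[i] & 0xC0) == 0x80)
        --i;  /* step back over continuation bytes */
    return i;
}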

Night-walker commented 10 years ago

Then what's this?

for(i=0; i<self->size; ++i){
    ch = self->chars[i];
    if( DString_LocateCurrentChar( self, 0 ) == 0 ) break;
}

It's constantly being called with the same start position while the string does not change.

daokoder commented 10 years ago

That's a typo there, the 0 parameter should have been i. Thanks for spotting it.

Correction, the 0 after == should also be i.

Night-walker commented 10 years ago

DString_AppendWChar() still in question.

daokoder commented 10 years ago

Right. But I think DString_AppendWChar() should stay as it is now, because the default should assume UTF-8. Though we could add another method such as string::append( char : int, type : enum<local,utf8> = $utf8 ) to support explicitly appending characters in the local encoding.

Night-walker commented 10 years ago

I think it's better to just make a generic routine which turns an integer code point into its textual representation, i.e. a string. Similarly, there might be a need to do the opposite -- for validation purposes. For instance, writing an XML parser requires both.

I would put these routines into a separate module or namespace, just because these operations have a very specific field of usage. And also to separate them from usual string handling in order to prevent their misuse.
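A rough sketch of the validating (decoding) direction, assuming a NUL-terminated buffer (hypothetical code, not a proposed module API):

#include <stdint.h>

/* Decode one UTF-8 character starting at s, storing the codepoint in *cp;
 * returns the number of bytes consumed (1-4), or 0 on invalid input
 * (bad continuation bytes, overlong forms, surrogates, > U+10FFFF). */
static int utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        if ((s[1] & 0xC0) != 0x80) return 0;
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return *cp >= 0x80 ? 2 : 0;                  /* reject overlong forms */
    }
    if ((s[0] & 0xF0) == 0xE0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        if (*cp >= 0xD800 && *cp <= 0xDFFF) return 0;   /* reject surrogates */
        return *cp >= 0x800 ? 3 : 0;
    }
    if ((s[0] & 0xF8) == 0xF0) {
        if ((s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
            || (s[3] & 0xC0) != 0x80) return 0;
        *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return (*cp >= 0x10000 && *cp < 0x110000) ? 4 : 0;
    }
    return 0;
}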

daokoder commented 10 years ago

Currently I am trying to simplify some standard methods, and even remove some if necessary, in an effort to reduce the size of the kernel. Putting these routines into a separate module is preferred. Are you interested in making such a module?

Night-walker commented 10 years ago

Yes, I was considering a module to handle specific Unicode-related tasks.

daokoder commented 10 years ago

Great.

Night-walker commented 10 years ago

NOTE: You need to overhaul the string pattern engine, because it currently assumes that one byte is one character. You need to change string indexing to wchar_t fetching and substitute the isw* analogues for the is* functions.

daokoder commented 10 years ago

I know, I just haven't got time for it.

Night-walker commented 10 years ago

By the way, string comparison should also be character-based, as it's supposed to be lexicographical.

daokoder commented 10 years ago

This will also be on my TODO list.

daokoder commented 10 years ago

UTF-8 support in string pattern matching and character-based string comparison are done.

Night-walker commented 10 years ago

I wonder if UTF-16 surrogates can lead to erroneous matches on Windows. DString_DecodeChar() returns a code point, while functions like iswalpha() accept wint_t, which must be just a widened wchar_t (in order to represent WEOF). A decoded surrogate pair is a 32-bit value, while the isw* functions probably just check the lower word for a wchar_t value, the result of which is hard to predict.

Since isw* functions handle only single wchar_t elements and are stateless, it is safe to assume that code points represented as surrogate pairs in UTF-16 should not be matched by any of these functions (at least on Windows). Thus you can just create a simple wrapper for DString_DecodeChar() in the regex engine which returns L'\0' in case of a surrogate on Windows.
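A self-contained sketch of that guard (hypothetical code; the call to the real DString_DecodeChar() is omitted since its exact signature isn't shown here):

#include <stdint.h>
#include <wchar.h>

/* On Windows wchar_t is 16 bits, so any decoded code point above 0xFFFF was
 * a surrogate pair in UTF-16 and cannot be classified by the isw*
 * functions; map such code points to L'\0' so they never match. */
static wchar_t guard_for_isw(uint32_t codepoint)
{
    if (codepoint > 0xFFFF)
        return L'\0';   /* not representable in a single 16-bit wchar_t */
    return (wchar_t) codepoint;
}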

Night-walker commented 10 years ago

And these functions are not present on Windows: iswideogram(), iswphonogram().

Night-walker commented 10 years ago

I don't like the DString_CheckUTF8() checks in string.convert(). They may easily cause short strings to be left unchanged, because it is presumed that below 10% invalid characters is normal. A conversion function should not make assumptions -- it should do what the user asks. If the string does not happen to be in the encoding presumed by the user, then it's a logical error, and it should be exposed as early as possible rather than hidden.

daokoder commented 10 years ago

A decoded surrogate pair is a 32-bit value, while the isw* functions probably just check the lower word for a wchar_t value, the result of which is hard to predict.

I don't know if the bit patterns of surrogate pairs in UTF-16 are recognizable. If that is the case, the isw* functions should be able to handle them properly (I mean these functions would return 0 on either word). It seems to me this should be the case; otherwise, an entire UTF-16 text could easily be messed up by a single erroneous encoding unit.

Thus you can just create a simple wrapper for DString_DecodeChar() in the regex engine which returns L'\0' in case of a surrogate on Windows.

It is probably better to do this, to be on the safe side.

And these functions are not present on Windows: iswideogram(), iswphonogram().

Now iswideogram() is replaced with dao_cjk(). For iswphonogram(), I don't know if there is a simple alternative. I also wonder if the characters satisfying iswphonogram() are a subset of the characters satisfying iswalpha(). Anyway, dao_cjk() is probably sufficient to be practically useful.

I don't like the DString_CheckUTF8() checks in string.convert(). They may easily cause short strings to be left unchanged, because it is presumed that below 10% invalid characters is normal. A conversion function should not make assumptions -- it should do what the user asks. If the string does not happen to be in the encoding presumed by the user, then it's a logical error, and it should be exposed as early as possible rather than hidden.

Right, I will change that.

Night-walker commented 10 years ago

I don't know if the bit patterns of surrogate pairs in UTF-16 are recognizable. If that is the case, the isw* functions should be able to handle them properly

They should handle wchar_t properly (wint_t is just wchar_t cast to int, without any effect on surrogate pairs), while your DString_DecodeChar() returns a code point. So you should definitely catch the surrogates and turn them into zeros.

Now iswideogram() is replaced with dao_cjk(). For iswphonogram(), I don't know if there is a simple alternative. I also wonder if the characters satisfying iswphonogram() are a subset of the characters satisfying iswalpha(). Anyway, dao_cjk() is probably sufficient to be practically useful.

I wonder why you need to match phonograms/ideograms/CJK separately from alphabetic characters.

daokoder commented 10 years ago

I wonder why you need to match phonograms/ideograms/CJK separately from alphabetic characters.

The problem is that iswalpha() and the other functions, except iswideogram(), do not work on CJK characters. But %w should match characters of any language.

Night-walker commented 10 years ago

The problem is that iswalpha() and the other functions, except iswideogram(), do not work on CJK characters. But %w should match characters of any language.

Are you sure? It should work.

It is also strange that you used DString_CheckUTF8() in DaoValue_Print() -- it is an identical case.

The default (presumed) encoding of Dao strings is now de facto UTF-8. It's thus simpler, more efficient and safer not to guess the encoding, so the user can always know that strings are treated consistently and predictably. Working with any other encoding should just be a matter of an explicit string::convert() after reading from and before writing to a text stream.

daokoder commented 10 years ago

Are you sure? It should work.

On Linux, I guess. But not on Mac OSX. I seem to remember it worked before; I don't remember if it was only on Linux.

The default (presumed) encoding of Dao strings is now de facto UTF-8. It's thus simpler, more efficient and safer not to guess the encoding, so the user can always know that strings are treated consistently and predictably. Working with any other encoding should just be a matter of an explicit string::convert() after reading from and before writing to a text stream.

You forget that terminals do not always support UTF-8. So when printing out a string, conversion may be necessary; otherwise, you won't be able to display some strings properly with a simple io.writeln().

Night-walker commented 10 years ago

On Linux, I guess. But not on Mac OSX. I seem to remember it worked before; I don't remember if it was only on Linux.

Naturally, there is iswideogram() and friends on BSD systems including Mac OS X. On other systems just use iswalpha().

You forget that terminals do not always support UTF-8. So when printing out a string, conversion may be necessary; otherwise, you won't be able to display some strings properly with a simple io.writeln().

Printing to a terminal is an edge use case, though it's still the same as writing to a file. It's better to know exactly what encoding will be assumed rather than trying to guess how it will behave depending on what you have in a string. All these possible implicit changes are quite error-prone and hard to diagnose.

Night-walker commented 10 years ago

For DMBString_AppendWCS(), use the standard MB_CUR_MAX macro instead of MAX_CHAR_PER_WCHAR -- it will be more space-efficient. Also, I'd suggest differentiating the string API for working with the local encoding -- it's currently not apparent from the function names what's what.
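A sketch of the suggested sizing (hypothetical code, not the Dao source): MB_CUR_MAX is the maximum number of bytes per character in the current locale, so wcs_len * MB_CUR_MAX is a tight upper bound for that locale, whereas a hard-coded MAX_CHAR_PER_WCHAR constant has to assume the worst case across all locales.

#include <stdlib.h>
#include <wchar.h>

/* Convert a wide string to a newly allocated multibyte string in the
 * current locale's encoding, sizing the buffer with MB_CUR_MAX. Returns
 * NULL on allocation failure or an unrepresentable character. */
static char *wcs_to_mbs(const wchar_t *wcs, size_t wcs_len)
{
    size_t cap = wcs_len * MB_CUR_MAX + 1;  /* tight upper bound + NUL */
    char *mbs = malloc(cap);
    size_t pos = 0;
    mbstate_t state = {0};

    if (mbs == NULL) return NULL;
    for (size_t i = 0; i < wcs_len; ++i) {
        size_t n = wcrtomb(mbs + pos, wcs[i], &state);
        if (n == (size_t) -1) { free(mbs); return NULL; }  /* unrepresentable */
        pos += n;
    }
    mbs[pos] = '\0';
    return mbs;
}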

daokoder commented 10 years ago

Also, I'd suggest differentiating the string API for working with the local encoding -- it's currently not apparent from the function names what's what.

Which APIs are you talking about?

Night-walker commented 10 years ago

Those functions aren't exported, so it doesn't actually matter. I meant DMBString_AppendWCS() and DWCString_AppendMBS().