garrynewman / GWEN

Abandoned: GWEN - GUI Without Extravagant Nonsense.
MIT License

Unicode support. #3

Open Jookia opened 11 years ago

Jookia commented 11 years ago

(Haha, yes, of course I would talk about this, but this has actually affected people's decisions about using GWEN.)

This isn't one specific issue but a few, and they all revolve around a couple of decisions and implementations.

wchar_t and char usage

So the last time I talked about this I was an avid wchar_t hater. I still don't use it, but GWEN is designed around it, so that's all fine. What does bother me is how the API uses a mix of wchar_t and char. For instance, in Platform.h we have a mix of String& and UnicodeString&. We have Unicode clipboard text (woo!) but no Unicode window titles. It's not consistent.

I'd suggest moving entirely to wchar_t and converting internally. The only place I'd see char being useful is for UTF-8 APIs like SDL and Allegro, which take char* for everything, and that's internal. The alternative is using char with UTF-8 throughout, which sucks on Windows, but I have heard that BlackPhoenix has an entire UTF-8 port of GWEN.

Current conversion code.

Simply put: the current conversion code is broken. It's easily fixable by pulling in a header-only library and using that. This affects things right now, like Allegro's UTF-8 API not being used properly when drawing text.

garrynewman commented 11 years ago

Are you sure this is still an issue? It should use TextObject everywhere now - which should give us the best of both worlds?

https://github.com/garrynewman/GWEN/blob/master/gwen/include/Gwen/TextObject.h

Jookia commented 11 years ago

Ah, yes, I forgot to note that in my original issue:

TextObject is an awesome idea, but the conversion code is still broken. Also, it doesn't specify which encoding 'string' is in (it should be UTF-8).

I'm not sure it's overkill, but TextObjects aren't used outside the GUI stuff. As a result, the file dialog doesn't support Unicode filenames, since it uses String rather than UnicodeString& or TextObject&.

garrynewman commented 11 years ago

Yeah there are a couple of places where I still need to make it use TextObject.

fuwaneko commented 11 years ago

Allegro which take char* for all their stuff

No, they don't. There is a set of Unicode functions and ALLEGRO_USTR. If the font has all the required glyphs you can even mix, say, Japanese and Russian in one string. I patched the GWEN Allegro renderer to get Unicode support, but it relies on the Windows function WideCharToMultiByte, because GWEN's UnicodeToString does not work properly on Windows (std::locale does not accept any Unicode codepage). I'm too lazy to test whether gcc's std::locale works properly and make a cross-platform patch, but I still think it is unacceptable to rely on any system setting like std::locale with no arguments, which returns the global locale and could be, well, anything. Or at least put somewhere in the docs that the user must set the global locale (std::locale::global) to something that makes sense.

Great UI toolkit, btw.

Jookia commented 11 years ago

It doesn't? Eh, I must've missed something then. Isn't ALLEGRO_USTR just stuff for UTF-8? Edit: Ah, I see the problem. When I wrote char* I meant UTF-8, as it uses char as its data type.

Locales shouldn't be used for converting between data types.

fuwaneko commented 11 years ago

When I write char* I meant UTF-8

char is just a byte, it can contain anything, and it has nothing to do with strings, locales, etc. You are free to interpret bytes as you wish, and Allegro provides a convenient set of USTR functions and ALLEGRO_USTR "data type" to treat bytes as UTF-8 strings. The problem was that I had to convert from wchar_t* to variable-length encoded char* with correct codepage (in my case UTF-8, Allegro wants it), and then feed it to al_ustr_new.

Locales shouldn't be used for converting between data types.

No, you're wrong. We're converting not just between data types but between string representations as well; that's why a codepage is required. Even wchar_t differs from compiler to compiler, not to mention that a multibyte string can be in UTF-8, Shift-JIS or whatever else. Currently UnicodeToString just converts from Unicode to the system single-byte encoding, dropping anything that doesn't belong to that encoding, so it's absolutely not written correctly, as UTF-8 is a variable-length encoding. And with Microsoft compilers WideCharToMultiByte must be used, as std::locale does not accept UTF-8 there. Boost.Locale or other similar libraries could be used as a cross-platform way to handle it.

Unicode, locales and text are not as simple as you may think, especially if you are outside of ASCII world.

Jookia commented 11 years ago

char is just a byte, it can contain anything, and it has nothing to do with strings, locales, etc. You are free to interpret bytes as you wish, and Allegro provides a convenient set of USTR functions and ALLEGRO_USTR "data type" to treat bytes as UTF-8 strings. The problem was that I had to convert from wchar_t* to variable-length encoded char* with correct codepage (in my case UTF-8, Allegro wants it), and then feed it to al_ustr_new.

I know, that's what I was saying. char was in contrast to wchar_t.

No, you're wrong. We convert not just between data types, but between string representations as well, that's why codepage is required. Even wchar_t differs from compiler to compiler, not to mention that multibyte string can be in UTF-8, Shift-JIS or whatever else. Currently UnicodeToString just converts from Unicode to system single-byte encoding dropping anything that not belongs to this encoding, so it's absolutely not correctly written, as UTF-8 is a variable-length encoding. And for Microsoft compilers WideCharToMultiByte must be used, as std::locale does not accept UTF-8 in this case.

Ugh, I'm being vague, aren't I? I meant 'C++ locales shouldn't be used for converting between data types', specifically because of the way they're designed. Locales are great at showing which codepages users are using; GB18030 is pretty much the one I use to test with, aside from UTF-8. But this is a GUI library, and we're concerned with the API and converting between UnicodeString and String, which are typedefs of std::wstring and std::string. std::wstring is whatever the compiler wants, std::string is currently undefined, and I'm proposing that we treat std::string in the API as UTF-8 encoded strings.

It's possible to use Boost.Locale as a cross-platform way to handle it or other similar libraries.

It is, but it's a bit overkill for what we're doing: converting between wchar_t and UTF-8. utf8cpp can do this, provided you know what wchar_t's encoding is at compile time, which is UTF-16 on Windows and UTF-32 on Unix.

GWEN doesn't have any i18n stuff yet (think BiDi, UIM support) and Unicode support is hard enough.

Unicode, locales and text are not as simple as you may think, especially if you are outside of ASCII world.

While I live in the ASCII world, I do feel I have a grasp on what I'm coping with. In my projects I treat strings as UTF-8 everywhere and use locales to decide how to compose output and operate on input. I'm still learning, so if you have any resources I could review it'd be nice.

fuwaneko commented 11 years ago

It's a very long comment. But in short: wchar_t everywhere, and let third parties do all conversions (remove UnicodeToString and TextObject). Renderers are also considered third-party and must accept only wchar_t. I can fork and try to do this in my spare time, as I want to use GWEN for my project anyway and I want it to be portable.

I meant 'C++ locales shouldn't be used for converting between data types'

Again: we are not just converting from wchar_t to char. We are converting between completely different string representations. wchar_t is a fixed-width wide character type, treated as a 16-bit integer by Microsoft compilers and a 32-bit integer by gcc. And it's not bound to Unicode at all: the compiler can do whatever it wants, not the OS; wchar_t is compiler-dependent. char is just a single byte, completely data-independent and compiler-independent; it just has a confusing name. Even int can differ in size between compilers, but not char. It is always a single byte.

Now, on how to store Unicode data in an array of chars. There is a thing called "variable-length encoding" (MS ambiguously calls it multibyte), and the most common such encoding is UTF-8, where a single Unicode code point (google it) can be represented by 1 to 4 bytes. UTF-16 is also a variable-length encoding, but it uses one or two 16-bit units. Another example would be the Japanese Shift-JIS encoding, which is not Unicode, so the idea isn't limited to Unicode. You can store UTF-8 in a char array as flawlessly as any other data.

Now on single-byte encodings. You probably know that you can represent at most 256 characters with a single-byte encoding. What if you have, for example, Russian text stored in a wchar_t array (Unicode) and your rendering function accepts only single-byte encoded strings? You need a codepage. This is just a table that says: "this byte represents this character". For Russian on Windows it is codepage 1251. So how do you convert? You use a codepage, via std::locale or a similar mechanism, and for each Unicode code point in the wchar_t representation it returns the corresponding CP1251 byte. If there is no such byte, it can either return some "default" character (e.g. a space) or throw an error.

You absolutely need to know the codepage if you convert from Unicode to a single-byte encoding. But you don't need it (obviously) when converting between different Unicode representations. In our case I think the best way is to forget about single-byte encodings entirely. They suck. And as I already mentioned, the current UnicodeToString does exactly what it shouldn't: it converts Unicode to a single-byte encoding while ignoring the codepage.

What is the most portable way of storing Unicode strings? Variable-length encoded UTF-8 in char*. Note that UTF-8 was made with storing in mind, not processing. UTF-8 processing requires complex state machines.

What is the simplest way? wchar_t. It is just not portable across compilers. If you compile everything on all platforms using GCC — you're okay. If you do not store/load external data — you're okay. And you can always use converters.

It doesn't matter, because you will always have to deal with conversions. And GWEN's conversion is just done the wrong way: it converts from Unicode to a single-byte encoding without caring about the codepage at all. While that may work on *nix with something like en_US.UTF-8 as LC_ALL, it is not portable and will not work on Windows.

How it is usually done: the library sticks with some kind of internal representation. It can be anything you want; just make sure it is consistent across the whole library. Anything that comes in or goes out must be converted between the internal representation and what the third party wants.

What I propose: stick with wchar_t everywhere, remove UnicodeToString completely (or implement it properly if removal is too hard), require all renderers to accept wchar_t, and patch AllegroRenderer so it converts from wchar_t to UTF-8 char* and then to ALLEGRO_USTR itself (utf8cpp looks nice). This way the end user has to feed GWEN wchar_t, and conversion is up to them. There's nothing wrong with that, as it already works that way.

Switching to multibyte char* would, in my opinion, be more of a pain in the ass, as GWEN already uses wchar_t extensively.

There are char16_t and char32_t for unambiguous UTF-16/32 representation in the new C/C++ standards, but it's too early to use them.

In text drawing the most common internal encoding is UTF-32. FreeType, for example, expects UTF-32 code points for glyph lookup. It is also the most speed-efficient.

There are also locales with different date, currency, etc. representations. And RTL text. But to begin with, it would be good to just fully support Unicode internally. Actually, I only wanted the Allegro renderer to work correctly, and look what happened :)

I'm proposing that we treat std::string in the API as UTF-8 encoded strings.

While it's possible, it would render all the std::string functions invalid, as they treat std::string as a single-byte encoded string. That's why wchar_t is better: all the standard functions just work. For std::string in UTF-8 we would need an additional library.

Jookia commented 11 years ago

Again: we are not just converting from wchar_t to char. We are converting between completely different string representations. wchar_t is a constant multi-byte character which is treated as 16 bit integer by Microsoft compilers and 32 bit integer by gcc. And it's not binded to Unicode at all. Compiler can do whatever it wants. Not OS. wchar_t is compiler dependent. char is just a single byte, completely data-independent and compiler-independent, it just has a confusing name. Even int can be different in size between compilers, but not char. It is always a single byte.

I know, but I'm talking about the real world here, where wchar_t has de facto encodings.

What is the most portable way of storing Unicode strings? Variable-length encoded UTF-8 in char*. Note that UTF-8 was made with storing in mind, not processing. UTF-8 processing requires complex state machines.

I wouldn't say complex, but go on.

What is the simplest way? wchar_t. It is just not portable across compilers. If you compile everything on all platforms using GCC — you're okay. If you do not store/load external data — you're okay. And you can always use converters.

char is portable and stores UTF-8 fine, and wchar_t can require complex state machines too, so it only brings downsides.

What I propose: stick with wchar_t everywhere, remove UnicodeToString completely (or implement it properly if it's too hard), require all renderers to accept wchar_t, patch AllegroRenderer so it converts from wchar_t to UTF-8 char* and then to ALLEGRO_USTR itself (utf8cpp looks nice). This way end user will have to feed GWEN with wchar_t and conversion is up to them. It's nothing wrong with it as it already works that way.

Switching to multibyte char* would be in my opinion more pain in the ass as GWEN already uses wchar_t extensively.

That's what I said.

While it's possible it will render all std::string functions invalid as they treat std::string as a single-byte encoded string. That's why wchar_t is better — all standard functions just work. For std::string in UTF-8 we would need additional library.

Which is why you convert them to internal encodings.

fuwaneko commented 11 years ago

That's what I said

Well, I'll try to make a patch.

Jookia commented 11 years ago

Thinking more about this, I think the entire API should be moved to wchar_t and TextObject dumped. We should then include some utility for widening/narrowing (assuming narrow means UTF-8, so it's a 1:1 conversion, just different encodings) that works with the API.

fuwaneko commented 11 years ago

I think the entire API should be moved to wchar and TextObject dumped

Yes, that's what I think too.

We should then include some utility for widening/narrowing

I'm not sure about this. I mean, it's a UI toolkit, not ICU or iconv. Besides, there are stdlib functions to convert between wchar_t and multibyte UTF-8, so there is no need to reinvent the wheel.

Jookia commented 11 years ago

Besides, there are stdlib functions to convert between wchar_t and multi-byte UTF8. So there is no need for reinventing a wheel.

A light wrapper around Windows' conversions and Linux's equivalents would do fine; something with a defined input and output is what I want. I haven't seen anything in the stdlib that converts between wchar_t and UTF-8.

fuwaneko commented 11 years ago

A light wrapper around Windows' conversions and Linux's stuff would do fine

I don't see any possible use of it inside GWEN. And for outside it's completely useless as well.

something with a defined input and output is what I want

I'm not sure what you are talking about. If we decide on wchar_t, then the only input GWEN accepts is wchar_t, and the only output it gives is wchar_t. All conversions are up to third parties (including renderers).

I haven't seen anything in the stdlib that converts between wchar_t and UTF-8

wcstombs/mbstowcs with something like en_US.UTF-8 set as the locale, or C++-style codecvt converters. On Windows, however, you can't use a UTF-8 locale, so you have to use MultiByteToWideChar.

Jookia commented 11 years ago

I don't see any possible use of it inside GWEN. And for outside it's completely useless as well.

patch AllegroRenderer so it converts from wchar_t to UTF-8 char*

I'm not sure what you are talking about. If we decide on wchar_t, then the only input GWEN accept is wchar_t, and only output it gives is wchar_t. All conversions are up to third-party (including renderers).

I agree, but it'd be nice to have a narrow/widen function bundled with GWEN for ease of use with the API. Or maybe a pointer in the wiki to where you can get one?

wcstombs/mbstowcs with something like en_US.UTF-8 set as locale

Wouldn't work on Windows, they don't have UTF-8 locales.

or C++-style codecvt converters.

Please no. When narrowing/widening strings we'd have to create a codecvt instance (which is horrible to write and understand), imbue it into a stringstream, feed in the input, and then return the buffer.

karanik commented 11 years ago

A bit late to the discussion here, but I have a fork that merges String and UnicodeString into a single representation, based on wchar_t or char, selectable at compile time. The motivation is to reduce memory usage and fragmentation, and to make it easier to port to console platforms, which don't always support the complete C++ standard library. Also, by using a single string type, it's possible to replace it with a custom string if necessary.

It is an extensive, albeit simple, patch with some rough edges still, but if there is interest in merging it into GWEN proper, I'll be happy to make the necessary changes.

Jookia commented 11 years ago

That is quite interesting! I'd like to see it if possible, for critique and review.