Closed ivan-mogilko closed 2 years ago
the nice thing about utf-8 is that it looks, smells, and behaves the same as a standard C string. a utf8 string cannot contain a null byte, except when used as the string end marker, so most string manipulation functions (e.g. StrCmp, StrConcat etc.) will just continue working regardless of whether the input is ascii or utf-8. only when one needs to work on an actual character (or rune, as it is called in most utf-8 aware code) does one need a simple algorithm to determine how many bytes the character actually consists of.
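As a rough illustration of that byte-count rule: the length of a sequence can be read off its first byte. This is only a sketch, not the linked irc.c decoder, and the function name `utf8_seq_len` is made up for this example. It does not reject overlong lead bytes (0xC0, 0xC1) or lead bytes above 0xF4, which a strict decoder would.

```cpp
#include <cstddef>

// Returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` cannot start a sequence (a continuation byte or 11111xxx).
// Hypothetical helper for illustration only.
inline std::size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 11111xxx
}
```

Byte-oriented functions never need this; only code that steps through a string one character at a time does.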
Which brings us to the utf8 library.
here's an example decoder written by a talented coder: http://c9x.me/git/irc.git/tree/irc.c#n96 the whole utf8 code is less than 100 lines, i'd think we could just copy that into a single file rather than dealing with more libraries.
That said, we must have a new type for a unicode character, because if someone parses a string they won't want to write utf8 processing in script. For that purpose I propose a Char type, with capital "C" (similar to how the string type was superseded by String).
i'd propose to call that type wchar, analogous to wchar_t in C/C++ to avoid confusion, or maybe rune.
Yes, I keep forgetting about it, but thinking again now, it's worth listing the parts of the engine which require char-by-char extraction as opposed to byte-by-byte processing. Besides String functions that do text splitting of any kind and search for characters, font renderers are of course the first that come to mind. Line splitting too. Maybe something else. EDIT: filepath functions (because you need to search for separators and other special parts there). But these are likely all using our String class now too.
i'd propose to call that type wchar, analogous to wchar_t in C/C++ to avoid confusion, or maybe rune.
Hm, I'm afraid "rune" is way too unusual, and this is a scripting language for hobbyists, "wchar" maybe less so, at least it has got "char" in it. Another alternative is "uchar", as in "unicode char".
Another alternative is "uchar", as in "unicode char".
uchar is used in many codebases as a typedef to "unsigned char" - so this term might also be confusing.
@ericoporto i assume your downvote is because you already imagined merging & using your cool new utf8 library you've found?
We still have to find out which functions will be needed, it's not all clear at this point.
For example, utf8 uppercase and lowercase functions are also needed, on their own and for case-insensitive comparison. I doubt that is achievable with stricmp. Maybe there's something else too.
In this ticket I only outlined major points, but I haven't had time to look more closely. It's been too long since I experimented with utf in the past.
PS. That example library that I mentioned in the ticket is 1 header with like 20 functions corresponding to the standard C str functions.
I think Char is a better name to use - the only "confusing" part is that it aliases a primitive type, and so far primitive types were lowercase. But this may help point out that something is "wrong" about our contract of accessing the String by index, since it means characters now and not bytes. Char is also the name that already was in the String Script API, so this helps break fewer things. The direct byte access can be just Byte[] in the String Script API if it's useful, to avoid confusion, and we could "typedef" (a macro) the old char to byte in the script API when it makes sense to use it to differentiate.
About the AGS String class we can just adjust its tests to all utf-8 and see what breaks: https://github.com/adventuregamestudio/ags/blob/master/Engine/test/test_string.cpp
Alternatively just add new utf8 specific version of the tests and hide them behind a macro so they can optionally be built to run and use these to follow the development of the string handling codebase for the unicode support.
It's also worth noting that we may need to adjust the code that does line breaks (and probably add a test for it), I remember a person had problems with linebreaks using Chinese in the forums. So it's worth checking out this too. erh apparently it's tricky: https://devblogs.microsoft.com/oldnewthing/20160307-00/?p=93122#:~:text=In%20Chinese%20and%20Japanese%2C%20there,permitted%20after%20almost%20any%20character.
About the AGS String class we can just adjust its tests to all utf-8 and see what breaks:
Yes, I believe it's essential to do #664 prior and actually have tests running.
Found a huge can of worms trying to interact with the Windows command line: https://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using
(I think it can be ignored if only using the new windows terminal, which I think simply uses utf-8)
Found a huge can of worms trying to interact with the Windows command line
I don't see that as that much of a problem. Error messages of the engine and of the compiler that might be output to the console directly will be preponderantly in English, thus written in codepoints below 128. In that range, ASCII and UTF-8 are identical, and the output to the console will be okay. In the other direction, there's no urgent need to allow haceks and umlauts in variable or technical field names; so I don't see any urgent need to be able to read special characters from the command line under Windows (correctly).
Haceks and umlauts might come out jumbled visually when a UTF-8 string is output directly to the console, but that's something that's preponderantly bungled by the console. AFAIK, almost all programs currently live with the fact that the console might bungle the visual aspect of their output. It's on Microsoft to fix that by providing proper functionality, and if we program elaborate kludges to work around the current console limitations, that effort will all go to waste once Microsoft has taken the time to issue those fixes. AFAIK Microsoft has already improved the console in other respects, so it's reasonable to expect such fixes sooner or later. That might already be by the next half-year update to Windows 10.
All in all I feel that it would be unwise to invest much effort into handling Windows console limitations with respect to UTF-8 console input or output. It's enough when UTF-8 is handled correctly within the Engine and for input/output to the game windows and when UTF-8 files are read and written correctly.
Simply document that what's written to the console or read from the console will be in UTF-8 exclusively, and let the users deal with any codepage changes necessary so that UTF-8 strings show up okay in the console.
@fernewelten my worry was about file paths, say a user has a name with a fun character. I did find out, though, that newer versions of Windows can use UTF-8 if you set it so in the application's manifest: https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
@ericoporto the engine code already works with unicode command line arguments, here's for example how we convert them to "dos" form: https://github.com/adventuregamestudio/ags/blob/master/Common/util/path.cpp#L268
These are not utf8 though, so we will have to convert them to utf8 if that is necessary.
(But I suspect that for using WinAPI wide-char functions we need to use widestrings, I have no idea if these support utf8)
EDIT: For more information about this, here's a paragraph from WinMain article: https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-winmain#remarks
For the reference, existing allegro unicode functions work as function pointers that are set when you change the unicode format (with set_uformat).
You may glimpse how they are used in alfont, but for a brief explanation: when you call e.g. ugetc you're not calling a real function of that name, but a function assigned to a pointer, which may be e.g. utf8_getc.
These functions are stored in a table, and calling set_uformat actually switches the function pointers to one of the corresponding table rows.
The downside of this is obviously that nothing can get inlined. On the good side, such an approach potentially makes it easier to support both ascii and utf8 texts in one engine without adding many switches around the code. Regardless of how we end up with ags4, this method may for instance be used as a quick solution for ags3 utf8 translation support. Or more.
PS. And, yes, you can call set_uformat (almost) anytime, because it affects only currently running functions.
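The table-switching idea can be sketched roughly as below. This is not Allegro's actual code: `set_text_format`, `GetCharFn`, `text_length` and the two decoders are hypothetical names standing in for set_uformat, ugetc and the per-format functions, and the UTF-8 decoder assumes well-formed input.

```cpp
#include <cstddef>

// Hypothetical decoder signature: reads one character at `s`, stores its
// code in `out`, and returns the number of bytes consumed.
using GetCharFn = std::size_t (*)(const unsigned char *s, int &out);

static std::size_t ascii_getc(const unsigned char *s, int &out)
{
    out = s[0];
    return 1;
}

static std::size_t utf8_getc(const unsigned char *s, int &out)
{
    // Minimal decoder: assumes well-formed input, performs no validation.
    if (s[0] < 0x80) { out = s[0]; return 1; }
    std::size_t len = (s[0] & 0xE0) == 0xC0 ? 2 : (s[0] & 0xF0) == 0xE0 ? 3 : 4;
    int cp = s[0] & (0xFF >> (len + 1));   // keep only the payload bits of the lead byte
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (s[i] & 0x3F);    // append 6 bits per continuation byte
    out = cp;
    return len;
}

enum TextFormat { kFormatAscii = 0, kFormatUtf8 = 1 };

// One table row per format; the active pointer is swapped at runtime.
static const GetCharFn kDecoders[] = { ascii_getc, utf8_getc };
static GetCharFn g_getc = ascii_getc;

void set_text_format(TextFormat fmt) { g_getc = kDecoders[fmt]; }

// Example caller: counts characters without knowing which format is active.
std::size_t text_length(const char *text)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(text);
    std::size_t n = 0;
    int ch = 0;
    while (*p) { p += g_getc(p, ch); ++n; }
    return n;
}
```

Callers such as `text_length` contain no format checks of their own, which is the "few switches" benefit; the cost is an indirect call per character.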
A small experiment, as suggested earlier, starting with ensuring the TTF font rendering and covering utf8-aware line splitting: https://github.com/ivan-mogilko/ags-refactoring/tree/ags4--unicode1
To test you may build a game with a regular version of AGS, create a translation and save TRS in UTF-8 (without BOM). Don't forget to use a proper unicode TTF. Because of the fixed line splitting this works even if there are no whitespaces between words, and it does not break multibyte characters anymore.
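For illustration only, here is one way a splitter can avoid cutting inside a multibyte character. It is not the code from the experimental branch: `wrap_utf8` is a hypothetical name, it wraps by character count rather than by rendered pixel width, and it assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits `text` into lines of at most `max_chars` characters each.
// A break is only ever placed in front of a byte that starts a character,
// so a multibyte sequence is never cut in half.
std::vector<std::string> wrap_utf8(const std::string &text, std::size_t max_chars)
{
    assert(max_chars > 0);
    std::vector<std::string> lines;
    std::size_t line_start = 0; // byte offset where the current line begins
    std::size_t chars = 0;      // characters collected on the current line
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        // Continuation bytes (10xxxxxx) belong to the previous character.
        if ((static_cast<unsigned char>(text[i]) & 0xC0) == 0x80)
            continue;
        if (chars == max_chars)
        {
            lines.push_back(text.substr(line_start, i - line_start));
            line_start = i;
            chars = 0;
        }
        ++chars;
    }
    if (line_start < text.size())
        lines.push_back(text.substr(line_start));
    return lines;
}
```

Because it breaks on character boundaries and not on spaces, this also handles text without any whitespace between words; real line breaking for Chinese or Japanese has further rules that this sketch ignores.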
So, I guess, in theory this may be suitable for minimal unicode translation support even in ags3. But we would definitely need an encoding hint in the compiled game data, so that the engine knows when to switch modes.
That was exactly what I had in mind after looking into @mgambrell's code for Ratalaika's port! If someone needs a pixel font to test, GNU Unifont is a good one.
If someone needs a pixel font to test, GNU Unifont is a good one.
What does "pixel font" mean in this context? May it somehow be converted to WFN?
For translation-only purposes, a few things that should be done next are:
EDIT: in regards to backward compatibility, theoretically speaking, the program should default to utf-8, but switch to ascii if game data demands it. It's possible that we'd have to switch in real time when working with real system paths, as they should remain utf-8 at all times. It may be worth exposing the utf-8 functions in allegro and using these explicitly when working with paths (but I'm not yet sure how convenient that will be code-wise).
What does "pixel font" mean in this context? May it somehow be converted to WFN?
Just aesthetics, it has ttf builds available - and other formats and tools to modify it. The font has an exception clause in its license so people can use it in non-gpl software like games.
In regards to the script String API, I believe that during string operations it is essential to assume that string arguments are correct for the current mode (that is - either correct ascii or correct utf-8). This will make our work much easier. For ags4, if we assume that the game data will be strictly unicode, then this approach is only natural. In ags3, if we support both utf-8 and loading ascii games, there may be a complicated situation when the ascii game loads up a utf-8 translation (for example, if someone wanted to add a proper unicode translation to an existing old game without recompiling it). Conflicts may occur if such an ascii game has extended characters (128-255) in the game or script data - as these will not be recognized in utf-8 mode. In such a case we'll have to convert game text. But I don't want to touch this subject right now.
The String methods likely have to have all indexes and lengths (in Length, IndexOf, Substring, etc) as char indexes, not byte indexes anymore, as that is what users are likely to expect when they work with strings. For using String as a byte array they'd have to get the underlying buffer and work with it on their own. But I hope that we may have a better alternative in script later.
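To make the difference between char indexes and byte indexes concrete, here is a minimal sketch of how character-based length and substring could be layered on top of a byte buffer. The names `utf8_char_count`, `utf8_byte_offset` and `utf8_substring` are hypothetical, not existing AGS functions, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// A continuation byte has the form 10xxxxxx; every other byte starts a character.
static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of characters (code points) in a well-formed UTF-8 string.
std::size_t utf8_char_count(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char b : s)
        if (!is_continuation(b)) ++n;
    return n;
}

// Byte offset of the character with index `char_index`;
// returns s.size() if the index is past the last character.
std::size_t utf8_byte_offset(const std::string &s, std::size_t char_index)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (is_continuation(static_cast<unsigned char>(s[i]))) continue;
        if (seen == char_index) return i;
        ++seen;
    }
    return s.size();
}

// Substring addressed by character positions rather than byte positions.
std::string utf8_substring(const std::string &s, std::size_t start, std::size_t count)
{
    std::size_t from = utf8_byte_offset(s, start);
    std::size_t to = utf8_byte_offset(s, start + count);
    return s.substr(from, to - from);
}
```

Note that indexing by character becomes a linear scan instead of a constant-time lookup, which is one trade-off of keeping UTF-8 as the internal storage.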
The String methods likely have to have all indexes and lengths (in Length, IndexOf, Substring, etc) as char indexes, not byte indexes anymore
in C, strlen and friends work as/on byte indices. there's a different set of functions for working on characters, e.g. mbstowcs() and wcslen() to get the length in wide chars (wcwidth() reports the display width of a wide char in columns).
Ags script is not C, and String is not a C-string, its meaning is a sequence of characters. If we keep its functions working with byte indexes then the users will have to calculate utf-8 char indexes on their own, which is definitely not what we want.
In theory it's possible to have two sets of methods for this class, one of which will work with bytes and another with wide chars, but I'm afraid that will only confuse everyone.
I think the question here is: might users of AGS script need both unicode and strictly non-unicode String objects in a game, and what for?
In terms of use cases, the use case of a String type is to represent characters (not bytes), and provide functions to operate with searching and manipulating sequences of these characters. There's also a char array. If it's not convenient currently for some reason perhaps we could plan on expanding its support. BTW, I'm not certain about old strings, because IIRC they could be read as char array directly, in which case we better not touch these...
One "non-standard" use of String I can think of is to store values as character codes directly, basically using String as an expanding byte array, because dynamic arrays in AGS are less convenient to resize. Here we'll have different situations depending on whether we refer to old game support or new games (where presumably unicode strings are the standard). The old game case will continue to work if we don't try to convert that string to utf-8. In the new games... either we provide explicit methods which work with a String as a byte buffer (and we make it clear that they work with bytes), or we make byte arrays more convenient, and so on.
In regards to old game and backward compatibility, I'm not enthusiastic about mixing ascii and utf-8 data in one game... To elaborate again, I see following potential situations here:
Updated the experimental branch with Script API changes.
Test project (made in 3.5.1) with 1 unicode translation: test--unicode.zip
@ivan-mogilko there's a ticket for translation with unicode: https://github.com/adventuregamestudio/ags/issues/711
Yes, I wrote that ticket, but it's not a task ticket, it's more a note made for memory, linking ratalaika port.
And I cannot tell why they did the changes they did without bigger investigation. Maybe they had a different approach in mind. I did not follow their port progress to the end.
Updated the branch, now using SDL2's TEXTINPUT event to get unicode chars, and TextBox control is working (the code may be bit dirty in places).
Also created a separate experimental branch based off master, to see how it works with ags3 (likely there's no difference).
Updated with unicode paths support, most changes were necessary for Windows, but I noticed the branch also fixes this issue on linux: #662
So I think we should try to bring at least minimal necessary changes into ags3 branch.
I was not planning this at first, but seeing how allegro library has a solution for switching string mode at runtime (and alfont is using it), I'd like to try and make a PR which will add a partial unicode support to ags3 branch, while still maintaining backward compatibility with existing games.
The encoding type is essential when the string is parsed char by char, and we are not specifically looking for low-value (<=127) ascii char like latin characters or punctuation (because extended UTF characters are never encoded using values <128). By recent observation this is required in the following situations:
The basic idea is this:
One potential issue of this solution (I already mentioned it before): if the game has extended ASCII characters (>127) and loads a UTF-8 translation, then if some original strings are not translated, any extended ASCII chars in them will not be handled properly.
But imho this is a trade-off for the opportunity to create translations more easily, especially for non-Western-European languages that are more difficult to emulate with the ASCII set, like Chinese.
If there are more problems which I haven't noticed, we'll have to find them...
One thing to test is savegames generated in a previous ags3 game version and loading it on the utf8 one. This is important if people upgrade a game already released and maintained (commercial games in online stores).
If this has potential for breaks, it's important to note so when this new version is released for the developers - it would still be safe for new games, which is still valuable.
and TRS->TRA compiler detecting BOM
i'd advise against that, BOM has caused a lot of trouble in the past while maintaining my linux distro; the most glaring issue is that BOM is dependent upon host endianness. i don't recall the other issues off the top of my head but i can research git history to find clues if so requested.
@rofl0r do you know if it would be easy to, at compile time, if there's a BOM, convert the encoding to utf8 without BOM? (then the engine only keeps support to utf8 without BOM)
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom
It's as easy as deleting (skipping) first 3 bytes of a file.
BOM is not a part of UTF-8 itself, it is simply a hint in the file's header. Text itself does not need any conversion. It matters only whether the program parses the beginning of the file correctly. Then, when TRA is compiled it includes only the pure text, so engine won't even know whether the source was with bom or not.
EDIT: Of course if engine opens some text files itself, then it might be taught to detect BOM if that's necessary.
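Skipping the BOM could look like the sketch below. `strip_utf8_bom` is a hypothetical helper name, not a function of the AGS translation compiler; it only checks for the three-byte UTF-8 signature EF BB BF and leaves everything else untouched.

```cpp
#include <string>

// Returns the text with a leading UTF-8 BOM (bytes EF BB BF) removed,
// or an unchanged copy if the text does not start with one.
std::string strip_utf8_bom(const std::string &text)
{
    static const char kBom[] = "\xEF\xBB\xBF";
    if (text.compare(0, 3, kBom) == 0)
        return text.substr(3);
    return text;
}
```

Running the source through such a step before compiling would mean the compiled TRA carries the same text whether or not the TRS was saved with a BOM.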
actually, BOM doesn't make sense at all for UTF-8, it's meant for fixed-width multibyte encodings such as UCS-2, but in utf-8 the encoding is always sequential and not endian-dependent.
The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF)
one issue caused by this (which i have encountered) is that non-bom-aware text editing tools get confused by these bytes outside ascii range, and this is probably why the unicode standard itself recommends to not use BOM with utf8.
Well, my point was, TRS may have BOM, because user saved it so (we had this happen before). So i thought maybe we might as well check for it (otherwise the TRA compilation may produce corrupt output).
btw, python solves this encoding detection problem by requiring a special comment as first line in the python file: # -*- coding: utf-8 -*-
An update to this ticket: ags3 (3.6.0) got a partial unicode support, where engine has a switch changing between ascii and utf-8 interpretation of game texts, which lets it load and use utf-8 translation files (#1321).
Personally I still believe (like mentioned in this ticket's description) that ags4 should not have any switches, but run exclusively in utf-8 mode, but this cannot be done in the engine until the editor, source project formats and other tools fully support working in unicode and upgrading older ascii sources to unicode.
I believe this is generally resolved in 3.6.0 branch. Related PRs: #1321, #1542, #1544, #1548.
If there will be any problems or missing functionality found, these should be addressed separately.
In ags4 branch we may deprecate ANSI support at some point, but that's again a separate issue.
The only thing I thought might remain is the script string interface: if Char is uint8, we need to note down somewhere how to deal with utf8 in the script string API.
Unicode support
This ticket's purpose is to gather a list of components in the editor and engine which will be necessary to adjust for unicode support, as well as any potential issues.
Game data
Game data contains text as byte arrays. Nothing prevents writing and interpreting these as utf8 strings. The length fields will have to keep their meaning of size-in-bytes because there's no direct conversion between length in bytes and string length in utf8.
It's preferable to remove remaining arbitrary length limits in data format, esp. if they are small, because non-latin unicode text requires more memory. What was barely enough for latin names may be not enough for non-latin ones.
We will probably need an identification of the text format in the game data. Of course we could use just a format number, but still it may be good to have a text format id too for safety, and in case we need to support variants in the future. Each data file that may contain text should have it: main game data, room file, compiled translation (TRA), compiled script too. (UPDATE: After (#1321) TRS/TRA formats support encoding hints.)
Editor and tools
As a .NET application Editor itself is already unicode compatible at least in terms of UI. The script editor is under question though and should be checked.
The most important part is link between editor's fields and native game data and serialization functions. This is where most strings are converted from unicode into byte-char ANSI strings.
Another point of interest is the translation compiler. TRS source files themselves are just text, and could be saved in utf8. But the current TRA compiler does not understand utf8 and writes faulty data. (UPDATE: After (#1321) TRS/TRA formats support encoding hints, which the translation compiler writes and the engine is taught to check.)
Font preview in the editor should be redone. Currently it simply draws the first 256 characters of the font onto a bitmap and displays that bitmap. That would be neither optimal nor convenient for a unicode font preview.
Script format and compiler
Script compiler must support scripts in utf8. It may require an option telling whether to expect ascii or utf8 source and convert if necessary (provided this option indicates an ANSI codepage). But I think even in ascii mode the utf8 scripts (without BOM) may be maintained correctly so long as there's no unicode characters in the code itself. The texts (in double quotes) are likely viewed just as arrays of chars by compiler, so they may be accepted and written as-is.
Speaking of utf8 scripts, there's an interesting question of whether we need to also support unicode in the script names, because editor will support it in property fields by default. We could if it's not too hard to do, then users would be able to give non-latin names to the code symbols... I won't call this a priority though. But while this is not done we'll have to keep converting script names to ascii when generating script headers.
Script API
Here's the thing... currently script has a char type, which means 1 byte. I believe it may stay the same, because 1-byte variables may be useful, as is being able to access a string byte-by-byte (interpret it as a byte array).
That said, we must have a new type for a unicode character, because if someone parses a string they won't want to write utf8 processing in script. For that purpose I propose a Char type, with capital "C" (similar to how the string type was superseded by String).
The Char type will likely be an alias to int. Every API function dealing with individual text characters should be using Char as argument/return value. Notably, String.Chars[] should return Char, and we may add a separate property for getting the raw byte-char array.
Engine
(UPDATE: since v3.6.0 (#1321) we have partial unicode support, where the engine reuses allegro4 unicode support that works like a switch changing between ascii and utf-8 interpretation of game texts, so it can load and use utf-8 translation files. Many of the issues mentioned below have been addressed more or less.)
Engine is a big deal here. It has strings all over the code, many of these are bare C-strings (arrays of chars). Every operation over them should become utf8 compatible. Eventually we'll have to find and fix them all.
Which brings us to the utf8 library. IIRC Allegro 4 has unicode support which may be reused. We should still have this part intact after move to SDL2. If there are any complications or it simply is not convenient enough for us - we may use a simple utf8 string lib. For example, this one was found by ericoporto: https://github.com/sheredom/utf8.h/blob/master/utf8.h
The String class and string utilities gathered in corresponding files may be the place to start. String class is easy, we just need to replace all internal operations with utf8 counterparts.
File paths may be covered next, as most of them are not dependent on game data. On Windows we should stop using "short paths" completely (same goes for command-line arguments, which we convert to ascii). (UPDATE: in addition to #1321, Windows paths were addressed by #1385.)
Script interpreter. Assuming script format was changed to support utf8 we should expect all literal text data as utf8.
Text display. The line splitting and any other function that prepares text for display should become utf8-aware (if it does not use String class yet).
Font rendering. FreeType is supposed to support unicode, maybe there's a switch for that or not -- this is something to investigate, but I doubt we'll have much trouble with that. Alfont is a slightly different issue. I know that it supports different work modes that may also depend on Allegro 4 settings. We need to investigate how this works and if it's convenient. E.g. if we ditch Allegro 4 unicode functions, we may need to adjust alfont, or, well, even replace it. Depends on what problems we uncover. Bitmap fonts are processed by the engine itself. There's not much to do there, I think, except for replacing char-by-char iteration with a utf8 analogue and passing characters as int32 (or int16) indexes. IIRC the current WFN format allows up to 64k character slots with gaps, so it's likely possible to even have unicode-compatible bitmap fonts. Not sure how useful these will be in practice.
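As a sketch of what "a utf8 analogue of char-by-char iteration" might look like for a bitmap font renderer, the function below turns a UTF-8 string into a list of integer character codes that could serve as glyph indexes. `utf8_to_codepoints` is a hypothetical name, not an engine function; the use of U+FFFD for bad input is an assumption of this example, and continuation bytes are not validated.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 text into code points, one integer per character.
// Invalid lead bytes and truncated sequences become U+FFFD.
std::vector<std::uint32_t> utf8_to_codepoints(const std::string &text)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < text.size())
    {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;
        std::uint32_t cp = lead;                                     // ASCII by default
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else if (lead >= 0x80)          { cp = 0xFFFD; }             // stray or invalid byte
        if (i + len > text.size()) { out.push_back(0xFFFD); break; } // truncated sequence
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(text[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A renderer could then look up each returned value in the font's character table instead of indexing it with raw bytes.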
Text input. We may require unicode support for text input fields, including the text parser. This may be dealt with using existing SDL2 SDL_TEXTINPUT events. I'm not certain whether we need to detect unicode symbols for a key press event, so I leave this for future consideration.
Plugins. I'd assume most of the existing plugins view engine strings as ascii, so they will have to be updated to stay in sync. The most notable example is the IAGSFontRenderer interface, which deals with custom font rendering, and its implementations such as the SpriteFont plugin. We may add another engine API function that tells whether the engine works with ascii or utf8 strings (and provide this in the ags3 branch too).
Game project upgrade
The Game.agf tells which encoding it was written in; it's located in the xml header. Thus it's possible to upgrade the game contents to utf8 by converting from that encoding. Atm I don't see much trouble here, looks like we only need to write an upgrade procedure.
Old ASCII game support
I'll be honest, I'd have this on low priority and consider it only after the main points of unicode support are covered, because otherwise this may slow down our work. But for the sake of reference, there seem to be two approaches here.
Convert ASCII game data to UTF8 on load.
If that were possible and enough it will be an ideal outcome, but I'm afraid it's not going to be so simple. There are few notable problems here:
Support ASCII game data as an option in parallel with UTF8.
(UPDATE: this is practically what was done by #1321.)
This does not mean assuming ascii strings everywhere, but only when they come from the game data. Internal engine data does not have to support both (e.g. file paths). I think this approach suggests that there has to be a switch every time we perform an operation over a game string, depending on the utf8/ascii mode. That will somewhat complicate things and make data processing somewhat slower, but it's not clear whether this will have a significant impact. Most of the string processing in AGS is done as a response to player actions and is quite limited in each game frame, compared to drawing for example.