UTF-8 -> CP437 implementation

ghost commented 2 years ago

Attempted implementation to solve #27.

I saw your YouTube video and I figured I'd take a crack at trying to solve the issue. There are quite a few layers to this, so I'll try to be concise about my thinking and approach towards a solution.

SQUNICODE isn't helpful

Like yourself, my first attempt was to use "#define SQUNICODE" as outlined in: http://www.squirrel-lang.org/squirreldoc/reference/embedding/build_configuration.html.

Rather than using "L" and array conversions, I just tried type-casting everything until the compilation errors stopped, and I ended up at the same result. I came to the conclusion that this symbol has no practical use here, for a couple of reasons.

For one, wchar_t is not a portable storage type: it is 16 bits wide (UTF-16) on Windows but 32 bits wide on most other platforms. UTF-16 itself is mostly confined to Windows APIs and a few language runtimes, and is rarely used for storing or exchanging text.

Secondly, and I figured this out just by printing some extended Latin characters, the issue isn't in the code's ability to handle Unicode. I found this table during some searching: https://www.utf8-chartable.de/unicode-utf8-table.pl. It shows how code points are encoded in UTF-8, which can represent every Unicode code point in at most 4 bytes. Without SQUNICODE, Squirrel strings are plain char arrays, so UTF-8 text passes through them byte for byte. If you use std::cout with std::hex on the bytes of the extended Latin characters, you'll get values that match what appears in the table. In summary, the issue isn't with Squirrel itself, but rather within this engine. The good news is that this makes the problem actually solvable.
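For anyone who wants to repeat that check, here is a minimal sketch. It uses snprintf rather than std::cout with std::hex, and hexBytes is a hypothetical helper name, not something from the engine or from Squirrel.

```cpp
#include <cstdio>
#include <string>

// Returns the bytes of `text` as space-separated lowercase hex.
// hexBytes is a hypothetical helper for illustration only.
std::string hexBytes(const std::string& text) {
    std::string out;
    char buf[4];  // two hex digits plus the terminating null
    for (unsigned char byte : text) {
        if (!out.empty()) out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        out += buf;
    }
    return out;
}
```

Passing the UTF-8 string for "é" (the bytes 0xC3 0xA9) gives "c3 a9", which is the value listed for U+00E9 in the table linked above.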

The actual problem

As you're probably aware, CP437 spritesheets seem to only work up to character 127. The reason is ASCII. If you view the character set: https://en.wikipedia.org/wiki/Code_page_437, and compare it to the Unicode -> UTF-8 table in the above link, you'll see that both use the same values for characters 0x20 through 0x7E. That is because both UTF-8 and CP437 keep the printable ASCII range unchanged, for backwards compatibility.

Now if we move over to text.cpp, which is where the character is drawn from the extracted sheet, this is where the problem comes together. When executing "c = (int)text[i] - start;", the typecast converts the single byte at text[i] to an integer. For characters 0x20 through 0x7E, everything is great, because UTF-8 and CP437 both match ASCII there: each character is one byte, and the value of (int)text[i] is exactly the index into the CP437 character set. Beyond those bounds, UTF-8 and CP437 go out of sync. UTF-8 encodes every non-ASCII character as two or more bytes, while CP437 uses a single byte in the range 0x80 through 0xFF. For example, "é" is the two bytes 0xC3 0xA9 in UTF-8 but the single byte 0x82 in CP437, so the per-byte cast yields two wrong indices instead of one correct one.
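The mismatch can be reproduced outside the engine with a small sketch. naiveIndices is a hypothetical stand-in for the loop in text.cpp, and it assumes `start` is the character code of the first glyph on the sheet.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mirrors the per-byte cast quoted above: one "index" per byte of the string.
// naiveIndices is hypothetical; only the cast expression comes from the text.
std::vector<int> naiveIndices(const std::string& text, int start) {
    std::vector<int> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        int c = (int)text[i] - start;
        out.push_back(c);
    }
    return out;
}
```

For "A" this returns the single value 0x41, which is also the CP437 index. For UTF-8 "é" it returns two values, one per byte, and neither is 0x82. Whether those two values come out negative depends on whether plain char is signed on the platform, which is one more reason the raw cast is fragile.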

A solution

So my solution is to change the string that enters the function. Rather than being in UTF-8, the characters stored in the string need to be in CP437, so that when "c" is extracted from the string, it is an index into the CP437 character set rather than one byte of a UTF-8 sequence. There's the hard way to do it, which would be to use bitwise operations to decode each character and some form of lookup table to convert it into CP437, but fortunately I found a wrapper for iconv: https://github.com/unnonouno/iconvpp (it may be worthwhile to double-check the license). iconv is a very handy library that does the hard part for you, and the wrapper just makes it even easier. After converting the string to CP437, fetching the current character works as intended, because its value maps directly to the character set in the spritesheet. I tested with a random CP437 sheet in SuperTux, and it seemed to work as far as extended Latin is concerned.
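For comparison, the "hard way" mentioned above can be sketched without any library. This is not what the PR does (the PR uses the iconv wrapper). The function names are hypothetical, the table covers only a handful of CP437 entries, and the decoder does not reject overlong encodings.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Decodes one UTF-8 sequence starting at text[i] and advances i past it.
// Malformed or truncated input yields U+FFFD and advances by one byte.
char32_t nextCodePoint(const std::string& text, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(text[i]);
    std::size_t extra = 0;
    char32_t cp = 0;
    if (b0 < 0x80) { ++i; return b0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else { ++i; return 0xFFFD; }
    if (i + extra >= text.size()) { ++i; return 0xFFFD; }
    for (std::size_t k = 1; k <= extra; ++k) {
        unsigned char b = static_cast<unsigned char>(text[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return 0xFFFD; }
        cp = (cp << 6) | (b & 0x3F);
    }
    i += extra + 1;
    return cp;
}

// Converts UTF-8 to CP437 bytes. Characters without an entry become spaces.
std::string utf8ToCp437(const std::string& text) {
    // Partial table for illustration: Ç ü é â ä à ñ Ñ.
    static const std::unordered_map<char32_t, unsigned char> table = {
        {0x00C7, 0x80}, {0x00FC, 0x81}, {0x00E9, 0x82}, {0x00E2, 0x83},
        {0x00E4, 0x84}, {0x00E0, 0x85}, {0x00F1, 0xA4}, {0x00D1, 0xA5},
    };
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        char32_t cp = nextCodePoint(text, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);  // ASCII passes through unchanged
        } else {
            auto it = table.find(cp);
            out += (it != table.end()) ? static_cast<char>(it->second) : ' ';
        }
    }
    return out;
}
```

With this, UTF-8 "café" (5 bytes) becomes the 4 bytes 'c', 'a', 'f', 0x82, so the per-byte cast in text.cpp lands on the right glyph.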

This isn't quite the end of the issue however, as when using the len() function from Squirrel (such as in the main menu), it will multi-count extended characters due to the multi-byte representation in UTF-8. This results in off-centered text and inappropriate spacing. So I defined a bind that uses the iconv wrapper to convert the string. Because CP437 uses exactly one byte per character, the length of the converted string is the correct character count.
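As an aside, the character count can also be obtained without converting at all, by skipping UTF-8 continuation bytes. This is a different technique from the bind in this PR, and utf8Length is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Counts characters (code points) in a UTF-8 string by counting every byte
// that is not a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t utf8Length(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char byte : text) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For UTF-8 "café" the byte length is 5 but utf8Length returns 4, which is the number a centering calculation needs.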

Limitations

The main limitation is that you're restricted to what appears in the CP437 character set. Any character that exists in Unicode but NOT in CP437 makes the iconv wrapper throw an error (this can be disabled), so those international characters won't work. But I figure this is at least a start, as a standard CP437 spritesheet will now work.

And that's it

Hopefully that all makes some sense, as my head's already spinning enough between all these character sets. Needless to say, it's a scary world to dive into. Feel free to ask any further questions.

KelvinShadewing commented 2 years ago

I see what you're saying. I honestly thought CP437 was in Unicode, which is why I was using the sprites for testing. Every other bitmap font I've seen either uses it or is limited to ASCII. So if I merge this in, it'll lose Unicode support? I'd prefer to keep Unicode so that at least I can bring back SDL_ttf or make bitmap fonts with all the necessary characters.

ghost commented 2 years ago

Well, if you do merge it in, it'll have about as much Unicode support as it already does, at least from my perspective. The iconv wrapper will throw errors for Unicode characters that have no CP437 equivalent, but that can be disabled. This error-checking only matters where the wrapper is called, which is when the text is drawn from the bitmap and in the custom string-length bind. Outside of that, you can store whatever characters you want anywhere, as long as they don't get passed through the wrapper while error-checking is enabled. If it's disabled, then the fallback in the else-statement applies, where the unsupported characters are represented as spaces.

Past that though, I think a better way to consider this whole thing is there are two different elements at play when considering how Unicode is "supported":

First being that Unicode text at all is able to be stored and moved across the code. Since I was able to get the special characters on a bitmap to display in SuperTux, that already proves that the characters are successfully going from the game to the engine in Unicode. So that's not a concern, and that degree of support doesn't change through merging in the iconv wrapper.

The second, and where the key issue really lies, is indexing into the stored bitmap from a string of text in text.cpp. More specifically, it's in how your code takes each Unicode character, stored as UTF-8, and maps it to an index on the bitmap, which typically is ordered by CP437. As-is, the code just does an integer cast of each byte and uses that as the index, which works for ASCII just fine, but breaks beyond those bounds due to the inherent differences in indexing. The role of the wrapper is to trivialize the process by converting each character in the string into the index for the bitmap prior to the for-loop iteration, so that when you perform the integer cast it will properly line up with the index in a CP437 bitmap.

Ultimately this all depends on which way you want to go with the problem, as I personally view this PR as merely a stopgap solution. One way or the other, when you're adhering to CP437 bitmaps, there will be a limitation to the Unicode support on the second point. Basically, anything past extended Latin can still be stored, but can't be drawn, as CP437 has no index for those characters. Just adding more characters to the bitmap won't cut it, as you still have to index into those characters from the string of text. You could cut out CP437 support in favor of SDL_ttf, but then you're compromising the simplicity of creating/using a bitmap that "just works". If it was my call, I'd try to "extend" CP437 with additional bitmaps that support additional characters so that you get the best of both worlds. I'll spare myself an even higher word count, but my idea would involve creating some form of lookup table that does the work in place of the iconv wrapper, so that you can map certain special characters to whichever index on whichever loaded bitmap holds them.
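As a rough sketch of that lookup-table idea: GlyphSlot and GlyphTable are hypothetical names, nothing like them exists in the engine yet, and the sheet numbering is assumed to be whatever order the bitmaps get loaded in.

```cpp
#include <unordered_map>

// Where a character lives: which loaded bitmap, and which cell on it.
struct GlyphSlot {
    int sheet;
    int index;
};

// Maps Unicode code points to glyph slots across one or more bitmaps.
// Unmapped characters fall back to the space glyph (0x20) on sheet 0.
class GlyphTable {
public:
    void map(char32_t codePoint, int sheet, int index) {
        slots_[codePoint] = GlyphSlot{sheet, index};
    }

    GlyphSlot find(char32_t codePoint) const {
        auto it = slots_.find(codePoint);
        return (it != slots_.end()) ? it->second : fallback_;
    }

private:
    std::unordered_map<char32_t, GlyphSlot> slots_;
    GlyphSlot fallback_{0, 0x20};
};
```

For example, "é" (U+00E9) could be mapped to index 0x82 on the standard CP437 sheet, while a Cyrillic letter such as U+0416 could be mapped to a cell on a second sheet. The drawing loop would then decode each code point and ask the table for a slot instead of casting bytes.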

ghost commented 2 years ago

Closing in favor of a much better solution that I crafted up. Will re-submit as a new PR.

KelvinShadewing / brux-gdk

UTF-8 -> CP437 implementation #28