d0k3 / GodMode9

GodMode9 Explorer - A full access file browser for the Nintendo 3DS console :godmode:
GNU General Public License v3.0

Unicode font support #723

Closed Epicpkmn11 closed 3 years ago

Epicpkmn11 commented 3 years ago

This adds proper handling of UTF-8 and a slightly expanded font. Specifically I added Kana (Japanese) and a bit more extended Latin as that covers most of what you see on 3DS. I could also add some other simpler scripts like Cyrillic and Greek if you want. I only changed the default 6x10 font, the other ones should all behave identically, but with the proper number of ?.

I did change the font from directly using PBM images to using a RIFF format. With the PBM images I couldn't think of a good way to handle the character mappings, you'd need to like put some kinda mapping file next to it or something which just feels like a confusing way to do it imo, and knowing the character sizes in the PBM is a lot trickier too when it's not a fixed size.

All you need to do to convert an existing font is run:

python3 utils/fontriff.py input.pbm output.riff [W] [H] -m resources/fonts/cp_437.txt

where [W] is the width of the letters and [H] is the height. cp_437.txt has the mapping for code page 437, which is what the existing fonts are in. I don't think the script needs any non-standard packages and it should work in most anything Python 3, it's only slightly broken in 2. I've only tested in 3.9.6 though.

I'm open to suggestions on a better way to handle this but I figured this seemed the simplest, it just needs a PBM image, the size of the tiles, and a txt file with the character mappings and it generates a RIFF file that has all that neatly packed in. Could maybe change the extension from just generic ".riff" though, maybe ".frf" (Font RiFf) or so.
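As a rough illustration of the "neatly packed" idea, here is a minimal sketch of wrapping tile size, glyph bitmaps, and a character mapping in a RIFF container. The chunk IDs (`FONT`, `META`, `PDAT`, `PMAP`), the field layout, and the `pack_font` helper are made up for this example and are not claimed to match what `utils/fontriff.py` actually writes; only the generic RIFF framing (4-byte ID, little-endian size, even padding) is standard.

```python
import struct


def riff_chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Build one RIFF chunk: 4-byte ID, little-endian size, data, pad to even length."""
    if len(chunk_id) != 4:
        raise ValueError("chunk IDs are exactly 4 bytes")
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack("<I", len(data)) + data + pad


def pack_font(width: int, height: int, glyphs: dict[int, bytes]) -> bytes:
    """Pack glyph bitmaps keyed by codepoint into a RIFF blob (hypothetical layout)."""
    # Store glyphs sorted by codepoint so a reader can binary-search the mapping.
    codepoints = sorted(glyphs)
    meta = struct.pack("<BBH", width, height, len(codepoints))
    pixels = b"".join(glyphs[cp] for cp in codepoints)
    # 16-bit entries: this sketch only covers the Basic Multilingual Plane.
    mapping = struct.pack(f"<{len(codepoints)}H", *codepoints)
    body = (
        b"FONT"
        + riff_chunk(b"META", meta)
        + riff_chunk(b"PDAT", pixels)
        + riff_chunk(b"PMAP", mapping)
    )
    return b"RIFF" + struct.pack("<I", len(body)) + body


blob = pack_font(6, 10, {0x42: b"\x01" * 8, 0x41: b"\x02" * 8})
```

Keeping the mapping inside the same file is what removes the need for a separate mapping file next to the image.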

Another potential concern is scripts that use the character escapes: things like the arrows need to be changed from \x1B to their proper UTF-8 (←). I did that for everything in GM9 itself, but it's possible some scripts might be broken by that.

Updated font: [image: updated 6x10 font]

Not the prettiest Japanese font I've done lol, but for how tiny it is I think it's plenty readable.

Screenshots

![snap_210801232519](https://user-images.githubusercontent.com/41608708/127805308-f1d45855-93af-49ad-a24d-fa44c1deae5a.png)

The `?` are kanji, only kana are supported. Kanji is bad enough in nds-bootstrap's 7x7 font I don't even want to think about 5x7 lol.

![snap_210801232724](https://user-images.githubusercontent.com/41608708/127805314-2d7c5f0e-94ca-4547-b30e-8195d51bbbbc.png)

Titles like "Pokémon" and "LEGO® Star Wars™" also show correctly now.

Wolfvak commented 3 years ago

could you add a macro definition for something like UTF_MAX_BYTES_PER_RUNE and UTF_BUFFER_BYTESIZE(rune_count) or something? the * 4 in most buffers comes across a bit weird if you don't know about UTF and I wouldn't be surprised if someone eventually ignores it

d0k3 commented 3 years ago

Alright, I had a look. Sorry this took a few days. What I see already looks very good, and the code seems clean, too.

My thoughts:

  1. These old hex escapes (\x1B) need to keep working, even in scripts, otherwise we'll have a lot of unhappy script users very soon. Luckily, there are not many escapes that people actually use, I myself only ever used the arrows (I think). Is there some way to catch these escapes and redirect them to the correct symbol? Not necessarily for all of them, but for the most commonly used ones.
  2. Not too happy with the RIFF format, but I understand the old PBM format won't work with Unicode, and the RIFF format seems to work well. .frf may be a good extension, so these font files can be clearly recognized. Is there some way the script can just handle all these old PBMs without a provided codepage and length / width? You know, these PBMs are always 8x8 symbols, so you could determine the width / height. Maybe just try to do that when no length / width / codepage is given? It's just an idea, though.
  3. I'm not completely sure, but did you actually manage to cram all these Japanese symbols into 6x10? If so, it's pretty impressive. Maybe a variable width font would be a project for the future (just thinking aloud). If you want to, you can of course add more of those missing symbols. I don't know if stuff like Cyrillic or Hebrew is used in title names, though.
  4. That's one I will need to double-check later, but did you actually catch all places where Unicode symbols are used? Did you increase the size of all filename arrays? If some are missed, we're in for some nasty surprises at some point in the future.

That's it for now. Before I forget: Thanks a lot for your contribution, it's highly appreciated!

Wolfvak commented 3 years ago

Point 1 could simply be fixed by extending the ascii lut to the 0x10-0x80 range and force-mapping a few special entries

d0k3 commented 3 years ago

Yeah and about that - it's good the code uses Unicode now - please don't revert that. It's just important those escapes work in scripts.

Epicpkmn11 commented 3 years ago

> These old hex escapes (\x1B) need to keep working, even in scripts, otherwise we'll have a lot of unhappy script users very soon. Luckily, there are not many escapes that people actually use, I myself only ever used the arrows (I think). Is there some way to catch these escapes and redirect them to the correct symbol? Not necessarily for all of them, but for the most commonly used ones.

Is it alright to just do 0x00-0x1F? Just added doing that, doing 0x7F-0xFF though is more problematic as most of those are also valid Unicode codepoints.
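A minimal sketch of that kind of redirect, assuming the legacy bytes arrive as characters below 0x20 in otherwise valid text. Only the four CP437 arrow positions are filled in here; the `LEGACY_CONTROL_MAP` table and `remap_legacy` helper are illustrative names, not GM9's actual code, and how GM9 treats the remaining control values (newline included) is not shown in this thread.

```python
# CP437 draws glyphs at these control positions; only the arrows are listed here.
LEGACY_CONTROL_MAP = {
    0x18: "\u2191",  # up arrow
    0x19: "\u2193",  # down arrow
    0x1A: "\u2192",  # right arrow
    0x1B: "\u2190",  # left arrow
}


def remap_legacy(text: str) -> str:
    """Replace mapped legacy control characters; everything else passes through."""
    return text.translate(LEGACY_CONTROL_MAP)


print(remap_legacy("\x1b Back   Next \x1a"))
```

Because every entry sits below 0x20, the table cannot collide with printable ASCII or with any multi-byte UTF-8 sequence, which is why this range is safer to redirect than 0x7F-0xFF.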

> Not too happy with the RIFF format, but I understand the old PBM format won't work with Unicode, and the RIFF format seems to work well. .frf may be a good extension, so these font files can be clearly recognized. Is there some way the script can just handle all these old PBMs without a provided codepage and length / width? You know, these PBMs are always 8x8 symbols, so you could determine the width / height. Maybe just try to do that when no length / width / codepage is given? It's just an idea, though.

Changed the script to default to CP-437 and try to guess the width and height. 👍
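Guessing the tile size could look roughly like the sketch below. It assumes a binary PBM (`P4`) without header comments whose 256 glyphs are laid out either as a 16x16 grid or as a single column; both layouts, and the helper names, are assumptions for illustration and may not match what the script really does.

```python
def read_pbm_size(data: bytes) -> tuple[int, int]:
    """Read width and height from a binary PBM (P4) header without comment lines."""
    fields = data.split(None, 3)
    if len(fields) < 3 or fields[0] != b"P4":
        raise ValueError("not a binary PBM")
    return int(fields[1]), int(fields[2])


def guess_tile_size(img_w: int, img_h: int) -> tuple[int, int]:
    """Guess glyph width/height for a 256-glyph sheet (assumed layouts only)."""
    if img_w > 8 and img_w % 16 == 0 and img_h % 16 == 0:
        return img_w // 16, img_h // 16  # 16 columns by 16 rows
    if img_h % 256 == 0:
        return img_w, img_h // 256  # one glyph per row block, 256 blocks
    raise ValueError("cannot guess tile size, pass it explicitly")


# A 96x160 sheet holds 16x16 tiles of 6x10 pixels.
sheet = b"P4\n96 160\n" + bytes(12 * 160)
print(guess_tile_size(*read_pbm_size(sheet)))
```

Anything that fits neither layout still needs the size passed on the command line.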

I also realized it's not that hard to just leave in PBM font support, really just needs to be sorted for the binary search and have its mapping table so I also just re-added PBM font support. You'll need to convert to RIFF if you want Unicode support but now all existing fonts will still work.

Edit: Actually now that I think about it making the script default like that and re-adding PBM support kinda contradict each other's point lol, might revert the script change
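The binary search mentioned above can be sketched as follows, assuming the font keeps its codepoints in a sorted table where position `i` is also the glyph index. The `glyph_index` helper and the fallback-to-`?` behaviour are assumptions for this example, not taken from the PR's code.

```python
from bisect import bisect_left


def glyph_index(sorted_codepoints: list[int], codepoint: int) -> int:
    """Return the glyph index for a codepoint, falling back to '?' if unmapped."""
    i = bisect_left(sorted_codepoints, codepoint)
    if i < len(sorted_codepoints) and sorted_codepoints[i] == codepoint:
        return i
    # Unmapped character: look up '?' instead (assumed to exist in the font).
    j = bisect_left(sorted_codepoints, ord("?"))
    if j < len(sorted_codepoints) and sorted_codepoints[j] == ord("?"):
        return j
    return 0


# A CP437-style table is not in codepoint order, so it has to be sorted first.
table = sorted([0x41, 0x3F, 0x30B2, 0x0411])
print(glyph_index(table, 0x0411), glyph_index(table, 0x6F22))
```

Sorting once at load time is what lets an old PBM font share the same lookup path as a RIFF font.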

> I'm not completely sure, but did you actually manage to cram all these Japanese symbols into 6x10? If so, it's pretty impressive. Maybe a variable width font would be a project for the future (just thinking aloud). If you want to, you can of course add more of those missing symbols. I don't know if stuff like Cyrillic or Hebrew is used in title names, though.

I did all of Hiragana and Katakana, the two Japanese syllabaries, I didn't do any Kanji (Chinese characters) though. So Japanese text won't be perfectly readable, but should usually be enough that you can at least tell what game titles are and such. DS game titles are even entirely Kana as well so they'll all show fully correctly, I think 3DS game titles are allowed to have Kanji though.

And yeah, the 3DS was never translated to Russian, Greek, Hebrew, etc so unless you plan on making GM9 itself translatable the only benefit to having that in the font would be like file/folder names and such which is why I left it Japanese and a bit more extended Latin only. Maybe I'll make an alternate font for the resources folder that's more complete, I think this is probably fine for the default though.

It would be kinda nice to get Korean and Chinese characters in it as the 3DS is actually translated to those languages, but those'll both require at least a couple thousand characters each and 6x10 is just too small for me to want to even think about doing either lol. Maybe I'll do an 8x10 or so extra font for them, would want to find some other font to copy for them though as while I don't mind doing a hundred or two doing a few thousand is a lot ;P

> That's one I will need to double-check later, but did you actually catch all places where Unicode symbols are used? Did you increase the size of all filename arrays? If some are missed, we're in for some nasty surprises at some point in the future.

I believe so, as I think the only place that changes is the outputs of ResizeString() and TruncateString(), since everywhere else was already storing UTF-8, just not displaying it correctly. Those ones need to be bigger now though, as I changed them so that they resize/truncate to an amount of characters instead of an amount of bytes.
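To show why truncating by characters differs from truncating by bytes, here is a small sketch that walks UTF-8 bytes the way C code would, counting only lead bytes. `truncate_utf8` is a hypothetical helper and is not the actual ResizeString() or TruncateString() logic.

```python
def truncate_utf8(data: bytes, max_chars: int) -> bytes:
    """Keep at most max_chars characters of valid UTF-8 without splitting a sequence."""
    count = 0
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a character.
        if (byte & 0xC0) != 0x80:
            if count == max_chars:
                return data[:pos]
            count += 1
    return data


name = "ゲーム.nds".encode("utf-8")
print(truncate_utf8(name, 2).decode("utf-8"))
```

Cutting the same name at 2 bytes instead would leave an incomplete sequence, and the output for 2 characters is 6 bytes long, which is why the output buffers had to grow.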

d0k3 commented 3 years ago

> Is it alright to just do 0x00-0x1F? Just added doing that, doing 0x7F-0xFF though is more problematic as most of those are also valid Unicode codepoints.

Yup, that's perfectly okay. No one uses anything beyond that.

> I believe so, as I think the only place that changes is the outputs of ResizeString() and TruncateString(), since everywhere else was already storing UTF-8, just not displaying it correctly. Those ones need to be bigger now though, as I changed them so that they resize/truncate to an amount of characters instead of an amount of bytes.

If I got that right, having the proper macro everywhere isn't as critical as I thought. However, I wonder if we're wasting memory here. As you wrote, everything besides these two functions was already using UTF-8. Everything else doesn't need bigger buffers. Might not be that bad, unsure.

Anyways, sorry about this taking so long. This is a lot of code to check, so may still need a few days to fully approve it.

Epicpkmn11 commented 3 years ago

> If I got that right, having the proper macro everywhere isn't as critical as I thought. However, I wonder if we're wasting memory here. As you wrote, everything besides these two functions was already using UTF-8. Everything else doesn't need bigger buffers. Might not be that bad, unsure.

It's not a waste of memory since like if you're copying ゲーム.nds around you'll need a 14 byte buffer for the file name, not 8, but that's already taken care of by path limits I believe as those would be based on byte count not letter count. The only time you should really need to be concerned about UTF-8 is when you're doing something based on the amount of letters in something instead of the amount of bytes.
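The numbers in that example can be checked directly. The sketch below also shows a worst-case sizing helper in the spirit of the macros Wolfvak suggested earlier in the thread; the exact definition (4 bytes per character plus one terminator byte) is an assumption, not the macro that ended up in GM9.

```python
UTF_MAX_BYTES_PER_RUNE = 4  # longest UTF-8 encoding of a single codepoint


def utf_buffer_bytesize(rune_count: int) -> int:
    """Worst-case buffer size for rune_count characters plus a NUL terminator."""
    return rune_count * UTF_MAX_BYTES_PER_RUNE + 1


name = "ゲーム.nds"
char_count = len(name)                  # three kana plus ".nds"
byte_count = len(name.encode("utf-8"))  # each kana takes 3 bytes in UTF-8

print(char_count, byte_count, utf_buffer_bytesize(char_count))
```

The 14-byte figure is the 13 encoded bytes plus the terminator, while the 8 is the 7 characters plus the terminator.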

redunka-zver commented 3 years ago

> And yeah, the 3DS was never translated to Russian

It was, though?

[screenshot: rus]

Games with actual Russian titles are indeed rare (even if said games do have official Russian localization), but some certainly exist, plus there's a bunch of system apps with translated titles as well.

Epicpkmn11 commented 3 years ago

Oh, my bad, I didn't realize they did as I haven't seen people talk about the 3DS in Russian like at all and I know the DSi wasn't translated to Russian. I'll try do some Cyrillic for the font then too.

Epicpkmn11 commented 3 years ago

[image: font_6x10]

ЁЄІЇ АБВГДЕЖЗИЙКЛМНОП РСТУФХЦЧШЩЪЫЬЭЮЯ абвгдежзийклмноп рстуфхцчшщъыьэюя ёєії

Added all of the Cyrillic characters that Russian uses and also the dotted i letters for Ukrainian since why not, probably not needed but I've needed them for Ukrainian in TWiLight and bootstrap and such. I can't read any Cyrillic languages so feedback is appreciated, this is heavily based on nds-bootstrap's font with just a few tweaks to fit in 6px or just generally fit in better with the existing Latin. The one I'm least sure on is Б as I copied the style of Latin B and D, but I'm not sure if that's appropriate for Б.

I also added □△▽◆◇◎●★ since those are in the DS firmware font and can be used by DS games, I know □ is used in Polarium's title (□■ POLARIUM ■□), not 100% sure if the rest are, but they could be.

Edit: Screenshot from within GM9: snap_210814093356

Edit 2: Sorry about all the editing lol, but I realized the Я was a bit off compared to this font's Latin R so I changed it a bit, font at the top has been updated but the screenshot still has the old one.

redunka-zver commented 3 years ago

> I can't read any Cyrillic languages so feedback is appreciated

I would say that everything looks really nice and perfectly readable, too!

d0k3 commented 3 years ago

Alright, great work. I just had my second go through all of your changes. Your code does look very clean, and I'm pretty sure if I missed something, it will at least not be bad. At this point, one remaining concern is the bigger sizes of path strings, and with that, higher stack memory usage. Even that shouldn't be too bad, though. Maybe it would still make sense if @Wolfvak had a look, cause this is quite a lot of changes.

One last thing I'm wondering about: Can you think of any way to integrate all that stuff in the software keyboard? Maybe in a separate section. I guess it would be too much work for too little gain and simply not worth it.

So, I'm basically ready to merge. Is there anything you still want to add?

Cyrillic looks great, by the way!

Epicpkmn11 commented 3 years ago

Good thing you mentioned the keyboard, I hadn't thought to test that yet and it was super broken with multi-byte letters. Just fixed that up so I'm pretty sure both the keyboard and prompt should handle multi-byte letters correctly now. I don't think there's any major bugs in it but let me know if y'all find anything I need to fix. I did notice that naming a file the max length fails, but that's not a regression (v2.0.0 is the same) so I just left that as is.

It's currently still limited to ASCII for typing as at least it's not a regression now and I'm getting tired lol (it's almost 23:00 for me), but it should be relatively simple to add a Russian layout and not too bad to add a Japanese layout. Japanese does take a bit more effort though as it'll require some new special keys to modify letters (ex は+゛→ば, つ+small→っ, etc). I might try do that in the morning or maybe better to save that for another PR after this is merged.

Do you think it'd be worthwhile to add the ability to specify custom keyboard layouts somehow or just hardcode like QWERTY, some kinda Kana layout (probably like the 3DS keyboard is good), whatever's popular for Russian, and maybe something with accented latin?

I also gave the Kana in the font another look over and made several tweaks that should make it a bit easier to read. (comparison)

Other than potentially adding more keyboard layouts I think this is good to merge as far as I can tell.

Wolfvak commented 3 years ago

IMO we should just give users the option to input a custom Unicode character through its (decimal) codepoint and that's it. Most people don't really need the foreign characters.

Epicpkmn11 commented 3 years ago

Added a button to the number input mode on the keyboard that lets you convert the previous 4 characters to that Unicode codepoint if it's a valid hex number (otherwise it does nothing). (ex. 3041 → ぁ, 0041 → A, 0411 → Б)

Is that a good way to do it? It seemed like the simplest way as there's already a hex number pad and I don't think it's too cramped with the one more button there.

Didn't bother with codepoints outside of the basic multilingual plane (above U+FFFF) since there's basically no reason to ever need those here and it's simpler to just keep it 4 hex characters = 1 codepoint as that's the normal way to represent Unicode up to U+FFFF.
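The conversion described above can be sketched like this, assuming the input buffer is a plain string and the button acts on its last four characters. `convert_last_hex` is a hypothetical name, and rejecting surrogates and U+0000 is an extra assumption of this sketch rather than something stated in the thread.

```python
import string


def convert_last_hex(text: str) -> str:
    """Replace a trailing 4-digit hex number with the codepoint it names."""
    tail = text[-4:]
    if len(tail) != 4 or any(ch not in string.hexdigits for ch in tail):
        return text  # not a valid hex number: do nothing
    codepoint = int(tail, 16)
    # Assumption: skip NUL and the surrogate range, which are not usable characters.
    if codepoint == 0 or 0xD800 <= codepoint <= 0xDFFF:
        return text
    return text[:-4] + chr(codepoint)


for typed in ("3041", "0041", "0411", "save"):
    print(typed, "->", convert_last_hex(typed))
```

Fixing the length at four hex digits keeps the rule unambiguous: the button never has to guess where the number starts.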

[screenshots: snap_210821155024, snap_210821155028]

d0k3 commented 3 years ago

@Epicpkmn11 - the solution you chose for inputting Unicode chars is pretty nice. I like it!

I just had a discussion with @Wolfvak and we decided to merge, but keep it in a separate branch for now (which will then be merged to master at a later point). We just want to make sure there are no problems in as-yet unexpected places, so more testing is still required. For any bugs found, we may tag you, and if there's stuff you want to add at a later point, you could also make a PR to that new branch.

Just one small request, before merging - this is 21 commits now, which is a lot. I'm pretty sure it wouldn't make much sense to squash it into just one commit, but could you reduce the number of commits a little bit by squashing what belongs together?

Epicpkmn11 commented 3 years ago

Alright, squashed it down to 7 commits. I hope I didn't break anything, I've never really done manual squashing before, it still seems to work fine though so I think it should be fine.

d0k3 commented 3 years ago

Merged to the unicode branch. We'll use that for testing and to continue work on this. Feel free to add more pull requests, should you see the need. And, a big thank you! You'll be in the next release notes for sure.