BGforgeNet / Fallout2_Restoration_Project

Fallout 2 Restoration Project, updated
https://forums.bgforge.net/viewforum.php?f=39

The Vietnamese msg-files seem to have character encoding errors #301

Closed LNx2 closed 20 hours ago

LNx2 commented 2 days ago

The Vietnamese msg-files seem to not encode certain characters correctly. (I assume they are trying to use CP1258 aka WINDOWS-1258)

For example this section of the Vietnamese po-file:

#: dialog/abbill.msg:100
msgid "You see trader Bill."
msgstr "Bạn gặp thương gia Bill."

results in this line in data/text/vietnamese/dialog/abbill.msg:

{100}{}{B?n g?p thương gia Bill.}

The bytes look like this (in hexadecimal):

7B 31 30 30 7D 7B 7D 7B 42 3F 6E 20 67 3F 70 20 74 68 FD F5 6E 67 20 67 69 61 20 42 69 6C 6C 2E 7D

So ạ and ặ get 'encoded' as 0x3F aka ? instead of the proper 2-byte sequences using a combining diacritical mark.

burner1024 commented 2 days ago

Yes, I mentioned that in chat. Didn't know if that's a charset or a font problem. If hex values are the same, then I guess windows-1258 doesn't support these characters at all?

$ echo "Bạn gặp thương gia Bill." | iconv -t cp1258
Ba�n g��p th��ng gia Bill.

You'll need to figure out what's the correct way to convert it. Fixing the converter itself shouldn't be much of an issue.

As for translations using multibyte encodings, they also need a specially hacked fallout2.exe. I know the Chinese and Japanese translations are using one, but I'm not sure where to get it.

LNx2 commented 1 day ago

When it comes to encoding issues it's better to look at the bytes directly in my opinion. Your (Linux) shell usually won't display letters that aren't UTF-8 encoded (all ASCII is valid UTF-8, that's why those letters are displayed even after the conversion to cp1258). Something like this is usually clearer (at least for me 😉):

$ echo "Bạn gặp thương gia Bill." | iconv -t cp1258 |hd
00000000  42 61 f2 6e 20 67 e3 f2  70 20 74 68 fd f5 6e 67  |Ba.n g..p th..ng|
00000010  20 67 69 61 20 42 69 6c  6c 2e 0a                 | gia Bill..|
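To illustrate why the terminal shows replacement characters, here is a small Python sketch (Python is my choice for illustration, not something used in the dump above). It feeds the bytes from the hd dump, minus the trailing newline, through a UTF-8 decoder the way a UTF-8 terminal roughly would, and then through the CP1258 decoder.

```python
# Bytes copied from the hd dump, without the trailing 0x0a newline.
cp1258_bytes = bytes.fromhex(
    "4261f26e2067e3f270207468fdf56e67206769612042696c6c2e"
)

# A UTF-8 decoder sees the non-ASCII bytes as broken sequences and
# substitutes U+FFFD (the replacement character) for each of them.
as_utf8 = cp1258_bytes.decode("utf-8", errors="replace")
print(as_utf8)

# Decoding with the matching codec recovers the text, with the tone
# marks as separate combining characters.
as_cp1258 = cp1258_bytes.decode("cp1258")
print(as_cp1258)
```

The ASCII bytes survive either way, which is why only the accented letters look damaged in the terminal.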

Here the "ạ" gets properly encoded as 0x61 for an ASCII "a" plus 0xF2 for "◌̣", a so-called combining diacritical mark that modifies the previous letter. The other previously corrupted letters work analogously. So Windows-1258 can encode all those Vietnamese letters, but for some it uses two bytes.
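A minimal Python sketch of that decomposition, in case it helps whoever maintains the converter. The function name `to_cp1258` and the `TONE_MARKS` set are my own illustrative names, and this is only one possible approach, not necessarily what iconv does internally: try the single-byte encoding first, and otherwise split off the tone mark and encode it as a separate byte.

```python
import unicodedata

# The five combining tone marks CP1258 has bytes for:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def to_cp1258(text: str) -> bytes:
    """Encode text as CP1258, splitting off tone marks where needed."""
    out = bytearray()
    for ch in unicodedata.normalize("NFC", text):
        try:
            out += ch.encode("cp1258")  # letter has its own single byte
            continue
        except UnicodeEncodeError:
            pass
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if c not in TONE_MARKS)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        # Recompose the letter with its breve/circumflex/horn, then
        # append the tone mark as a separate combining character.
        rebuilt = unicodedata.normalize("NFC", base) + tones
        out += rebuilt.encode("cp1258")  # still raises if unsupported
    return bytes(out)


encoded = to_cp1258("B\u1ea1n g\u1eb7p th\u01b0\u01a1ng gia Bill.")
print(encoded.hex(" "))
```

For the sample line this yields the same bytes as the iconv dump (without the trailing newline). Characters that CP1258 cannot represent at all still raise an error rather than being replaced silently.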

The first issue (and the one I was trying to raise here 😉) is that whatever program converts the strings from the po-files to the msg-files seems to drop any Unicode character that would need 2 bytes in CP1258 and just replaces it with an ASCII question mark "?", I suppose as some kind of fallback error mechanism.
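For what it's worth, the exact bytes from the msg-file can be reproduced in Python. This is only a guess at the mechanism, assuming the converter is written in Python and encodes with `errors="replace"`; I have not checked its source.

```python
# "Bạn gặp thương gia Bill." written with escapes so the precomposed
# code points are unambiguous.
text = "B\u1ea1n g\u1eb7p th\u01b0\u01a1ng gia Bill."

# Strict encoding refuses the precomposed letters CP1258 has no byte for.
try:
    text.encode("cp1258")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" silently substitutes "?" (0x3F) for them instead.
lossy = text.encode("cp1258", errors="replace")
print(lossy)
```

The "ươ" survives because both letters have single-byte code points in CP1258 (0xFD and 0xF5), while "ạ" and "ặ" do not.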

The second issue (that I was blissfully unaware of until you mentioned it 😉) is that using 2 bytes for a character might not work with the current Fallout engine.

There are different encodings for Vietnamese that might or might not work better here.

But since I don't speak any Vietnamese (I just stumbled across this issue by poking around some files) nor have any knowledge about how the engine handles the msg strings, I'm not the right person to choose a 'least bad encoding'.

Is the chat you mentioned discord? And would it be better to raise the issue there, to get people with the necessary skills to take a look?

burner1024 commented 20 hours ago

Sorry, I assumed you were with the Vietnamese team. They requested the language in Discord, yes, but it's bridged to other linked chats, too.

Short version: windows-1258 is crap, and Python doesn't properly handle it. I added a hack to msg2po. It appears to work, although I'm not entirely sure it's correct; without actual Vietnamese knowledge that's hard to tell. Files will be updated with the next action run.