SC:R uses only cp949 to decode strings in TBL?

Chromowolf commented 3 years ago

(using euddraft 0.9.1.2) main.eps:

//Encode "生命" using utf-8: E7 94 9F E5 91 BD
//Decode "E7 94 9F E5 91 BD" using cp949: illegal character
//Decode "E7 94 9F E5 91 BD" using char8_t: ç”Ÿå‘½
function onPluginStart() {
    wwrite(0x660260 + 2 * $U("Terran Marine"), $T("生命")); //Modify "Terran Marine" name to "生命" using strX section
    settbl($B("Terran Ghost"), 0, "生命"); //settbl uses cp949 to encode "生命"
    dbstr_print(GetTBLAddr("Terran Vulture"), "生命\x00"); //dbstr_print uses utf-8 to encode "生命"
}

I got this:

My conclusion: SC:R uses only cp949 to decode strings in TBL, and won't use utf-8.

Is my conclusion right? Is there anything euddraft can do to make SC use utf-8 to decode strings in TBL?

armoha commented 3 years ago

See https://github.com/Buizz/EUD-Editor-3/issues/36

tl;dr: stat_txt.tbl strings can be CP949, UTF-8 or windows-1252. CP949 has highest precedence.

armoha commented 3 years ago

https://github.com/armoha/eudplib/blob/master/eudplib/eudlib/stringf/tblprint.py#L32

f_settbl always converts "string literal" inputs to CP949.

settbl("Terran Ghost", 0, ("生命\u2009").encode("UTF-8"));

Passing bytes rather than a string would work. A trailing thin space character (U+2009, e2 80 89 in UTF-8) ensures the tbl string is not decodable as CP949, forcing SC:R to always interpret it as Unicode.

Without the thin space character, 生命 in UTF-8 (e7 94 9f e5 91 bd) is decoded as windows-1252 and displayed as ç”Ÿå‘½.
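
The byte-level facts above can be checked outside the game. Below is a minimal Python sketch that uses only the standard codecs. It illustrates the encodings involved and does not model SC:R's actual detection logic; the helper name `decodes_as` is hypothetical.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is valid in `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


utf8_plain = "生命".encode("utf-8")          # e7 94 9f e5 91 bd
utf8_marked = "生命\u2009".encode("utf-8")   # same bytes followed by e2 80 89

# Neither byte sequence is valid CP949, so CP949 cannot claim them.
print(decodes_as(utf8_plain, "cp949"))
print(decodes_as(utf8_marked, "cp949"))

# Reading the plain UTF-8 bytes as windows-1252 produces the mojibake
# reported in this thread.
print(utf8_plain.decode("cp1252"))
```

The thin space is also not encodable in CP949, which is why it cannot be passed to settbl as part of a plain string literal.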

Chromowolf commented 3 years ago

function onPluginStart() {
    settbl("Terran Ghost", 0, ("生命\u2009").encode("UTF-8")); //Error
    //dbstr_print(GetTBLAddr("Terran Vulture"), "生命\u2009\x00"); //OK
}

[epScript] Compiling "main.eps"...
[Error -2] Module "main" Line 2 : General syntax error
[Error 6298] Module "main" Line 4 : Block not terminated properly.

But dbstr_print(GetTBLAddr("Terran Vulture"), "生命\u2009\x00"); is OK.

BTW, euddraft doesn't seem to have f_encode?

armoha commented 3 years ago

py_str("生命\u2009").encode("UTF-8") would work then ;(

Chromowolf commented 3 years ago

It works. Thanks a lot!!!! So, it turns out that: Starcraft uses "utf-8" as the highest priority when decoding strings in string section, but uses "cp949" as the highest priority when decoding strings in stat_txt.tbl? This inconsistency is so weird :(

armoha commented 3 years ago

Starcraft uses "utf-8" as the highest priority when decoding strings in string section, but uses "cp949" as the highest priority when decoding strings in stat_txt.tbl?

Yeah, exactly xD IMO it's because back in the 1.16 EUD map days, tbl editing (CP949) was so common that SC:R had to prioritize supporting it. In contrast, only a few maps edited STR content in-game, so SC:R could move on to Unicode while breaking only a small number of maps.

Chromowolf commented 3 years ago

Thank you. (I should have asked you this question 2 years ago, lol.) Our group of map makers is very grateful for your help. Is there any way we could sponsor or donate to you? (Like Patreon or any other means.)

Chromowolf commented 3 years ago

What about if I wanna use settblf instead?

const ss = Db("生命");
function onPluginStart() {
    //sprintf(GetTBLAddr("Terran Ghost"), "{:s}\u2009", ss); // I know this is OK
    settblf("Terran Ghost", 0, py_str("生命\u2009").encode("UTF-8")); //Got error: expected str, got bytes
}

armoha commented 3 years ago

What about if I wanna use settblf instead?

const sm = EPD(Db("生命"));
// f_settblf, f_settblf2(tbl, offset, format_string, *args)
settblf("Terran Ghost", 0, "{:t}\u2009", sm);

armoha commented 3 years ago

Our group of map makers is very grateful for your help. Is there any way we could sponsor or donate to you? (Like Patreon or any other means.)

Thank you for the support! I opened my BuyMeACoffee just now. https://www.buymeacoffee.com/armoha

Chromowolf commented 3 years ago

const sm = EPD(Db("生命"));
// f_settblf, f_settblf2(tbl, offset, format_string, *args)
settblf("Terran Ghost", 0, "{:t}\u2009", sm);

UnicodeEncodeError: 'cp949' codec can't encode character '\u2009' in position 2: illegal multibyte sequence.

I don't know how to force settblf to encode the format string in utf-8, cuz py_str can't apply here.

armoha commented 3 years ago

const sm = EPD(Db("生命"));
// f_settblf, f_settblf2(tbl, offset, format_string, *args)
const unicode_tbl = py_str("\u2009").encode("UTF-8");
settblf("Terran Ghost", 0, "{:t}{}", sm, unicode_tbl);

I think I should add a settblf(encoding="utf-8") option.

Chromowolf commented 3 years ago

Thank you. This is a work-around, but looks awkward, and seems to lose the convenience of format string... If the format string is "some utf-8 char{:c} utf-8 char{:s}, utf8 char xxx {:n}", then I must do

const uni01 = py_str("some utf-8 char ").encode("UTF-8");
const uni02 = py_str(" some utf-8 char").encode("UTF-8");
const uni03 = py_str(" utf8 char xxx ").encode("UTF-8");
const uni = py_str("\u2009").encode("UTF-8");
settblf("Terran Ghost", 0, "{}{:c}{}{:s}{}{:n}{}", uni01, playerID, uni02, someAddr, uni03, playerID, uni);

which takes the same effort as

settbl("Terran Ghost", 0, "some utf-8 char", playerID, " utf-8 char", someAddr, " utf8 char xxx ", playerID, "\u2009");

And you know this for sure.... So I think the encoding="utf-8" is necessary if there is no other work-around.
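
The manual splitting described above is mechanical, so it could in principle be scripted. As a rough illustration only, the Python sketch below uses the standard `string.Formatter` to cut a format string into UTF-8 encoded literal chunks and the field specs between them. The function `split_format_utf8` is hypothetical, is not part of eudplib, and says nothing about how settblf itself parses its format string.

```python
from string import Formatter


def split_format_utf8(fmt: str):
    """Split `fmt` into UTF-8 literal chunks and field specs (hypothetical helper)."""
    parts = []
    for literal, field, spec, _conversion in Formatter().parse(fmt):
        if literal:
            # Literal text between placeholders, encoded as UTF-8 bytes.
            parts.append(("bytes", literal.encode("utf-8")))
        if field is not None:
            # A placeholder such as {:c}; keep only its format spec.
            parts.append(("field", spec))
    return parts


parts = split_format_utf8("生{:c}命{:n}\u2009")
for kind, value in parts:
    print(kind, value)
```

Each "bytes" entry corresponds to one of the uni01, uni02, ... constants in the work-around, and each "field" entry to one positional argument.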

(Hope one day the world could be unified to utf-8)

armoha commented 3 years ago

@Chromowolf Updated to 0.9.1.4 https://github.com/armoha/euddraft/releases/tag/v0.9.1.4

settblf("Terran Ghost", 0, "{0:c}生命{0:n}", player, encoding="utf-8");
// write "<playerColor>生命<playerName>\u2009\0" on Terran Ghost tbl

f_settbl: Added encoding parameter (default: "CP949")

f_settbl(tbl, offset, *args, encoding="cp949")
f_settblf(tbl, offset, format_string, *args, encoding="cp949")

encoding specifies which encoding str arguments will use. When encoding is "utf-8", f_settbl or f_settblf appends "\u2009\0" at the end of the tbl string, to ensure SC:R always interprets it as a Unicode entry. (The partial edit functions f_settbl2 and f_settblf2 do not add any null terminator or thin space character.) It is the user's responsibility to use the same encoding in other types of arguments: bytes, Db, etc.
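
To make the appended suffix concrete, here is a small Python sketch of the bytes such a call would be expected to write for a plain string argument. `build_tbl_payload` is a hypothetical helper and not eudplib code; it also assumes the CP949 case ends with a single null terminator, which the note above implies but does not state outright.

```python
THIN_SPACE = "\u2009"


def build_tbl_payload(text: str, encoding: str = "cp949") -> bytes:
    """Encode `text` as a complete tbl entry (hypothetical helper, not eudplib code)."""
    normalized = encoding.lower().replace("_", "-")
    if normalized in ("utf-8", "utf8"):
        # The thin space makes the entry undecodable as CP949.
        text += THIN_SPACE
    # Assumption: every full entry is terminated by one null byte.
    return text.encode(encoding) + b"\x00"


print(build_tbl_payload("生命", "utf-8").hex(" "))
print(build_tbl_payload("生命", "cp949").hex(" "))
```

With "utf-8" the payload ends in e2 80 89 00; with "cp949" the thin space is left out, since CP949 cannot encode it.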

Chromowolf commented 3 years ago

f_settbl(tbl, offset, *args, encoding="cp949")

function onPluginStart() {
    settbl(1, 0, "abc", encoding = "cp949");
}

euddraft 0.9.1.4 : Simple eudplib plugin system
 - This program follows MIT License. See license.txt
 - Press SHIFT to force check update while opening euddraft.
 - Daemon mode. Ctrl+C to quit. R to recompile (windows only)
... ...
[Error -2] Module "main" Line 2 : General syntax error
[Error 6298] Module "main" Line 3 : Block not terminated properly.

armoha commented 3 years ago

:( Use settbl(1, 0, "abc", encoding=py_str("cp949")); in epScript.

Chromowolf commented 3 years ago

Thx. Sorry for this stupid question :P

armoha commented 3 years ago

Nah, it's my fault in the documentation; I forgot that the A = "B" pattern isn't allowed in epScript yet.

armoha commented 9 months ago

Closing as completed; please re-open or open a new issue if you have any questions.

FYI: since euddraft 0.9.9.9, the [dataDumper] plugin detects whether binary data is encoded in CP949 or UTF-8 and sends this info to eudplib. Related commits: https://github.com/armoha/euddraft/commit/e6dcc9b974e67792c25933e4f781c445a2b66d7f and https://github.com/armoha/eudplib/commit/e2f148cac4c07b2e2eb768332f5c880d9c64c0c4

armoha / euddraft

SC:R uses only cp949 to decode strings in TBL? #23