DFHack / dfhack

Memory hacking library for Dwarf Fortress and a set of tools that use it
Other
1.84k stars 462 forks source link

Translation::capitalize() doesn't work correctly for CP437 characters #2855

Open johncosker opened 1 year ago

johncosker commented 1 year ago

See: https://github.com/DFHack/dfhack/issues/2850

The original issue was resolved without needing to fix the capitalization function, creating this to track the unicode issue.

lethosor commented 1 year ago

Congrats, you found some code that hasn't been touched in almost 11 years :)

In all seriousness, this is because capitalize() uses toupper(), which presumably uses the system locale, which is very unlikely to be CP437. It could be debated whether this is the correct behavior or not. Given that most of the Translation module is intended to work with DF names, I think handling CP437 would be reasonable.

...I just realized there is significantly more detail in #2850 so I will move there.

johncosker commented 1 year ago

Congrats, you found some code that hasn't been touched in almost 11 years

Feels like I'm at my job lol

If this behavior isn't incorrect than no need to keep this open. I'll leave it to you to decide.

Since it probably is usually just for display then it shouldn't really matter.

lethosor commented 1 year ago

Based on your comment in https://github.com/DFHack/dfhack/issues/2850#issuecomment-1423575684, DF presumably has a way to capitalize non-ASCII letters, so I think it's worthwhile to change on our end too.

Dealing with encodings is tricky if we want to support Unicode, though - we don't really have a standard for handling that.

Bumber64 commented 7 months ago

DF does it with a simple switch on the char byte. I'll post my reverse-engineered function here since it's short enough:

string string_capitalize(const string &str) // DF's function always capitalizes all words
{
    string out = str;
    int32_t bracket_level = 0;
    bool success = false; // First capitalization succeeded; true implies i > 0

    for (size_t i = 0; i < out.size(); i++)
    {
        if (out[i] == '[')
        {
            bracket_level++;
            continue;
        }
        else if (out[i] == ']')
        {
            bracket_level--;
            continue;
        }
        else if ( bracket_level > 0 || // Skip anything in brackets
            success && out[i-1] != ' ' && out[i-1] != '"' && // Once first capitalization succeeds, no new capitalization until any space or quote,
            (out[i-1] != '\'' || i == 1 || (out[i-2] != ' ' && out[i-2] != ',')) ) // or until apostrophe immediately following space or comma
        {
            continue;
        }

        out[i] = toupper(out[i]);

        switch(out[i])
        {
            case '\x81': // 'ü'
                out[i] = '\x9A'; // 'Ü'
                break;
            case '\x82': // 'é'
                out[i] = '\x90'; // 'É'
                break;
            case '\x84': // 'ä'
                out[i] = '\x8E'; // 'Ä'
                break;
            case '\x86': // 'å'
                out[i] = '\x8F'; // 'Å'
                break;
            case '\x87': // 'ç'
                out[i] = '\x80'; // 'Ç'
                break;
            case '\x91': // 'æ'
                out[i] = '\x92'; // 'Æ'
                break;
            case '\x94': // 'ö'
                out[i] = '\x99'; // 'Ö'
                break;
            case '\xA4': // 'ñ'
                out[i] = '\xA5'; // 'Ñ'
                break;
        }

        success = true;
    }

    return out;
}

Technically the DF version modifies the original string instead of returning a new one. I also simplified some logic. Other than that, this is the capitalization function that DF calls from the translate_name global function in v50.11.

You'll note that it skips capitalizing letters in brackets (probably token related*) and has more sophisticated rules than just capitalizing after a space.

*And while you're busy questioning why a token would be in a name, let me inform you that DF doesn't sanitize nicknames before parsing them into text boxes, so at least those tokens won't be erroneously capitalized!

ab9rf commented 7 months ago

what's the source address (please remember to specify DF version and edition since addresses vary across builds) of that reverse-engineered function? it may make sense to request this one as an export instead of reengineering it.

(hm, if it's called from the exported translate_name i can probably find it that way)

found it: FUN_1403c2620 (50.11 Windows STEAM)

Bumber64 commented 7 months ago

it may make sense to request this one as an export instead of reengineering it.

As I mentioned, DF's version modifies the string in place instead of creating a new one. I guess we could have our function copy the string first, then call the exported one? But our all_words = false version can't use that either way. (Is that version actually used anywhere? Considering all it does is call toupper on the first char of the string.)

quietust commented 7 months ago

I'm pretty sure the function you reverse-engineered is actually void capitalize_string_words(string &str), straight out of g_src/basics.cpp (which also contains functions "upper_case_string" and "lower_case_string" which work very similarly).

ab9rf commented 7 months ago

But our all_words = false version can't use that either way.

which is what void capitalize_string_first_word(string &str) (also in g_src/basics.cpp) is for

ab9rf commented 7 months ago

as an aside, having simplify_string available for searching would be helpful, since that's what allows e.g. 'a' to match all of the things that "look like" an 'a'

lethosor commented 7 months ago

We already have our own version of simplify_string() in https://github.com/DFHack/dfhack/blob/68190c3fbb37694a2ff2eb249a7e2b4a93b0c949/library/MiscUtils.cpp#L154. Admittedly not identical to DF's, but ours handles some things that DF's does not, e.g. Æ.