bruvzg / gdsdecomp

Godot reverse engineering tools
MIT License

convert translations #88

Closed raylu closed 1 year ago

raylu commented 1 year ago

I started adding support for converting .translation files to something plaintext, but I ran into a v3-v4 issue

        } else if (importer == "csv_translation") {
            err = export_translation(output_dir, iinfo);

Error ImportExporter::export_translation(const String &output_dir, Ref<ImportInfo> &iinfo) {
    Error err;
    Ref<Translation> tr = ResourceLoader::load(iinfo->get_path(), "", ResourceFormatLoader::CACHE_MODE_IGNORE, &err);
    ERR_FAIL_COND_V_MSG(err != OK, err, "Could not load translation file " + iinfo->get_path());
    ERR_FAIL_COND_V_MSG(!tr.is_valid(), err, "Translation file " + iinfo->get_path() + " was not valid");
    List<StringName> messages;
    tr->get_message_list(&messages);
    // Print every key together with its translated message.
    for (const StringName &s : messages) {
        print_line(s, tr->get_message(s));
    }
    return OK;
}

this gives me

ERROR: Cannot get class 'PHashTranslation'.
   at: instantiate (core/object/class_db.cpp:325)
ERROR: res://translations.en.translation:Resource of unrecognized type in file: PHashTranslation.
   at: load (core/io/resource_format_binary.cpp:781)
ERROR: Failed loading resource: res://translations.en.translation. Make sure resources have been imported by opening the project in the editor at least once.
   at: _load (core/io/resource_loader.cpp:228)
ERROR: Could not load translation file res://translations.en.translation

based on https://github.com/godotengine/godot/blob/72b845b28773dd40adf6f55b226fb732910cbf14/editor/project_converter_3_to_4.cpp#L1493, PHashTranslation seems to be the name of OptimizedTranslation in v3

calling ClassDB::add_compatibility_class("PHashTranslation", "OptimizedTranslation"); ahead of time causes a segfault when I try to load it, so they don't seem to be compatible

I'm not really sure where to go from here. gdsdecomp doesn't build against v3 and I'm not sure how it's able to load other v3 assets

nikitalita commented 1 year ago

Awww, hell yeah. I’m very happy to see this.

You can’t instance objects at all here unless you’re certain they're v4 compatible, because the object definitions for most of the resources changed between godot versions.

This is the reason I wrote ResourceLoaderCompat; I needed to be able to load binary resources without actually instancing any of the objects and then convert them to text. I ended up using it to extract properties from resources for exporting (like textures) without having to instance them.

Since ResourceLoaderCompat doesn't instance any of the objects, you wouldn't have access to a "real" resource with all its functions, but you'd have access to the properties. So you'd be able to get the messages that way.

There are two options:

1) Manually extract the properties from translations by loading the resources with ResourceLoaderCompat, pulling out the messages property and whatever else you need, then outputting them as CSVs. For an example of how to do this, take a look at what I'm doing in texture_loader_compat.cpp for loading v2 textures and bitmaps. You'll have to add whatever class your code lives in (I suggest making a new one, like TranslationLoaderCompat) as a friend of ResourceLoaderCompat, because I don't currently make the internal resource properties openly available, though I should when it comes time to refactor.

2) Modify the "real load" in ResourceLoaderCompat so that it loads a compatibility class backported from v3 or v2, as appropriate. This is probably harder than option 1, and I don't actually use "real loading" in ResourceLoaderCompat for anything yet, so I don't recommend it unless you feel like taking on a challenge.

If you'd like, you can PR your current changes so I can see what you're doing and give you tips.

nikitalita commented 1 year ago

Also, if you want to take a look at how the translation resources are structured when stored, you can use the bin to text option in the GDRE tools menu; it's good to have a reference to just look back at. I'd recommend doing that for each major version; v2, v3, and v4, so you can see what the differences are.

edit: this is unnecessary, the structure didn't change, see below. Here is an example of a bin to text .translation file from v3:

[gd_resource type="PHashTranslation" format=2]

[resource]
hash_table = PoolIntArray( -1, -1, -1, -1, -1, -1, -1, 0, 6, 16, 2, <...>
bucket_table = PoolIntArray( 1, 1, -558281573, 507, 50, 76, 2, 1, <...>
strings = PoolByteArray( 254, 80, 33, 3, 3, 71, 117, 6, 22, 36, 18, <...>

nikitalita commented 1 year ago

Taking a look at the history of PHashTranslation, it doesn't actually have the messages property, it's just an optimized hash table. However, we got lucky here in that there aren't any actual changes to the underlying structure from v2 to v4, it was just pointlessly renamed to OptimizedTranslation. So all you would have to do is create an object pointer that is instantiated with the type OptimizedTranslation, set it with the properties extracted from ResourceLoaderCompat, then reference it as an actual OptimizedTranslation.

Example:

Object *obj = ClassDB::instantiate(type);
if (!obj) {
    return ERR_PARSE_ERROR;
}
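// Set the properties extracted via ResourceLoaderCompat.
// Properties in OptimizedTranslation:
//  Vector<int> hash_table;
//  Vector<int> bucket_table;
//  Vector<uint8_t> strings;
obj->set("hash_table", hash_table);
obj->set("bucket_table", bucket_table);
obj->set("strings", strings);
// Wrap the raw object in a typed reference so it can be used as a real OptimizedTranslation.
Ref<OptimizedTranslation> ref = Ref<OptimizedTranslation>(Object::cast_to<OptimizedTranslation>(obj));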

Then get the messages that way.

However, looking at the function implementations here, there doesn't seem to be a way to dump all the messages at once, and it's not a real HashMap so you can't dump the keys and values that way. You may have to create a child class of OptimizedTranslation and cast the OptimizedTranslation object, and write custom functions to get the individual elements.

But in either case, I'd try get_message_list and see what happens; it may be empty since it's not actually implemented in OptimizedTranslation and the parent function Translation::get_message_list() references the translation_map, which doesn't seem to be set in OptimizedTranslation.

nikitalita commented 1 year ago

> calling ClassDB::add_compatibility_class("PHashTranslation", "OptimizedTranslation"); ahead of time causes a segfault when I try to load it, so they don't seem to be compatible

btw, I tried to reproduce this using your examples, but I couldn't. I think you may have put this call inside the inner loop, so the compatibility class was added multiple times, which overflowed and caused the segfault. Try adding it once, outside the loop.

If that works, then this becomes a lot easier. You can do a real load using ResourceFormatLoaderCompat (which is recommended because ResourceFormatLoader can pollute the path cache):

Error ImportExporter::export_translation(const String &output_dir, Ref<ImportInfo> &iinfo) {
    Error err;
    ResourceFormatLoaderCompat rlc;
    // translation files are usually imported from one CSV and converted to multiple "<LOCALE>.translation" files
    for (String path : iinfo->dest_files) {
        Ref<Translation> tr = rlc.load(path, "", &err);
        // Report the .translation file that actually failed, not the source CSV.
        ERR_FAIL_COND_V_MSG(err != OK, err, "Could not load translation file " + path);
        ERR_FAIL_COND_V_MSG(!tr.is_valid(), err, "Translation file " + path + " was not valid");
        List<StringName> messages;
        tr->get_message_list(&messages);
        for (const StringName &s : messages) {
            print_line(s, tr->get_message(s));
        }
    }
    return OK;
}

BTW, I did test get_message_list and it does not work, unfortunately. The unit test even checks to make sure it doesn't work. So you will have to create a child class of OptimizedTranslation and figure out how to get the individual elements out of the hash table; take a look at struct Bucket in optimized_translation.h
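To make the "walk the buckets" idea concrete, here is a minimal standalone sketch, not gdsdecomp's or Godot's actual code. It assumes each bucket in bucket_table is laid out as 32-bit words `[size, func, then size × (key, str_offset, comp_size, uncomp_size)]`, and that hash_table holds either -1 or the word offset of a bucket; check that layout against struct Bucket before relying on it. The names `DumpedMessage` and `dump_messages` are hypothetical, and compressed strings (comp_size != uncomp_size) are only flagged, not decompressed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One recovered entry: only the HASH of the key is stored, never the key itself.
struct DumpedMessage {
    uint32_t key_hash;
    std::string text;
};

// Read one 32-bit word from a table, reinterpreting the signed storage as unsigned.
static uint32_t word_at(const std::vector<int32_t> &table, size_t i) {
    assert(i < table.size());
    return static_cast<uint32_t>(table[i]);
}

// Walk every non-empty hash_table slot and collect the elements of its bucket.
// ASSUMED bucket layout: [size, func, size * (key, str_offset, comp_size, uncomp_size)].
std::vector<DumpedMessage> dump_messages(const std::vector<int32_t> &hash_table,
        const std::vector<int32_t> &bucket_table,
        const std::vector<uint8_t> &strings) {
    std::vector<DumpedMessage> out;
    for (int32_t slot : hash_table) {
        if (slot < 0) {
            continue; // -1 marks an empty slot
        }
        const size_t p = static_cast<size_t>(slot);
        if (p + 2 > bucket_table.size()) {
            continue; // malformed offset, skip
        }
        const uint32_t size = word_at(bucket_table, p);
        // bucket_table[p + 1] is the per-bucket hash seed; not needed for dumping.
        for (uint32_t i = 0; i < size; i++) {
            const size_t e = p + 2 + static_cast<size_t>(i) * 4;
            if (e + 4 > bucket_table.size()) {
                break; // truncated bucket
            }
            const uint32_t key = word_at(bucket_table, e);
            const size_t offset = word_at(bucket_table, e + 1);
            const size_t comp = word_at(bucket_table, e + 2);
            const size_t uncomp = word_at(bucket_table, e + 3);
            if (comp != uncomp) {
                out.push_back({ key, "<compressed>" }); // decompression omitted in this sketch
                continue;
            }
            if (offset + uncomp > strings.size()) {
                continue; // string range out of bounds
            }
            std::string text(strings.begin() + offset, strings.begin() + offset + uncomp);
            while (!text.empty() && text.back() == '\0') {
                text.pop_back(); // drop a trailing NUL if one was stored
            }
            out.push_back({ key, text });
        }
    }
    return out;
}
```

The output pairs each message with a key hash only, which is why the values can be dumped but the original keys cannot.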

raylu commented 1 year ago

thanks for looking into this and explaining everything

I wasn't using ResourceFormatLoaderCompat, just regular ol' ResourceLoader::load

I have bad news though: the developer gave me the imported translation CSV, so this went from the top of my priority list to the bottom...

nikitalita commented 1 year ago

😭

nikitalita commented 1 year ago

I decided to implement it anyway. Give the standalone build artifacts from the CI run a try once they're finished building. https://github.com/bruvzg/gdsdecomp/actions/runs/3317312034

raylu commented 1 year ago

wow, nice! when I click to download "GDRE_tools-standalone-linux" on that page, the little blue progress bar at the top just slowly crawls but it never loads. when I curl it, it says HTTP request sent, awaiting response... 404 Not Found

shame that we're not always able to recover the keys :(

nikitalita commented 1 year ago

> wow, nice! when I click to download "GDRE_tools-standalone-linux" on that page, the little blue progress bar at the top just slowly crawls but it never loads. when I curl it, it says HTTP request sent, awaiting response... 404 Not Found

You have to be logged in to download it; try opening it up in a new tab.

> shame that we're not always able to recover the keys :(

Yeah, and there’s no real way to do it programmatically either. You can’t recover them from the hash values, and because the key can be literally anything and stored as any member value, there’s no way to search the project for it.

The best we could do is a Translation editor, where people could edit in new translations and we then store them as a new OptimizedTranslation with the hash values from other translations. That’s a lot of work though, which is why I just tell people in the warning message to ask the creator.

raylu commented 1 year ago

just tried the build and the .assets/translations.csv output is correct! it says they're missing keys but the game uses one of the languages as the keys and it either found that or that's the default translation or something (I didn't entirely understand the default_messages guessing code). if there's ever a discrepancy between the sheet I have and the game assets, this will help

nikitalita commented 1 year ago

How that works is: We search for the locale/fallback setting in the pck's project.godot to determine what the default language is. If it's not set, then it defaults to English. Then we retrieve the message values for each translation, and if one of them is the default fallback language, we store the message values for that language as default_messages. This is because it is likely that the message value for the default language will be the key or part of the key.

We then cycle through all the message values in the default translation, and try get_message(key) to determine the key by matching the message value with the message retrieved from get_message. The keys that we try are the message value itself and several permutations thereof (adding a prefix such as $$ or TL_, stripping punctuation, etc.). For example, the key for the message displayed in a "Password" box may be "$$Password". If one of them results in us getting a message value that matches what we have, we use that. If we can't find it, we store it as <MISSING KEY [message]>
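A rough standalone sketch of that guessing loop, under stated assumptions: `candidate_keys` and `guess_key` are hypothetical names, the lookup callback stands in for get_message, and the candidate list covers only the permutations named above (the real code tries more). The exact placeholder format is also an assumption based on the description.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>
#include <vector>

// Stand-in for Translation::get_message: returns "" when the key is unknown.
using Lookup = std::function<std::string(const std::string &)>;

// Build candidate keys from a message value: the value itself, the value with
// punctuation stripped, and each of those with a "$$" or "TL_" prefix.
std::vector<std::string> candidate_keys(const std::string &message) {
    std::string stripped;
    for (unsigned char c : message) {
        if (!std::ispunct(c)) {
            stripped += static_cast<char>(c);
        }
    }
    std::vector<std::string> out;
    for (const std::string &base : { message, stripped }) {
        out.push_back(base);
        out.push_back("$$" + base);
        out.push_back("TL_" + base);
    }
    return out;
}

// Return the first candidate whose lookup yields the message we already have,
// or a placeholder when no candidate matches.
std::string guess_key(const std::string &message, const Lookup &get_message) {
    assert(get_message && "lookup callback must be set");
    if (!message.empty()) {
        for (const std::string &key : candidate_keys(message)) {
            if (get_message(key) == message) {
                return key;
            }
        }
    }
    return "<MISSING KEY " + message + ">";
}
```

Because the check goes through the lookup itself, a guess is only accepted when the stored hash table actually resolves it to the same message.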

It sounds like the locale/fallback language may be set to something other than what they actually intended to be the default language. What language is the game in by default when you open it? I might want to look at the project to see if I can improve that.

raylu commented 1 year ago

the game I'm datamining has frequent updates and happens to ship the translation keys as a random language (yi_US) to help translators see where the keys are rendered in-game. so it's actually very helpful to extract just the strings (which happen to have the keys for this game)!

raylu commented 1 year ago

regarding your comment on the PR,

> When you import a translation CSV, it gets stored as OptimizedTranslation files that only store the hashes of the keys, rather than the keys themselves. It's not possible to recover the keys from the hashes, and we can't programmatically get them from project resources since they can be any string value and stored in anything.

are you saying that the original strings are in the project resources and we just don't know which one it is? what if we just hash every string and look for matches?

nikitalita commented 1 year ago

I had thought about that, but for any project with a non-trivial number of scripts and resources, that would be a huge number of strings and would be insanely slow. That might be justified if the objective is to recover the translation.csv, so it could be an optional thing, but there are a lot of modifications I would have to make to script/resource loading and parsing to make that happen. I'd have to load and parse every single resource and script and capture every string.
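For reference, the matching step itself is cheap; the cost is in collecting the candidate strings. Below is a minimal sketch of "hash every string and look for matches". `key_hash` is a stand-in multiplicative/XOR hash modeled on the one OptimizedTranslation appears to use, and may not match it bit-for-bit (non-ASCII bytes in particular); `match_candidates` is a hypothetical name. If buckets carry their own seed, as struct Bucket suggests, each candidate would need to be hashed with the seed of the bucket it lands in rather than a single seed as here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Stand-in string hash (ASSUMPTION: modeled on, not copied from, the engine).
// A seed of 0 selects the default seed.
uint32_t key_hash(uint32_t d, const std::string &s) {
    if (d == 0) {
        d = 0x1000193;
    }
    assert(d != 0); // the seed is never zero once the default is applied
    for (unsigned char c : s) {
        d = (d * 0x1000193u) ^ static_cast<uint32_t>(c);
    }
    return d;
}

// Hash every candidate string and keep those whose hash appears among the
// stored key hashes. The first candidate seen for a given hash wins.
std::unordered_map<uint32_t, std::string> match_candidates(
        const std::unordered_set<uint32_t> &stored_hashes,
        const std::vector<std::string> &candidates,
        uint32_t seed) {
    std::unordered_map<uint32_t, std::string> found;
    for (const std::string &c : candidates) {
        const uint32_t h = key_hash(seed, c);
        if (stored_hashes.count(h) != 0) {
            found.emplace(h, c);
        }
    }
    return found;
}
```

A match is only a strong hint, since two different strings can share a 32-bit hash; confirming it through get_message on the loaded translation would be the safer final check.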