SeanPesce / DXMD-Translations

Language translation framework for Deus Ex: Mankind Divided and Deus Ex: Breach
GNU General Public License v2.0

A couple of requests #1

Open KillerBeer01 opened 2 years ago

KillerBeer01 commented 2 years ago

Hi!

First, I must congratulate you on successfully breaching ( :D ) MD's data file formats. As someone who spent a lot of time trying to figure out what's what in HR's .drm files and how to make sense of randomly scattered and mismatched game texts - trust me, I can appreciate the feat. Seeing as even fewer people (practically no one) make mods for MD, I believe that its engine is even more convoluted for someone without a native SDK, and my hopes to try my hand at MD's translation were nearing zero... and then I saw your project.

So, there are a few questions I have. First, would it be too hard to generate strings.json for the other languages already present in the game? Although the official localization for my language does exist, my previous experience says that there's no such thing as a game translation without glaring blunders, and there are always things to fix.

The second one would perhaps be harder - do you happen to know how to extract voice audio from the game's data files into a playable format and, more importantly, how to link it to the texts? So, if I see a string with content.id 1100392264 and text "If this was the only information we had, we probably would call it off.", I could find the respective file and listen to the way this line sounds in game. I firmly believe that hearing the original actors' voices makes a key difference between doing a more or less decent translation and a really "live" one.

Finally, the question that is probably the hardest and most time-consuming, and I'd completely understand if you don't want to go there, but I have to ask: any insights on where the information about game dialogues is stored and how it works? That is, I can see in the .json that all lines for a particular dialogue are grouped under the same resource name and id, but not the flow of the dialogue itself. My ultimate goal would be to make something that provides not just the left view but also something closer to the right one, so to speak: https://ibb.co/fqLVFzT

Thanks in advance. You're my only hope, Obi-Wan Kenobi :D .

SeanPesce commented 2 years ago

Thanks for the interest in the project!

My current understanding of the game files is extremely limited - I wrote custom tooling to extract inner files from the *.archive files, and subsequently from the *.pc_resourcelib files (mapped by the *.pc_headerlib files). From there I extracted the bulk of the English data from TextList files and the cutscene subtitles from a few .pc_resourcelib files that were quite puzzling to me. I got my tools working just enough to get the job done, so the code is a mess and there are plenty of parsing issues. The only tool that I feel has reached a releasable state is my archive extractor. A lot of the magic for this project is done by in-game code hooks that replace string data at run-time.

I haven't attempted to extract non-English strings from the game yet, though I agree that it's a good idea and I'd like to have JSON data sets for all the officially-supported languages. The thing is, I'm not actually sure if my copy of the game shipped with the non-English data; I'll have to take a look at some point.

As for the relationships between different files (video -> localized audio/subtitles, etc.) - I don't actually have an understanding of how that works yet. I believe the relations are defined inside scene/blueprint/entity/localization files, but I haven't done any reversing of these sub-formats yet. I'm currently using a brute-force approach with some custom in-game hooks to map video runtime IDs to their respective subtitles ~; I still have 24 cutscene subtitles that I need to find the video IDs for before cutscene translation will work 100%~.

~My immediate goal is to manually map out the last few video IDs to have a "complete" source data set for people to create new translations with. Once that's done,~ I need to create more run-time hooks to load extended fonts (e.g., to support the full Turkish alphabet). Once all that is done I plan to go back to reversing the game files to get a better understanding of the formats and relationships so that I can create tooling and/or documentation for other modders.

As far as dialog flow/state machine data, unfortunately I haven't looked at that at all yet so I don't have any helpful information. Hopefully one day I'll have the time.

Something to keep an eye on (and admittedly I haven't been myself) - the 2021 Guardians of the Galaxy game released by Square Enix runs a newer iteration of the Dawn Engine, so I imagine the game files will have many shared formats. GotG also probably has a bigger modding community than DXMD at the moment, so they might have already surpassed my knowledge of these formats (especially seeing as I've been doing all my work alone). It would probably be helpful to look into any tools/research that GotG modders have released to use them as a reference (and feel free to use this GitHub issue as a place to keep track of any such related projects).

I wish I could be of more help, but unfortunately that's all I know for now. These game files are extremely complex, and due to my limited schedule it takes me a long time to make progress on reversing them. Feel free to add me on Discord (SeanP#5604) if you have binary analysis/reverse engineering experience and plan to start working on DXMD - it's a lonely road out here haha.

KillerBeer01 commented 2 years ago

> I got my tools working just enough to get the job done, so the code is a mess and there are plenty of parsing issues.

Sounds exactly like my experience with HR :D.

Oh well. I did some binary analysis/reverse engineering with previous DX games, but honestly, Crystal Dynamics looked like it intentionally wanted to make things rotten for people like us, and from what I can see, Dawn is only worse in that regard. Not sure I'm ready for another experience like that right now... but we'll see.

As for now, I've not managed to launch MD with your update. I placed three files into /retail subfolder, and when I start DXMD.exe, it disappears from processes a couple of seconds later, the launcher window never appears. My version is Steam, v1.19 801.0, Windows 10, MD5 a227bc2145d592f0f945df9b882f96d8, language set to English. There's no error message, and I'm not sure if there's any log file to check. I also tried to replace strings.json with en_ascii.json file, to eliminate possible encoding factor, but with the same result.

SeanPesce commented 2 years ago

Interesting, which release of the mod are you running? There's still a bunch of logging/error checking I need to implement now that I have most of the core functionality done. For example, there are a lot of cases where the game will immediately crash if it encounters a JSON parsing issue. Can you make sure to use the latest (currently 2022-04-27) build, including the exact config and strings.json it shipped with?
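For anyone hitting the same silent crash, the file can be checked outside the game before launching. This is a minimal sketch, not part of the mod: `check_json` is a hypothetical helper, and it assumes the file is UTF-8 encoded, which this thread does not confirm. It only tests that the file is syntactically valid JSON, not that it matches the layout the mod expects.

```python
import json


def check_json(path):
    """Return None if the file parses as JSON, otherwise a readable error message."""
    try:
        # Assumption: the mod's JSON files are UTF-8 encoded.
        with open(path, encoding="utf-8") as handle:
            json.load(handle)
    except (OSError, UnicodeDecodeError, json.JSONDecodeError) as err:
        # JSONDecodeError messages include the line and column of the first problem.
        return f"{path}: {err}"
    return None
```

Calling `check_json("strings.json")` from the folder that holds the file returns `None` when the file parses cleanly and a message pointing at the first problem otherwise.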

I'll try to add some logging today so we'll have something to look at in the case of a crash.

KillerBeer01 commented 2 years ago

Right. I must have downloaded a release a few hours before you updated it. Now it works, although at first it complained about a missing vcruntime140_1.dll - maybe it's worth stressing the requirement for the latest C++ redistributable in the readme. Now I can play in Turkish. Teşekkür ederim, efendi! :D

SeanPesce commented 2 years ago

Here's a newer build with logging and additional error checks - I won't be able to publish it on the releases page until later today though. Thanks for the info about vcruntime140_1.dll, I'll have to make a note of that!

2022-04-28_tr.zip

KillerBeer01 commented 2 years ago

Hi again :D

> The thing is, I'm not actually sure if my copy of the game shipped with the non-English data; I'll have to take a look at some point.

Well, if your version's Game.layer.0.all.archive size is 24,594,575,271 bytes (as it is in mine), then the language data is definitely there. Check if you have the following files unpacked:

\29FC746F4AF35D7B7B35EB9525C8811C.pc_resourcelib
\4AFE0047ED07FB7A9AE557D5802CF369.pc_resourcelib
\4D22A53A7B1579F603AAC52E536456A4.pc_resourcelib
\727FA5ADE9F2D35832A3A3587D691D5F.pc_resourcelib
\82DA8D828381ABF21CDEDE82FDCDE97C.pc_resourcelib
\8F34D94C4F29783B2E0DD29D9EE2FE4F.pc_resourcelib
\AD8E2D9312E615F81809017421E35C8B.pc_resourcelib
\F879C480DB0BA7FE2FAD8CCB6529CC35.pc_resourcelib

These are various versions of the same dialogue (I haven't found the Japanese text):

{
 "resource": "[assembly:/localization/textlists/conversations/20_prague/06_pra_trainstation/m02_436_con/m02_436_con_en.textlist].pc_textlist",
 "id": 33074978347905056,
 "content": [
   {
     "id": 2121874423,
     "string": "Jensen! What's wrong, what are you doing here?"
   }, 

Not sure if it helps, though... I still don't have the slightest idea how those "code hooks" work in this engine.

SeanPesce commented 2 years ago

Looks like I do have some other language data then! IIRC I only extracted files with language specifiers of "en" or "all"; I'll have to make another pass at some point and see if I can extract the other language data.

KillerBeer01 commented 2 years ago

All texts were unpacked from the "all" file. As far as I understand, the "en" file only contains voice files, and additional files may be downloaded if you select a different voice in the "Languages" menu, while changing subtitles only requires a restart.

KillerBeer01 commented 1 year ago

Hi again :D.

It seems you're too busy with life to keep developing the project. So I have to ask: would it be possible to take a look at the code you used to extract texts from .pc_resourcelib files? I'm still eager to find where those non-English texts can be taken from, but the files look too intimidating to begin without at least some head start.

SeanPesce commented 1 year ago

Hah, you're definitely correct - I have quite a bit going on (both IRL and with other projects) that's prevented me from working on this recently.

Here is a rough document I made a while back that details some of my (very incomplete) knowledge of the pc_resourcelib format. I hope it'll be of some use to you while I'm busy - and obviously feel free to hit me up if you need more info!

KillerBeer01 commented 1 year ago

Thanks for that... unfortunately, while it does shed some light on the pc_headerlib structure, it's still too vague on the sizes of fields, and although I managed to more or less figure out the RefsChunk part enough to organize those headers into a semblance of a tree, BIN1Chunk is... too much BIN to grasp where one of its fields ends and another begins, and how many bytes must be read before the more familiar stuff can be retrieved. For example, I can see words that I can clearly identify as "Languages" but not an integer that would match the size of the array; I can see the BILR start of a HeaderLibDataChunk header, but much more than 24 bytes between it and the nearest "magic word" of a ResourceHeader... it's all very confusing. I'm trying to search GotG and Hitman Absolution forums for bits of usable info, but it seems that the data file formats in them are different just enough to doubt the reliability of anything found there.

Once again, props to you for managing to make at least some sense of this nightmare of an engine.

KillerBeer01 commented 1 year ago

Hi again,

I finally completed my own version of text extractor that worked for all languages. There are some differences between the output it produces for me and the version of texts that you have; some of them might be caused by differences in builds, and I'm not sure how the subtitle loader engine would interpret such. I'm uploading my current results so far, along with the version of your subtitles sorted in the same physical order, in case you want to analyze the differences in a file compare utility of choice.

https://fastupload.io/en/vJznYp6pdOWQ5ry/file

Some key points:

==============

"resource": "[assembly:/localization/textlists/ui/user_interface/ui_text_pc_only/ui_text_pc_only_en.textlist].pc_textlist",
"id": 36434583166238492,

Your version ends with two entries that are not in mine:

    {
      "id": 3815676240,
      "string": "This DLC is not available with your current steam game language setting. To play this DLC you must choose another language by selecting<B>Library</B>,<B>Deus Ex: Mankind Divided™</B>,<B>Properties</B> and then <B>Language</B>"
    },
    {
      "id": 30095272,
      "string": "This DLC is not available."
    }

And that's not a parser error on my part; these texts were simply not in the resource files. I suppose these entries can be added to the text file manually, but I wonder what would happen if a build that does not expect to have them finds itself being updated by the loader with these particular IDs.
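A difference like this can be listed mechanically rather than by eye. The following is only a sketch under assumptions: `diff_content_ids` is a hypothetical helper (not the repository's comparison script), and it assumes both arguments are textlist entries shaped like the excerpts in this thread, with a `content` list of `{"id", "string"}` objects.

```python
def diff_content_ids(first, second):
    """Compare two textlist entries by string ID.

    Both arguments are assumed to look like the excerpts quoted in this thread:
    {"resource": ..., "id": ..., "content": [{"id": ..., "string": ...}, ...]}
    """
    a = {item["id"]: item["string"] for item in first["content"]}
    b = {item["id"]: item["string"] for item in second["content"]}
    return {
        "only_in_first": sorted(a.keys() - b.keys()),
        "only_in_second": sorted(b.keys() - a.keys()),
        # IDs present in both entries whose text differs.
        "changed": sorted(i for i in a.keys() & b.keys() if a[i] != b[i]),
    }
```

Running it on the two versions of one textlist gives the IDs that exist on only one side, which is the situation described above for 3815676240 and 30095272.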

==============

These subtitles have additional texts:

      "video_id": 68324659120964398,
      "subs": [
        {
          "start": "1.50",
          "end": "6.50",
          "string": "Translation Mod by Sean Pesce"
        },
        {
          "start": "16.20",
          "end": "23.70",
          "string": "I once thought I could save the world. Now look at it."
        },
        {
          "start": "28.00",
          "end": "41.70",
   -->  "string": "In yet another augmented terror attack, 251 passengers aboard Cista Airlines flight 451 were killed when an augmented passenger broke into the plane's cockpit and ruthlessly butchered its flight crew."
        },
        {
          "start": "44.50",
          "end": "51.00",
          "string": "[Arabic] Kill him. Kill him now!"
        },
        {
          "start": "51.50",
          "end": "68.00",
     -->  "string": "Details recovered from the blackbox recorder suggest that the man may have been suffering flashbacks to the \"Aug Incident\" - that horrible day two years ago when augmented people all over the world flew into a psychotic killing spree, causing the greatest loss of life in recent history."
        },

{
  "resource": "[[assembly:/scenes/_default/default_scene.entity].pc_resourcelibdef](0009).pc_resourcelib",
  "id": "2BD0EB53CEBEBE2F92C48ECBAF7512CD",
  "content": [
    {
      "video_id": 31647727641083435,
      "subs": [
        {
          "start": "0.30",
          "end": "7.50",
   -->    "string": "Rippers: The next-generation of hackers who use virtual reality simulations to extract data from secure corporate servers.\n\nNeo-English Dictionary\nCopyright © 2029"
        },
        {
          "start": "8.27",
          "end": "17.03",
          "string": "Investors around the world today rejoiced at a decision by the Czech Republic to approve expansion plans for the Palisade Bank Corporation. "
        },
        {
          "start": "17.03",
          "end": "25.10",
          "string": "Located in Prague, the bank's iconic Blade facilities hold the largest and most secure data-archiving vaults in the world. "
        },
        {
          "start": "26.23",
          "end": "35.47",
          "string": "Cutting-edge security measures have protected the sensitive secrets of mega-corporations and influential individuals since privacy laws first passed."
        },
        {
          "start": "35.48",
          "end": "36.22",
   -->       "string": "[(Typing) For years they thought their secrets were safe. And they were right. Until now.]"
        },
        {
          "start": "36.23",
          "end": "41.00",
   -->    "string": "[(Typing) For years they thought their secrets were safe. And they were right. Until now.]\n\nSources close to the bank tell me that no hacker has ever come close to breaching the Blade's defenses, despite an almost astronomical number of attempts. "
        },
        {
          "start": "41.00",
          "end": "45.40",
          "string": "Sources close to the bank tell me that no hacker has ever come close to breaching the Blade's defenses, despite an almost astronomical number of attempts. "
        },
        {
          "start": "47.60",
          "end": "51.67",
          "string": "Between you and me, folks, I think they are just wasting their time. "
        },
        {
          "start": "52.53",
          "end": "57.80",
          "string": "This is Eliza Cassan, reporting to you live -- from Picus."
        },
        {
          "start": "60.00",
          "end": "61.50",
   -->    "string": "[Send]"
        },
        {
          "start": "74.00",
          "end": "84.50",
   -->    "string": "[Rippers,\n\nFor years they thought their secrets were safe. And they were right. Until now.\n\nWe can finally access the Blades. Fire up your NSN kits. Join the fight. Make a difference.\n\nExtract the data. Expose the truth.]"
        }
      ]
    }
  ]

I know that the Arabic line was added by you (and I know that for subtitles the engine allows such frivolities), but I'm curious - are these other texts also transcriptions made by you, or does your build of the game actually have them, so that they could be found in the translated files as well?

==============

      "video_id": 52920086065632881,
      "subs": [
        {
          "start": "0.90",
          "end": "6.73",
          "string": "Back now to that confusing tale of life... or is it death? ... coming out of the United States. "
        },

Just as expected, that entry has a full story covered; I don't know why your version got choked on it.

==============

      "video_id": 29878175660670857,
      "subs": [
        {
          "start": "9.07",
          "end": "10.27",
          "string": "I'm unarmed."
        },

Now this one is weirder. Your version has two instances of the same dialogue mapped to two different videos - 29878175660670857 and 40626141800414365. Actually that's not the only case of such double mappings, but in other cases the additional video file is always registered in the same pc_header file where the respective subtitles are found. This one only has these:

60058800799365601
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_320_cut/m02_320_cut.bk2].pc_binkvid
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_320_cut/m02_320_cut_en.textlist].pc_textlist

10454759125141187
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_300_cut/m02_300_cut.bk2].pc_binkvid
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_300_cut/m02_300_subtitles_en.textlist].pc_textlist

38816527324695300
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_325_cut/m02_325_cut.bk2].pc_binkvid
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_325_cut/m02_325_cut_en.textlist].pc_textlist

40626141800414365
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_305_cut/m02_305_cut.bk2].pc_binkvid
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_305_cut/m02_305_cut_en.textlist].pc_textlist

66131956579257411
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_690_cut/m02_690_cut.bk2].pc_binkvid
[assembly:/cinematiques/10_cutscenes/30_golemcity/m02_690_cut/m02_690_cut_en.textlist].pc_textlist

The textlists are subtitle resources that use each particular video. I'm not sure where your version got the ID for the 29878175660670857.
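Double mappings of this kind can be located by grouping video IDs on identical subtitle content. The sketch below rests on assumptions: `find_shared_subtitle_tracks` is a hypothetical helper, and the input is assumed to be a list of subtitle containers shaped like the excerpts above (`content` holding `video_id` and `subs` with `start`, `end`, `string`).

```python
from collections import defaultdict


def find_shared_subtitle_tracks(containers):
    """Return lists of video IDs whose subtitle tracks are identical.

    `containers` is assumed to be a list of entries shaped like the excerpts
    in this thread: {"content": [{"video_id": ..., "subs": [...]}, ...]}.
    """
    by_track = defaultdict(list)
    for container in containers:
        for video in container.get("content", []):
            # A track is identified by its full sequence of timed lines.
            key = tuple(
                (sub["start"], sub["end"], sub["string"])
                for sub in video.get("subs", [])
            )
            if key:  # ignore videos without any subtitle lines
                by_track[key].append(video["video_id"])
    return [ids for ids in by_track.values() if len(ids) > 1]
```

Each returned list is a set of video IDs sharing one track, so a pair such as 29878175660670857 and 40626141800414365 would show up together if their subtitles match exactly; tracks that differ in even one timing or character are treated as distinct.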

===============

    {
      "video_id": 67165845848537912,
      "subs": [
        {
          "start": "9.47",
          "end": "11.90",
          "string": "Macready. It's over."
        }
      ]
    }

Another one - in your version it goes by the ID 67165845848537912, in mine 27730722554329936. I don't know if it would be safe to just include both versions and let the engine sort it out.

SeanPesce commented 1 year ago

Hey Kimed, sorry for the extremely delayed response. Excellent work with the extraction software! To answer some of your questions:

EDIT: Also, looks like that FastUpload URL expired. Do you mind re-uploading at some point? You can attach files to GitHub comments too, if the file size limit isn't exceeded.

KillerBeer01 commented 1 year ago

Long time no see :D.

Sorry for that FastUpload link. I'm doing the file analysis and decompiling on my work computer that has a nice and ready programming environment configured, but also has most of the popular and more convenient filesharing platforms blocked for security reasons, so I had to make do with what I could find. And I don't think that GitHub will accept >20MB files in attachments.

Here's the new upload: https://fastupload.io/IIr3PuH36HxTFRM/file . These are raw texts decompiled from current Steam version of the game, not altered in any way to match yours. Full versions of 52920086065632881 are all there.

SeanPesce commented 1 year ago

Awesome, thanks! I'll go through that data at some point, update the language sets in the repo, and add you as a contributor in the Readme.

SeanPesce commented 1 year ago

@Kimed, I just want to make sure I'm crediting you correctly; do you also go by the name "KillerBeer"?

KillerBeer01 commented 1 year ago

Yes, I do... in fact that's the name I go by mostly when online, including my DeusEx related activities, so if you credit me for those, please use it. It's just that when creating this Github account some time ago, I wasn't sure whether I'll be using it for my Internet or real-life persona's needs :D.

SeanPesce commented 1 year ago

Alright, I implemented fixes to the "authoritative" English/Turkish language sets, credited you in the main Readme, and added a utility script for comparing data sets.

Here's the analysis summary from the script when comparing my "authoritative" data set (which contains many manual fixes/improvements) and the default one you extracted from your game:

$ python3 rsrc/scripts/compare_language_sets.py all rsrc/languages/en.json rsrc/languages/KillerBeer/StringsEnglish.json
...
Statistics for rsrc/languages/KillerBeer/StringsEnglish.json:
  Textlists:
    New: 0
    Deleted: 0
    Modified: 1
  Subtitle containers:
    New: 0
    Deleted: 2
    Modified: 12
  Subtitle videos:
    New: 1
    Deleted: 3
    Modified: 13
  Subtitle lines:
    New: 0
    Deleted: 4
    Modified: 32

Here's the full analysis output:

```json
$ python3 rsrc/scripts/compare_language_sets.py all rsrc/languages/en.json rsrc/languages/KillerBeer/StringsEnglish.json
{
  "textlists": {
    "deleted": [],
    "new": [],
    "modified": {
      "36434583166238492": { "deleted": [ 3815676240, 30095272 ], "new": [], "modified": [] }
    }
  },
  "subtitles": {
    "deleted": [ "1F0452BD722F7FA4E5941EFD253CE37E", "4A713C9BB76D7C782ADE9C5B68D3FA6C" ],
    "new": [],
    "modified": {
      "97DBC00FD08A9E1ABA8DD54491E1A73E": {
        "deleted": [],
        "new": [],
        "modified": {
          "68324659120964398": { "deleted": 4, "new": 0, "modified": [] }
        }
      },
      "5D77985A57E1A7E79A0E44C7BE87D2E3": {
        "deleted": [],
        "new": [],
        "modified": {
          "30865381867509710": { "deleted": 0, "new": 0, "modified": [ "255.53" ] }
        }
      },
      "2BD0EB53CEBEBE2F92C48ECBAF7512CD": { "deleted": [ 31647727641083435 ], "new": [ 0 ], "modified": {} },
      "F3A46EB2563FF107BF5C8FEA6C54D769": {
        "deleted": [],
        "new": [],
        "modified": {
          "29863866487701373": { "deleted": 0, "new": 0, "modified": [ "9.33" ] },
          "13056493002109790": { "deleted": 0, "new": 0, "modified": [ "15.10" ] },
          "46781563123874309": { "deleted": 0, "new": 0, "modified": [ "9.37" ] },
          "879358728283020": { "deleted": 0, "new": 0, "modified": [ "9.37" ] }
        }
      },
      "137F6A3DE580FA17AD5A509B69EB8757": { "deleted": [], "new": [], "modified": {} },
      "F44BE437A78BEE6352B368F3DEAD2E30": {
        "deleted": [],
        "new": [],
        "modified": {
          "47901045612977460": { "deleted": 0, "new": 0, "modified": [ "9.13" ] }
        }
      },
      "FF3C49467B14D66C47E857D82C7A300F": {
        "deleted": [ 29878175660670857 ],
        "new": [],
        "modified": {
          "38816527324695300": { "deleted": 0, "new": 0, "modified": [ "20.40" ] },
          "60058800799365601": { "deleted": 0, "new": 0, "modified": [ "20.40" ] }
        }
      },
      "4F83D948793948CB57FC65C84F99EB6E": { "deleted": [], "new": [], "modified": {} },
      "A32CFC457C3B81981EB264FB6CDDE240": { "deleted": [ 67165845848537912 ], "new": [], "modified": {} },
      "F3E5230B58FB158967A82F7C3AC83892": {
        "deleted": [],
        "new": [],
        "modified": {
          "52415737138038043": { "deleted": 0, "new": 0, "modified": [ "1.33", "17.80", "23.90", "29.79", "34.8", "40.97", "49.02", "55.09", "57.81", "70.67", "80.67" ] },
          "1820183926160237": { "deleted": 0, "new": 0, "modified": [ "61.058" ] },
          "52920086065632881": { "deleted": 0, "new": 0, "modified": [ "29.10" ] },
          "47590750546332126": { "deleted": 0, "new": 0, "modified": [ "0.60", "6.04", "10.00", "14.15", "26.95", "32.46", "38.74", "44.10", "48.74", "55.25", "61.53" ] }
        }
      },
      "ED589F63B540CE7DD7AAECB73C7E1C18": { "deleted": [], "new": [], "modified": {} },
      "4F0F01263E68A17979C3FE2C880CCA4D": { "deleted": [], "new": [], "modified": {} }
    }
  }
}

Statistics for rsrc/languages/KillerBeer/StringsEnglish.json:
  Textlists:
    New: 0
    Deleted: 0
    Modified: 1
  Subtitle containers:
    New: 0
    Deleted: 2
    Modified: 12
  Subtitle videos:
    New: 1
    Deleted: 3
    Modified: 13
  Subtitle lines:
    New: 0
    Deleted: 4
    Modified: 32
```

KillerBeer01 commented 1 year ago

And hello once again :)

I see that you've added my translation fix project files to your git. Just FYI, I've recently updated that translation with results of my first in-game testing marathon - perhaps I'll eventually keep adding some more fixes during future playthroughs, but the main body of work is more or less done.

https://fastupload.io/MEci4rIpDVJ2mKf/file

I've added to the archive something else you might be interested in - I ran the English file through MS Word's spellchecker and implemented some of its proposed fixes - even though most of them would be in non-visible barks, I suppose it's better to have them than not. I also adjusted some timings in cutscenes to more closely match the audio, splitting some lines where it felt justified, and transcribed Eliza's news where they could be added.
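After hand-editing timings, a quick consistency pass over each track can catch slips such as an end time that precedes its start. This is a sketch only: `check_timings` is a hypothetical helper, it assumes `start` and `end` are decimal-second strings as in the excerpts earlier in this thread, and whether overlapping lines actually bother the engine is not established here.

```python
def check_timings(subs):
    """Report suspicious timings in one subtitle track.

    `subs` is assumed to be a list of {"start": "1.50", "end": "6.50", ...}
    entries in playback order, with times as decimal-second strings.
    """
    problems = []
    prev_end = None
    for index, sub in enumerate(subs):
        start, end = float(sub["start"]), float(sub["end"])
        if end <= start:
            problems.append((index, "end is not after start"))
        if prev_end is not None and start < prev_end:
            problems.append((index, "starts before previous line ends"))
        prev_end = end
    return problems
```

A line that starts exactly when the previous one ends (for example "41.00" after "41.00") is not flagged, since back-to-back lines appear in the existing data.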

KillerBeer01 commented 1 year ago

By the way, there's a strange glitch I'm observing. In a new game, when video 31294679724924917 (arrival in Prague after Dubai) is playing, the original (unedited) subtitle is shown. This is similar to the "first video" bug you've been warning about, except that this video physically can't be anything but one item in a long row of other videos. I've verified it with the Turkish version of the translation, and the glitch is there as well.