PrismaticFlower / swbf-unmunge

A totally groovy tool for getting access to game files for a game from 2005.
MIT License
26 stars 4 forks source link

Better string recovery #23

Closed BAD-AL closed 3 years ago

BAD-AL commented 3 years ago
PrismaticFlower commented 3 years ago

I'm very sorry about the slow response! I didn't have a lot of headspace these past couple days.

There are some changes I'd like to request before merging this to bring it inline with the style of the rest of code base. They're basically entirely my fault for not having contributor guidelines written up however.

Before I do that though there is just a design element for the feature I want to talk about. I notice that this appears to introduce a requirement on dictionary.txt being carried alongside the tool. I'm unsure about this requirement as I feel it's surprising for a command line tool to not work because a .txt file was missing from it's directory (a user might not think to copy it around).

For end users I feel like it could be better to maintain the existing builtin list of hashes and provide a command line option to overlay extra strings on top of that. The way to best design that is bit trickier of course.

One possible way would be to add a Swbf_fnv_hashes class that could be constructed in extract_file and passed down through the unmunge process the same way File_saver currently is. https://github.com/SleepKiller/swbf-unmunge/blob/d0d2ab5474982adf5899389ccbb0b8dcf3bd9325/src/main.cpp#L29-L48

That classes lookup method could then first try looking in the existing immutable map and if it failed to find the hash in that it could fallback to a map of strings from a file that had been passed through on the command line. And if no file had been passed through on the command line functionality would continue to be as it currently is.

A design such as that would I feel avoid the issue of introducing a subtle dependency on a text file and also allow future extensibility. Such as potentially a feature that collected and built a list of strings from all the .lvls in a mod's addon folder.

All that said I get that that could be quite a bit more work than you'd have wanted to do in opening this PR. So if that is case just say, as there are simpler designs that while less extensible in the future will be perfectly serviceable currently. Such as a half-way between what you have now and how it currently works in 1.2.

BAD-AL commented 3 years ago

One thing that I intended to add but forgot to was the lvl's filename to hashes (I can do that now). I played around with ripping through the .lvl file looking for strings, but that effort ended up with minimal helpful results. Adding the lvl's filename to the hashes is what ended up being most helpful.

I do understand about the concern about the tag-along dictionary and believe that I have an idea on how to accomplish your request elegantly without too much effort.

BAD-AL commented 3 years ago

One more thing to consider about using an external dictionary as a dependency that the program already is dependent on files in the same directory [fmt.dll, tbb.dll]. So when the user wants to transport the program elsewhere, there is already baggage they must carry around with it.

BAD-AL commented 3 years ago

Upon further review, we could improve the results by processing the embedded object/odf files first and adding the property values to the hashed strings map. I'm not exactly sure how much re-factoring we'd end up doing for this, but it seems like a non-trivial amount.

Or we could print out the object property values that were previously unknown (with a command line option) so that the user could create their own dictionary with the 'previously unknown' strings and try again (this is certainly easier).

PrismaticFlower commented 3 years ago

One more thing to consider about using an external dictionary as a dependency that the program already is dependent on files in the same directory [fmt.dll, tbb.dll]. So when the user wants to transport the program elsewhere, there is already baggage they must carry around with it.

Yes that is very true. This wasn't the case when I first made swbf-unmunge but I think vcpkg now supports static linking all the dependencies the tool needs. Assuming that is the case I feel it would be a good thing to enable, it'd keep the releases cleaner and make it easier to add other dependencies in the future without fear of cluttering the release folder.

I also do think that people are more likely to think to carry along the .dll files than any .txt files. But I could be wrong and people may just do Ctrl + A + Ctrl + C in any case though with static linking that would moot.

Upon further review, we could improve the results by processing the embedded object/odf files first and adding the property values to the hashed strings map. I'm not exactly sure how much re-factoring we'd end up doing for this, but it seems like a non-trivial amount.

Or we could print out the object property values that were previously unknown (with a command line option) so that the user could create their own dictionary with the 'previously unknown' strings and try again (this is certainly easier).

Something like is what I have in mind with building a list of strings from an .lvl. However I agree that the needed refactoring with attaching such functionality onto the current readers would be problematic. Which is to say nothing of readability concerns after the fact. Bolting on such functionality isn't necessarily going to be clean.

What I sort of have in mind is a preprocessing step that a user could run before extracting any .lvls. Where swbf-unmunge gets a directory and iterates through the .lvls it can find there. This step would build a list of strings found in the .lvls and then that list could be as you say saved out and then passed back into it for use during extraction.

For class definitions I think that we'd probably need to go a step further than simply collecting all the strings form them and actually have some minimal logic to read them properly, only grabbing strings from properties we know to contain them and parsing the strings on properties correctly.

Take this as one random example. In this case it's like cis_fly_droidfighter_shift_up_low_property is the part of the string we're interested in so we want to correctly take that part of the string only. I think there are other cases as well they'd need special handling to correctly grab the string.

BoostSound              = "cis_fly_droidfighter_shift_up_low_property 0.11 1"

We probably especially don't want superfluous strings from properties like AimerPitchLimits just in case we get hash collisions, we don't want to accidentally give a user a file named -90.0 90.0.fx. Just a contrived made up example of course but the problem is worth at least keeping in mind, however unlikely it is to happen.

I feel like I have something else I was going to add here as well but it's completely slipped my mind atm, lol. If it comes back to me I'll add another comment with it.

BAD-AL commented 3 years ago

Here is the summary of the requirements I'm envisioning with this discussion :

  1. Keep swbf_fnv_hashes.cpp as as it is and give the user the option of using a string dictionary.
  2. Add an option to generate a dictionary for a given directory (containing lvl files)
    • Process sub directories too
  3. When generating a dictionary, do not save output files (.tga, .msh, ...); only save the dictionary file.
  4. If the dictionary file already exists, append to it.
  5. When adding found strings and a string contains a space (' ') in it, add the substring(0, index_of_space)
    • Example: When you find: "cis_fly_droidfighter_shift_up_low_property 0.11 1" only add: "cis_fly_droidfighter_shift_up_low_property"
  6. Try to setup static linking to reduce baggage (fmt.dll tbb.dll).
    • Result: Failure
      • Error with 'DirectXTex' package "CMake Error at scripts/cmake/vcpkg_check_linkage.cmake:48 (message): directxtex only supports dynamic crt linkage"

Let me know if there's something else, I think these changes won't take too long.

PrismaticFlower commented 3 years ago
1. Keep swbf_fnv_hashes.cpp as as it is and give the user the option of using a string dictionary.

2. Add an option to generate a dictionary for a given directory (containing lvl files)

   * Process sub directories too

3. When generating a dictionary, do not save output files (.tga, .msh, ...); only save the dictionary file.

4. If the dictionary file already exists, append to it.

Yup sounds good. Just to clarify as well I feel like the code that collects strings and the extraction code (extract_***.cpp files) should be fully separate.

5. When adding found strings and a string contains a space (' ') in it, add the substring(0, index_of_space)

   * Example:
     When you find:  "cis_fly_droidfighter_shift_up_low_property 0.11 1"
     only add:  "cis_fly_droidfighter_shift_up_low_property"

Yeah. One thing I will reiterate though is I am concerned about the possibility of collisions from class properties. It is unlikely to ever be a problem but it's not impossible and would be very confusing for someone that ran into it. So while handling those care should be taken to only grab strings from properties of interest. Possibly validating that they don't any reserved characters before adding them to the dictionary, the last thing we would want is a file failing to save because it got assigned a name with a \ in it, in that case we'd want it to gracefully fallback to the current behavior.

6. Try to setup static linking to reduce baggage (fmt.dll tbb.dll).

   * Result: Failure

     * Error with 'DirectXTex' package
       "CMake Error at scripts/cmake/vcpkg_check_linkage.cmake:48 (message):
       directxtex only supports dynamic crt linkage"

Ah you might need to run git pull and then bootstrap-vcpkg.bat again. I'm pretty sure it should work as I recently switched another project over to static linking DirectXTex.

Anyway I feel like this can be done in two cleanly separate parts. Ignoring static linking that is, which is only really indirectly apart of this and can just be done at "some point" before the next release. First the changes to swbf_fnv_hashes.cpp to support loading a secondary set of strings from a file to use. And separately the ability to create a list of strings from a directory of .lvl files. Both things can be worked on and merged separately since they have no direct dependency on each other.

BAD-AL commented 3 years ago

Based on the shipped ODF file contents I'm going to push back on the requirement on 'reserved characters'. In some of the shipped odf files /lvls there is stuff like:

\\ChunkGeometryName = "yav_bldg_watchtower_hulk"
\\ChunkPhysics = "STATIC"
\\ChunkTrailEffect      = "com_sfx_chunktrail_sm"
\\ChunkSmokeEffect  = "com_sfx_smokeplume"
\\ChunkSmokeNodeName    = "hp_smoke1"

These look like they are 'commented out' and I guess they kind of are. But they get put in the object file. Example (unmunged tur_bldg_tower.odf):

tur\odf\tur_bldg_tower.odf:0xdd807455 = "yav_bldg_watchtower_hulk"
tur\odf\tur_bldg_tower.odf:0x8b6e7dc3 = "STATIC"
tur\odf\tur_bldg_tower.odf:0x88e0a44b = "com_sfx_chunktrail_sm"
tur\odf\tur_bldg_tower.odf:0xc74113de = "com_sfx_smokeplume"
tur\odf\tur_bldg_tower.odf:0x7aded410 = "hp_smoke1"

If we do not correctly resolve "\\ChunkGeometryName", the user will be left wondering what property actually gets the displayed value.

PrismaticFlower commented 3 years ago

I was mostly thinking of strings acquired through looking .lvl files. Class property names can't be acquired that way. It does raise interesting point that I had not considered though.

So it looks like the string list does already have some entries for badly commented properties. https://github.com/SleepKiller/swbf-unmunge/blob/d0d2ab5474982adf5899389ccbb0b8dcf3bd9325/src/swbf_fnv_hashes.cpp#L46-L53

There are a couple ways to look at this really. The first is that it's a good thing and it let's some .odf files be reconstructed more correctly. Another way is that it's a bad thing and needlessly increases the risk of hash collisions going forward as they don't actually affect the functionality of the resulting object class in anyway.

I'm of two minds here because a sporadic failure from the tool attempting to name a config file \\ChunkGeometryName wouldn't be great but it is unlikely to happen. And it could potentially be made a nonissue by simply looking up into the dynamic user supplied map first before trying the constant builtin one. So if we do go that way (or something with an equivalent effect) I'm probably happy to go ahead and add these to the current builtin map if you feel it's worth for aesthetics.

"\\ChunkGeometryName"_fnvp,
"\\ChunkPhysics"_fnvp,
"\\ChunkTrailEffect"_fnvp,
"\\ChunkSmokeEffect"_fnvp,
"\\ChunkSmokeNodeName"_fnvp,
BAD-AL commented 3 years ago

For now, I'm just committing:

  1. The 'dictionary read' stuff.
  2. Strings that improve the file unmunge process.

I'd like to create a separate pr for the dictionary build after 'dictionary read' is integrated.

PrismaticFlower commented 3 years ago

Sorry about the delay. I'll try to start going through this today as I'm able to.

BAD-AL commented 3 years ago

I pushed a couple commits to incorporate your suggestions. I resolved most of the comments with your suggestions. Just a bit more conversation left on this one though.

BAD-AL commented 3 years ago

Pushed in another commit. Please review the mutex handling in 'swbf_fnv_hashes.cpp' to ensure that I properly addressed your concerns.

BAD-AL commented 3 years ago

Committed a few more changes; I think all the concerns have been addressed. Please review to ensure.

PrismaticFlower commented 3 years ago

I'm just about to hit merge on this but a quick note is I have removed the extra section in anticipation of writing a proper CONTRIBUTING.md.

It's pretty much all looking good though and seems to work great. Thanks!