Lattyware / unrpa

A program to extract files from the RPA archive format.
http://www.lattyware.co.uk/projects/unrpa/
GNU General Public License v3.0

Custom loaders. #15

Closed Lattyware closed 5 years ago

Lattyware commented 5 years ago

It appears some Ren'Py games are starting to ship with custom loader scripts for a non-standard variant of RPA archives. unrpa currently can't deal with these archives.

The archives seen in the wild that do this are identifiable because they begin with a ZiX-12B header rather than the expected RPA-3.0/RPA-2.0. This appears to be an in-house obfuscation technique.
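As a rough illustration of identifying these archives by header, here is a minimal sketch. The function name `classify_archive` and the returned labels are hypothetical and are not part of unrpa; the only facts taken from this thread are the three header strings.

```python
import io

# Header prefixes mentioned in this thread.
STANDARD_HEADERS = (b"RPA-3.0", b"RPA-2.0")
CUSTOM_HEADER = b"ZiX-12B"


def classify_archive(stream):
    """Classify an archive by the first line of a binary stream.

    Hypothetical helper: returns "custom", "standard" or "unknown".
    """
    first_line = stream.readline()
    if first_line.startswith(CUSTOM_HEADER):
        return "custom"
    if first_line.startswith(STANDARD_HEADERS):
        return "standard"
    return "unknown"
```

It would be called on a file opened in binary mode, e.g. `classify_archive(open(path, "rb"))`.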

The route to decoding these files is to use uncompyle6 to turn 'loader.pyo' from the game the archive comes from into readable code. This should allow you to modify unrpa to load the archive. The loader appears to use a compiled Cython module called _string to perform parts of the process.

It appears the system is to use a hard-coded key in the loader. Ideally, we could identify this type of archive by the header and offer additional tooling to extract that key, alongside an option to manually set the key as an argument.

(This is the root cause of #13).

Edit: There is a script to make extracting these possible, but proper support isn't here yet. See below for details on how to extract an archive of this type now.

Edit: For transparency, I will note I worked out who the developer was who created this technique, and had their name listed here previously. At their request, I have removed a direct reference to them from this post, as it's not really relevant. The partial support for the format and the documentation of the effort here will remain up, however. I am still happy to accept pull requests to solve this issue properly and add full support to unrpa.

Lattyware commented 5 years ago

You can extract these files with the following process:

Now modify unrpa to replace the offset and key under elif self.version == 3: with the values you just obtained. E.g:

            elif self.version == 3:
                line = f.readline()
                parts = line.split()
                offset = 141453332
                key = 572015977

If you force the version to 3, this should now successfully decode the archive.
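The manual edit above amounts to ignoring the values in the header line and substituting known ones. A minimal sketch of that idea follows. It assumes the standard RPA-3.0 first line carries a hexadecimal offset and key as its second and third fields; the function name `read_rpa3_header` and its override parameters are hypothetical and do not come from unrpa.

```python
def read_rpa3_header(line, offset_override=None, key_override=None):
    """Return (offset, key) for a version-3 style header line.

    Assumption: the line looks like b"RPA-3.0 <offset hex> <key hex>".
    A header field is only parsed when no override is given for it, so a
    non-standard header does not raise when both overrides are supplied.
    """
    parts = line.split()
    offset = offset_override if offset_override is not None else int(parts[1], 16)
    key = key_override if key_override is not None else int(parts[2], 16)
    return offset, key
```

With the example values from this comment, `read_rpa3_header(line, 141453332, 572015977)` returns them unchanged regardless of what the header contains.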

Clearly, this is a massive pain. It could be made a little nicer by having command line arguments to override key/offset, and further by reverse-engineering the two methods from the _string module, which is presumably a cython module.
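For illustration only, command line overrides could look something like the sketch below. The `--key` and `--offset` flags are hypothetical and are not options that unrpa is known to provide; this only shows the general shape using the standard `argparse` module.

```python
import argparse


def parse_int(value):
    """Accept decimal ("572015977") or prefixed hex ("0x10") integers."""
    return int(value, 0)


def build_parser():
    """Build a parser with hypothetical key/offset override options."""
    parser = argparse.ArgumentParser(
        description="Sketch of override options (hypothetical)."
    )
    parser.add_argument("archive", help="path to the archive to extract")
    parser.add_argument("--key", type=parse_int, default=None,
                        help="override the deobfuscation key")
    parser.add_argument("--offset", type=parse_int, default=None,
                        help="override the index offset")
    return parser
```

Leaving both options at `None` by default lets the extraction code fall back to whatever the header says when no override is given.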

Lattyware commented 5 years ago

There is a final step - the directory structure and files are correct, but the extracted images are still scrambled. In the same 2.x environment as above, each file needs to be fixed with this process:

>>> import _string
>>> verificationcode = _string.sha1(...)
>>> rv = open("extracted.png", "rb")
>>> out = open("extracted_fixed.png", "wb")
>>> out.write(_string.run(rv.read(64), verificationcode) + rv.read())

Obviously this is a pain to do by hand. Automating it would be nice, but as we are relying on _string, which is built for Python 2.x, it would have to be a separate script. The ideal solution is reverse-engineering _string, as mentioned previously.
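A minimal sketch of automating the per-file step is shown here. The name `fix_file` is hypothetical. The transform is passed in as the `run` argument so that either `_string.run` or a replacement can be used, and the 64-byte prefix length is taken from the `rv.read(64)` call in the session above.

```python
OBFUSCATED_BYTES = 64  # prefix length, as in rv.read(64) above


def fix_file(src_path, dst_path, run, verification_code):
    """Write a copy of src_path whose first 64 bytes have been passed
    through run(); the remainder of the file is copied unchanged.

    `run` is any callable taking (data, verification_code) and returning
    the transformed bytes, for example _string.run.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        head = src.read(OBFUSCATED_BYTES)
        dst.write(run(head, verification_code))
        dst.write(src.read())
```

Unlike the interactive session, the `with` block makes sure both files are closed, so the output is flushed to disk before the function returns.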

Lattyware commented 5 years ago

The latest version has some extra handling to point the user here if they try and extract an archive of this type.

I have also made a secondary script that automates the above process - it will be somewhat fragile and still relies on the original _string module, meaning it isn't ideal. Reverse engineering that module will still be needed for proper support, but this should make it easier until then.

yetk commented 5 years ago

Hello, I am a beginner and I just read your write-up. "Take the _string.pyd/_string.so module for your platform from the lib sub-directory for your platform": could you please tell me where to find the file named _string.pyd/_string.so? Is it in the Python environment or the Ren'Py environment? I looked in both but couldn't find it.

Lattyware commented 5 years ago

@yetk The file will be inside the lib folder in the renpy folder of the game you are trying to extract from. The exact path will depend on your platform (Windows, Linux, Mac).

omegalink12 commented 5 years ago

I managed to track down a copy of _string.pyd (MD5 BCD019154309731EB1780546E2E82155) and reverse it. I now know more about cython internals than I ever wanted to. I made a python version of the module that should be easy enough to integrate. I've tested it on random input but not a complete archive. Looking at other games by the same company, they have different loader versions you could also support.

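import struct

# Works on Python 2; the ''.join(...) and b'' below also let it run on Python 3.

def sha1(code):
    # Keep only the digits of the code, then add a fixed constant.
    # ''.join(...) is needed on Python 3, where filter() returns an iterator.
    a = int(''.join(filter(str.isdigit, code))) + 102464652121606009
    b = round(a ** (1. / 3)) / 23 * 109
    return int(b)

def offset(offset):
    # Reorder the 8 hex digits: indices 7,6 then 0,1,2 then 5,4,3.
    a = offset[7:5:-1]
    b = offset[:3]
    c = offset[5:2:-1]
    return int(a + b + c, 16)

def run(s, key):
    keys = (3621826839565189698, 8167163782024462963, 5643161164948769306, 4940859562182903807, 2672489546482320731, 8917212212349173728, 7093854916990953299)
    out = b''  # a byte string on both Python 2 and Python 3
    # i advances by 8 and 8 % 7 == 1, so i % 7 walks the key table in order.
    # len(s) must be a multiple of 8, otherwise struct.unpack raises struct.error.
    for i in range(0, len(s), 8):
        enc = struct.unpack("<Q", s[i:i + 8])[0]
        dec = keys[i % 7] ^ key ^ enc
        out = out + struct.pack("<Q", dec)
    return out

Lattyware commented 5 years ago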

Nice work! I took a look at trying to reverse engineer it myself and it looked like a massive pain in the ass, so congrats on getting through that. As soon as I have the chance I'll take a look at integrating this into unrpa properly, which should be trivial enough given the pure-python implementation you have provided.

I am naturally open to adding any other formats found in the wild, feel free to throw me information about any other ones if you want support added.

Lattyware commented 5 years ago

Resolved as of 2.0.0 (f54191b7746d24a79d6264accdba5ce641364b15).

omegalink12 commented 5 years ago

Minor Note: obfuscated_amount is also loader dependent.

omegalink12 commented 5 years ago

In general, we need to take care with post-processing, as it only applies to some RPA archives.

Lattyware commented 5 years ago

The default postprocessing() implementation is just a pass-through that does nothing - only the ZiX-12B implementation does anything there.

I see how obfuscated_amount could be varied. I'll write up a fix that is dynamic over that. If you have any other examples of the format, I'd love to have some more test cases.
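To make the pass-through idea concrete, here is a minimal sketch with `obfuscated_amount` as a per-instance setting. The class names, constructor and method signature are hypothetical and are not copied from unrpa; the default of 64 is an assumption based on the 64-byte prefix used earlier in this thread.

```python
class Version(object):
    """Hypothetical base class: post-processing does nothing by default."""

    def postprocessing(self, data):
        return data


class ZiX12B(Version):
    """Hypothetical ZiX-12B handler that transforms only a leading prefix."""

    def __init__(self, transform, obfuscated_amount=64):
        # transform: callable taking bytes and returning bytes.
        # obfuscated_amount: prefix length, which is loader dependent.
        self.transform = transform
        self.obfuscated_amount = obfuscated_amount

    def postprocessing(self, data):
        n = self.obfuscated_amount
        return self.transform(data[:n]) + data[n:]
```

Keeping the amount as a constructor argument means a different loader can be supported by creating the handler with a different value instead of editing the class.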

omegalink12 commented 5 years ago

The problem with the ZiX-12B implementation is that it applies post-processing to all ZiX archives. However, the loader only applies it to specific ones. The list of said archives is also loader specific. See other VNs by the same company for examples.

Lattyware commented 5 years ago

Oh, I see. That's something I didn't even think to look for. I'll fix that along with the other change and push a new version when I get a chance. Let me know if there is anything else you notice, and thanks for all the help getting this one implemented.

It's actually nicer than I thought - I was assuming it was based on archive name, but it's not - the ones without post-processing have a ZiX-12A header instead, so they are just a separate format to be handled without the post-processing.
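In other words, the decision can be made from the header alone. A minimal sketch, with a hypothetical lookup table and function name that are not taken from unrpa:

```python
# Per this thread: ZiX-12B files need post-processing, ZiX-12A files do not.
NEEDS_POSTPROCESSING = {
    b"ZiX-12A": False,
    b"ZiX-12B": True,
}


def needs_postprocessing(first_line):
    """Decide from the archive's first line whether extracted files
    must be post-processed. Unlisted headers are left untouched."""
    for header, flag in NEEDS_POSTPROCESSING.items():
        if first_line.startswith(header):
            return flag
    return False
```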

Lattyware commented 5 years ago

Current concerns should be fixed as of 27ca4a65756be018c84bea22da4cf5c1f18da5ef. Let me know if anything else comes up.