Fixed parsing File offsets / sizes to make all files decode properly

JanFrederick00 commented 1 year ago

I noticed that some files were just a garbled binary mess. This was the result of an incorrectly detected file length (used as the starting value for the decoding).

I had a look at the way the file offsets / lengths are read from the index block and noticed that this was probably not the intended way to read these files. (Apologies if other people have already figured this out before me; I couldn't find anything.)

I hypothesized that what was being read (the list of offsets to strings) was only a table to be referenced by index later. I searched for the binary representation of the number of entries in the file and found a matching value at offset 0x14.
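The search described above can be sketched roughly as follows. This is only an illustration: the helper name `find_uint32_le` and the synthetic buffer are made up for the example, and it assumes the count is stored as a little-endian 32-bit value.

```python
import struct

def find_uint32_le(data: bytes, value: int) -> list[int]:
    """Return every offset at which `value` appears as a little-endian uint32."""
    needle = struct.pack("<I", value)
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets

# Synthetic buffer standing in for an index block (NOT real game data):
# 0x14 zero bytes followed by an entry count of 2.
sample = bytes(0x14) + struct.pack("<I", 2)
hits = find_uint32_le(sample, 2)
print([hex(offset) for offset in hits])
```

On a real index block the same value may show up at several offsets by coincidence, so each hit still has to be checked against the data that follows it.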

Theory: the following bytes must contain the entries' actual descriptions. The records offsets section for the file Weird.ggpack4a was significantly shorter than the one from Weird.ggpack1a, which I was using. This pattern of data repeats every 0x15 bytes: 02 03 00 00 00 01 00 04. The rest of the data is similar, but differs from file to file.

I printed out the list of Strings contained in this File:

[1]: "filename"
[2]: "MasterBank.strings.bank"
[3]: "offset"
[4]: "16"
[5]: "size"
[6]: "54282"
[7]: "MasterBank.bank"
[8]: "54304"
[9]: "5416174"
[10]: "guid"
[11]: "b554baf88ff004c50cc0214575794b8c"

If my theory was correct that every 0x15-byte entry contained some sort of dictionary, each one must contain references to the following strings:

File 0: "filename", "MasterBank.strings.bank", "offset", "16", "size", "54282"
File 1: "filename", "MasterBank.bank", "offset", "54304", "size", "5416174"

or, by index:

File 0: 1, 2, 3, 4, 5, 6
File 1: 1, 7, 3, 8, 5, 9

The actual pattern of bytes was:

File 0: 02 03 00 00 00 01 00 04 02 00 03 00 05 04 00 05 00 05 06 00 02
File 1: 02 03 00 00 00 01 00 04 07 00 03 00 05 08 00 05 00 05 09 00 02

Theory: The indices are stored as 16-bit numbers to save space. I therefore grouped together the values that matched the expected numbers with the 0x00 after them.

File 0: 02 03 00 00 00 0100 04 0200 0300 05 0400 0500 05 0600 02
File 1: 02 03 00 00 00 0100 04 0700 0300 05 0800 0500 05 0900 02
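As a quick sanity check of the 16-bit theory, the two byte patterns can be read back with Python's `struct` module. This is a sketch, not the code from this pull request: `string_indices` is a hypothetical helper, and the fixed positions (skip 5 bytes, then three groups of 5 bytes) are simply taken from the grouping of these two entries.

```python
import struct

FILE_0 = bytes.fromhex("02 03 00 00 00 01 00 04 02 00 03 00 05 04 00 05 00 05 06 00 02")
FILE_1 = bytes.fromhex("02 03 00 00 00 01 00 04 07 00 03 00 05 08 00 05 00 05 09 00 02")

def string_indices(entry: bytes) -> list[int]:
    """Read the six 16-bit little-endian values from one 0x15-byte entry."""
    indices = []
    pos = 5                   # skip the leading 0x02 and the four bytes after it
    for _ in range(3):        # three (uint16, byte, uint16) groups per entry
        key, _unknown, value = struct.unpack_from("<HBH", entry, pos)
        indices += [key, value]
        pos += 5
    return indices

file_0_indices = string_indices(FILE_0)
file_1_indices = string_indices(FILE_1)
print(file_0_indices)  # [1, 2, 3, 4, 5, 6]
print(file_1_indices)  # [1, 7, 3, 8, 5, 9]
```

Both results match the index lists predicted from the string list, which is what makes the little-endian 16-bit reading plausible.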

My final Theory is that a file entry is structured as follows:

byte 0x02 (purpose unknown)
uint32 NumberOfKeyValuePairs
KeyValuePair * NumberOfKeyValuePairs
byte 0x02 (purpose unknown)

where each KeyValuePair is structured like this:

uint16 String list index of the key
uint8 unknown - possibly the data type (0x04 between "filename" and "MasterBank.bank", 0x05 between "offset" and "54304")
uint16 String list index of the value
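The structure above could be parsed along these lines. This is a minimal sketch under stated assumptions, not the implementation from this pull request: `parse_entry` and `STRINGS` are hypothetical names, all integers are assumed little-endian, the closing 0x02 is assumed to appear once after the last pair (as the byte dumps suggest), and index 0 of the string list is filled with a placeholder because the printed list starts at 1.

```python
import struct

# String list from the post; index 0 is not shown there, so None keeps the numbering aligned.
STRINGS = [None, "filename", "MasterBank.strings.bank", "offset", "16", "size", "54282",
           "MasterBank.bank", "54304", "5416174", "guid", "b554baf88ff004c50cc0214575794b8c"]

def parse_entry(data: bytes, pos: int, strings: list) -> tuple[dict, int]:
    """Parse one file entry at `pos`; return ({key: (type_byte, value)}, next_pos)."""
    if data[pos] != 0x02:
        raise ValueError(f"expected 0x02 at offset {pos:#x}")
    (pair_count,) = struct.unpack_from("<I", data, pos + 1)
    pos += 5
    entry = {}
    for _ in range(pair_count):
        key_idx, type_byte, value_idx = struct.unpack_from("<HBH", data, pos)
        entry[strings[key_idx]] = (type_byte, strings[value_idx])
        pos += 5
    if data[pos] != 0x02:     # closing byte, purpose unknown
        raise ValueError(f"expected closing 0x02 at offset {pos:#x}")
    return entry, pos + 1

# The two entries quoted in the post, back to back.
blob = bytes.fromhex(
    "02 03 00 00 00 01 00 04 02 00 03 00 05 04 00 05 00 05 06 00 02"
    "02 03 00 00 00 01 00 04 07 00 03 00 05 08 00 05 00 05 09 00 02"
)
first, next_pos = parse_entry(blob, 0, STRINGS)
second, end_pos = parse_entry(blob, next_pos, STRINGS)
print(first)
print(second)
```

The values stay as strings here; whether the 0x05 byte really marks an integer (so that "offset" and "size" should be converted) is still a guess.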

I have implemented this method and the files that were previously garbled are now correct (for example Credits_en.txt from Weird.ggpack1a). This only applies to RtMI for now, as I have not yet tested whether or not Thimbleweed Park uses the same format.

Further questions:

Where is "guid" = "b554baf88ff004c50cc0214575794b8c" referenced in the File?
I suspect the 0x00000001 at offset 4 references the string "files", as this is otherwise not used.

(Sorry for spamming pull requests lately.)

JanFrederick00 commented 1 year ago

A particularly good example is the .lip files: many of them are the same size, so the string list does not contain duplicate entries for their size values. I think this should also fix the graphics that couldn't be decompressed previously (but not the ones that appear blank).

bgbennyboy commented 1 year ago

Please don't call this spam, it's brilliant! My reading of the file records for Thimbleweed was always wonky. The idea that some entries were missing was very dodgy and my 'temporary' solution was a massive hack. I think that others have since figured out the format completely but I never got around to going back and updating it. I'm sorry I'm not more proactively engaged in this; I'm really busy with work at the moment and I really appreciate the pull requests.

JanFrederick00 commented 1 year ago

Interestingly, these tools don't seem to work (at least the json tool does not) with RtMI's files. I tested my code with TWP's files, where a few extra bytes seem to be present somewhere in the header (I think there are two before the number of files in the dictionary - good thing this change only applies to RtMI).

JanFrederick00 commented 1 year ago

I have created a pull request on that other repo; it should now also be able to open RtMI's files. I was able to decode the .json files from RtMI - they seem to be created using TexturePacker (a URL to the website was included in the first file I tried). They seem to have changed the GGDict format so it uses 16-bit string indices, which is why my test with Thimbleweed Park's files failed yesterday.

bgbennyboy commented 1 year ago

Great job again :) I haven't had much time but I've got audio extraction with the .bank files working manually using the dumper I wrote for my Telltale programs. I'll hopefully add that in this weekend and then it's time to decide on a new name for the program. "Grumpy Explorer" or "Terrible Toolbox Explorer" are both possibilities.

bgbennyboy / Dinky-Explorer

Fixed parsing File offsets / sizes to make all files decode properly #4