BrianEnigma / BarbieExtract

Prototyping extracting wav files of names from a Barbie game

Python decoder implementation #1

Open ali1234 opened 2 months ago

ali1234 commented 2 months ago

Edit: complete decoder is here

The LST file contains two distinct sections. Each is a table of 20-byte, null-padded string records, and each table is preceded by a 32-bit LE integer giving the number of records. Thus the LST may be parsed with the following code:

import struct

lst = open('data/pinames.lst', 'rb')

n, = struct.unpack('<I', lst.read(4))

names = []

for i in range(n):
    name = lst.read(20)
    names.append(name.rstrip(b'\x00'))

n, = struct.unpack('<I', lst.read(4))
wpks = []

for i in range(n):
    wpk = lst.read(20)
    wpks.append(wpk.rstrip(b'\x00'))

print(names)
print(wpks)

The first table contains 50014 entries and seems to be a list of every name recognized by the program.

The second table contains 9342 entries. It is a shorter list of names, but each has the suffix .wpk. I suspect this is the list of names for which audio files exist.

There is no data about how names from the first table get mapped to the second table - there is literally no other data contained in this file. The above parser never seeks - it reads the indicated number of records, and ends up exactly at the end of the file.
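A quick way to verify that is to append a check to the snippet above (the lst handle is still open at that point):

# After the second table the parser should be sitting exactly at EOF,
# so any further read returns an empty bytes object.
trailing = lst.read()
assert trailing == b'', f"unexpected {len(trailing)} trailing bytes in pinames.lst"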

Googling "wpk audio format" surfaces the following piece of software: https://github.com/Morilli/bnk-extract - this looks like exactly the right type of thing. It is a tool for unpacking game audio packs...

ali1234 commented 2 months ago

The HIX file begins with a LE int count, followed by three tables, each with that many entries.

The first table contains two-byte strings, which appear to be the first two characters of each name.

The second is a table of increasing LE ints, which appear to be offsets into the rest of the file.

The third seems to be a count of how many names begin with those two letters. Summing this table gives 50014.

Therefore this is an index/jump table to speed up searching for a name in the rest of the file.

Code:

import struct

hix = open('data/pinames.hix', 'rb')

n, = struct.unpack('<I', hix.read(4))
print(n)
digrams = []
for i in range(n):
    digrams.append(hix.read(2))

print(digrams)

offset = []
for i in range(n):
    offset.append(struct.unpack('<I', hix.read(4))[0])

count = []

for i in range(n):
    count.append(struct.unpack('<I', hix.read(4))[0])

for x, y, z in zip(digrams, offset, count):
    print(x, y, z)

print(sum(count))
marksteward commented 2 months ago

Yep, this is the first thing that struck me too. There are files in both formats:

-rw-r--r-- 1 mark mark 18794 May 17  1998 CREAK.WAV
-rw-r--r-- 1 mark mark  6356 Aug  3  1999 CREAK.wpk
-rw-r--r-- 1 mark mark 18794 May 17  1998 CREAK2.WAV
-rw-r--r-- 1 mark mark  2669 Aug  3  1999 CREAK2.wpk
-rw-r--r-- 1 mark mark 12964 May 17  1998 CREAK3.WAV
-rw-r--r-- 1 mark mark  3802 Aug  3  1999 CREAK3.wpk

Bink sounds likely as the format, as there's a library called BINKW32.DLL that exports _BinkAudioDecompress@20.

Unfortunately, bnk-extract won't work because it requires a magic of "r3d2", while we have (e.g.):

00000000: 0001 ff03 fd06 fa0a f60f f115 eb1c e47f  ................
00000010: 2c00 0000                                ,...

libavcodec supports some Bink stuff too, including one where the last letter of the version is b.

Looking at the MD5 sums of https://github.com/vogonsorg/radgametools/tree/main/binkw32, we have the 0.8a version of binkw32.dll, which is the first public release. The game uses _BinkOpen@8 to play the .bik files in DATA, and these still sort of play with the latest version of RAD Video Tools.

ali1234 commented 2 months ago

"Bink Audio" is definitely a format I have heard of before.

ali1234 commented 2 months ago

The remainder of the HIX file contains 28-byte records: 20 bytes of null-padded string, followed by 8 bytes of unknown data.

The index table contains 295 entries, each 2 + 4 + 4 bytes. The first few entries look like:

b"a'" 2954 14
b'aa' 3346 13
b'ab' 3710 102
b'ac' 6566 8

Notice that 2954 = (295 × 10) + 4, so it is the exact offset where those records start. Then 3346 = 2954 + (28 × 14), which is the offset of the first record beginning with 'aa'. The count field is redundant since we can calculate it from the record size. :)

Once again the parser lands exactly on the end of the file with the following records:

(b'zyzie', b'(#r\x04k(\x00\x00')
(b'zyzy', b'(#r\x04k(\x00\x00')

So there is no other data in this file except the 8 unknown bytes per name record.
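Putting that together, the lookup this index presumably enables can be sketched like this (my own guess at the intended usage, not code recovered from the game):

import struct

def find_record(name, path='data/pinames.hix'):
    with open(path, 'rb') as hix:
        n, = struct.unpack('<I', hix.read(4))
        digrams = [hix.read(2) for _ in range(n)]
        offsets = struct.unpack(f'<{n}I', hix.read(4 * n))
        counts = struct.unpack(f'<{n}I', hix.read(4 * n))
        try:
            # jump straight to the block of records sharing the first two letters
            i = digrams.index(name[:2])
        except ValueError:
            return None
        hix.seek(offsets[i])
        for _ in range(counts[i]):
            record = hix.read(28)
            if record[:20].rstrip(b'\x00') == name:
                return record[20:]  # the 8 unknown bytes
    return None

print(find_record(b'zara'))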

ali1234 commented 2 months ago

Interpreting the unknown 8 bytes as a pair of LE ints gives this:

(b"a'leeshan", 0, 9136)
(b"a'leeshanh", 0, 9136)
(b"a'leeshann", 0, 9136)
(b"a'leeshon", 0, 9136)
(b"a'lelia", 9136, 8987)
(b"a'leliah", 9136, 8987)
(b"a'lishan", 0, 9136)
(b"a'lishanh", 0, 9136)
(b"a'lishann", 0, 9136)
(b"a'lishon", 0, 9136)
...
(b'zyporah', 74563130, 7827)
(b'zyra', 74570957, 9422)
(b'zyrah', 74570957, 9422)
(b'zyta', 74670735, 8898)
(b'zytah', 74670735, 8898)
(b'zyvah', 74580379, 8589)
(b'zyzey', 74588968, 10347)
(b'zyzi', 74588968, 10347)
(b'zyzie', 74588968, 10347)
(b'zyzy', 74588968, 10347)

Two things stand out here:

  1. The first value mostly increases, and 74588968 is extremely close to the length of the HUG file, while the second value is always small. This looks like an offset and a length; for example, a'leeshon appears to end exactly where a'lelia starts.

  2. Some similar-looking names have the same values, so presumably homophones are handled here and not in the LST file.

The highest (offset, length) pair seen in the file does not occur at the end; it is (74979189, 7078), and 74979189 + 7078 = 74986267, the exact length of the HUG file. Analysis shows there are no gaps: every part of the HUG file is referred to by at least one name record.

The number of unique offsets in the file is 9342, the same as the number of wpk records in the LST file. My guess is therefore that the LST file is an artifact of the tool used to create the HUG and HIX (Hug IndeX?), that the wpk files were the original audio files used to build it, and that the wpk records can be considered the canonical spelling for each homophone.
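The no-gaps claim is straightforward to re-check. A minimal sketch, where positions is a list of the (offset, length) pairs pulled from the name records and hug_len is the size of pinames.hug (both names are mine):

spans = sorted(set(positions))
print(len(spans))  # 9342 unique (offset, length) pairs

end = 0
for offset, length in spans:
    assert offset <= end, f'gap in HUG before offset {offset}'
    end = max(end, offset + length)
assert end == hug_len  # 74986267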

ali1234 commented 2 months ago

Matching all the data extracted so far gives us a list of (HUG offset, length), (canonical name, [homophones, ...]) entries.

This is the beginning and end of it:

(0, 9136) (b"a'lishan.wpk", [b"a'leeshan", b"a'leeshanh", b"a'leeshann", b"a'leeshon", b"a'lishan", b"a'lishanh", b"a'lishann", b"a'lishon", b"a'lyshan", b"a'lyshanh", b"a'lyshann", b"a'lyshon", b'aleeshan', b'aleeshanh', b'aleeshann', b'aleeshon', b'alishan', b'alishanh', b'alishann', b'alishon', b'alyshan', b'alyshanh', b'alyshann', b'alyshon'])
(9136, 8987) (b"a'lelia.wpk", [b"a'lelia", b"a'leliah", b'alelia', b'aleliah'])
(18123, 7354) (b'aliyah.wpk', [b'aaleeyah', b'aaleyah', b'aaliyah', b'aleeah', b'aleeya', b'aleeyah', b'alia', b'aliah', b'alieah', b'aliya', b'aliyah', b'alyyah'])
(25477, 7007) (b'aarica.wpk', [b'aarica', b'aarika', b'aaryca', b'erica', b'ericah', b'ericka', b'erickah', b'erika', b'erikah', b'errica', b'eryca', b'erycah', b'erycka', b'eryckah', b'eryka', b'erykah'])
...
(74952752, 8845) (b'zuri.wpk', [b'zurey', b'zuri', b'zurie', b'zury'])
(74961597, 11118) (b'zuzia.wpk', [b'zuzia', b'zuziah'])
(74972715, 6474) (b'zykia.wpk', [b'zykia', b'zykiah'])
(74979189, 7078) (b'zylpha.wpk', [b'zylpha', b'zylphah'])
ali1234 commented 2 months ago

The following code parses LST and HIX, then fetches the raw data from the HUG and plays it. You may optionally supply a name on the command line - otherwise the default is zara. It does not decode the audio, so it sounds terrible (volume warning). It does sound like it is fetching the right names, so all that remains is to fill in the pcm function with a correct decoding.

import pathlib
import struct
from collections import defaultdict

class HUG:
    def __init__(self, path):
        self._path = pathlib.Path(path)
        _, wpks = self.read_lst()
        _, _, _, hugmap = self.read_hix()
        self._positions = {}
        reverse_tmp = {}
        for (offset, length), canonical_name in zip(sorted(hugmap.keys()), wpks):
            self._positions[canonical_name[:-4]] = (offset, length)
            reverse_tmp[(offset, length)] = canonical_name[:-4]

        self._names = {}
        self._homophones = defaultdict(list)
        for (offset, length), names in hugmap.items():
            for name in names:
                self._names[name] = reverse_tmp[offset, length]
                self._homophones[reverse_tmp[offset, length]].append(name)

        self._hug_data = open(self._path / 'pinames.hug', 'rb').read()

    def canonical(self, name):
        return self._names[name]

    def position(self, name):
        return self._positions[self.canonical(name)]

    def homophones(self, name):
        return self._homophones[self.canonical(name)]

    def raw_data(self, name):
        offset, length = self.position(name)
        return self._hug_data[offset:offset+length]

    def pcm(self, name):
        # TODO: decode it!
        return self.raw_data(name)

    def play(self, name):
        import miniaudio, time
        with miniaudio.PlaybackDevice(output_format=miniaudio.SampleFormat.UNSIGNED8,
                                      nchannels=1, sample_rate=22050,
                                      buffersize_msec=500) as device:
            pcm = self.pcm(name)
            stream = miniaudio.stream_raw_pcm_memory(pcm, 1, 1)
            device.start(stream)
            time.sleep((len(pcm) + 1000) / 22050)

    def read_lst(self):
        with open(self._path / 'pinames.lst', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            names = [name.rstrip(b'\x00') for name in struct.unpack("20s" * n, lst.read(20 * n))]
            n, = struct.unpack('<I', lst.read(4))
            wpks = [wpk.rstrip(b'\x00') for wpk in struct.unpack("20s" * n, lst.read(20 * n))]
            return names, wpks

    def read_hix(self):
        with open(self._path / 'pinames.hix', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            # we don't care about this data
            digrams = struct.unpack("2s" * n, lst.read(2 * n))
            offsets = struct.unpack(f'<{n}I', lst.read(4 * n))
            counts = struct.unpack(f'<{n}I', lst.read(4 * n))
            total = sum(counts)
            hugmap = defaultdict(list)
            for i in range(total):
                name, hug_offset, hug_length = struct.unpack('<20sII', lst.read(28))
                hugmap[hug_offset, hug_length].append(name.rstrip(b'\x00'))
            return digrams, offsets, counts, hugmap

    def dump_all(self):
        for name, (offset, length) in sorted(self._positions.items(), key=lambda x: x[1]):
            print(offset, length, name, self.homophones(name))

if __name__ == '__main__':
    import sys
    h = HUG('data/')
    try:
        name = sys.argv[1].encode('latin1') # what is the true encoding?
    except IndexError:
        name = b'zara'
    h.play(name)
    print(h.canonical(name))
    print(h.homophones(name))
    print(h.position(name))
ali1234 commented 2 months ago

Inspecting the actual binary blobs, every one begins the same way. The RIFF header is not at the beginning of the data; there are 20 bytes before it, like this:

00000000  00 01 ff 03 fd 06 fa 0a  f6 0f f1 15 eb 1c e4 7f  |................|
00000010  2c 00 00 00 52 49 46 46  58 40 00 00 57 41 56 45  |,...RIFFX@..WAVE|

This looks like eight LE halfwords, increasing monotonically, followed by an int, which would decode as:

(256, 1023, 1789, 2810, 4086, 5617, 7403, 32740, 44)

These values are identical for every file. This looks like a compression wrapper to me.
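For reference, that prefix can be pulled apart with a single unpack (raw here is the blob's bytes; what the fields mean is still a guess):

import struct

# Eight LE 16-bit values followed by a 32-bit length-like int.
prefix = struct.unpack('<8hI', raw[:20])
print(prefix)  # (256, 1023, 1789, 2810, 4086, 5617, 7403, 32740, 44)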

marksteward commented 2 months ago

Yep, I figured the same. It looks like they might actually include the RIFF header in theirs, making it 0x42 bytes or so before the compressed data starts.

The decompression function is at 0x475460 in d2.exe, and looks to be char* DecompressAudio(char* in, uint32_t* outlen). I don't have time to test that tonight but anyone else is welcome to.

I wonder if this is some sort of early internally-shared static library from RAD Game Tools, earlier than any public tooling supports.

ali1234 commented 2 months ago

Okay, I grabbed the full ISO and can confirm that the blobs in the HUG are wpk format, the same as the ones in the AUDIO subdirectory. The first 20 bytes are identical on all files. We have compressed and uncompressed versions of the same files, so figuring out the compression shouldn't be too hard...

marksteward commented 2 months ago

I suspect it will be hard; it's a proprietary format. It might ultimately look similar to the Bink audio format supported by libavcodec, but it's a bunch of work to confirm that.

It'll be much easier to just call the code in d2 directly. The info above should be enough to give that a go.

ali1234 commented 2 months ago

Here's what I discovered so far:

ali1234 commented 2 months ago

I have the decoder working now. It is partly based on a decompilation, so it is a bit rough. It seems like WPK is actually lossy and rounds the samples to 7 bits: decoding CREAK.WPK does not produce exactly the original WAV, but it is very close. The names all sound good anyway.

The final thing left to do is figure out the string encoding for the handful of names that have non-ascii characters.

Here is a complete tool to list, extract, or play any name. It depends on click and numpy, and optionally on miniaudio and tqdm.

Run barbie.py --help

#!/usr/bin/env python3

import pathlib
import struct
from collections import defaultdict

import numpy as np

class HUG:
    def __init__(self, path):
        self._path = pathlib.Path(path)
        _, wpks = self.read_lst()
        _, _, _, hugmap = self.read_hix()
        self._positions = {}
        reverse_tmp = {}
        for (offset, length), canonical_name in zip(sorted(hugmap.keys()), wpks):
            self._positions[canonical_name[:-4]] = (offset, length)
            reverse_tmp[(offset, length)] = canonical_name[:-4]

        self._names = {}
        self._homophones = defaultdict(list)
        for (offset, length), names in hugmap.items():
            for name in names:
                self._names[name] = reverse_tmp[offset, length]
                self._homophones[reverse_tmp[offset, length]].append(name)

        self._hug_data = open(self._path / 'pinames.hug', 'rb').read()

    def canonical(self, name):
        return self._names[name]

    def position(self, name):
        return self._positions[self.canonical(name)]

    def homophones(self, name):
        return self._homophones[self.canonical(name)]

    def raw_data(self, name):
        offset, length = self.position(name)
        return self._hug_data[offset:offset+length]

    def wav(self, name):
        raw = self.raw_data(name)
        table = struct.unpack("<16b", raw[:16])
        header_len, = struct.unpack("<I", raw[16:20])
        header = raw[20:20+header_len]
        assert len(header) == 44
        orig_len, = struct.unpack('<I', header[-4:])
        compressed_len, = struct.unpack('<I', raw[20 + header_len:20 + header_len + 4])
        compressed = raw[20 + header_len + 4:]
        original = np.empty((orig_len + 0x80,), dtype=np.uint8)
        original[:] = 0x80 # fill with silence
        assert len(compressed) == compressed_len
        bVar8 = 0x80
        local_14c = 0x80
        orig_pos = 0
        comp_pos = 0
        if compressed_len > 0:
            while True:
                if orig_pos >= orig_len:
                    break
                bVar3 = int(compressed[comp_pos] >> 4)
                if bVar3 == 0xf:
                    bVar3 = (compressed[comp_pos] << 4) & 0xff
                    iVar12 = comp_pos + 1
                    if (bVar3 != 0xf0) or ((compressed[iVar12] & 0xf0) != 0xf0):
                        bVar8 = bVar3 | compressed[iVar12] >> 4
                        original[orig_pos] = bVar8
                        local_14c = bVar8
                        orig_pos += 1
                    else:
                        bVar3 = ((compressed[iVar12] << 4) & 0xff) | (compressed[comp_pos + 2] >> 4)
                        iVar12 = comp_pos + 2
                        if bVar3 != 0:
                            original[orig_pos:orig_pos+bVar3] = bVar8
                            orig_pos += bVar3
                            bVar8 = local_14c
                else:
                    iVar2 = int(local_14c) + table[bVar3]
                    iVar12 = comp_pos
                    if iVar2 < 0x100:
                        if iVar2 < 0:
                            bVar8 = 0
                            local_14c = 0
                            original[orig_pos] = 0
                        else:
                            bVar8 = int(bVar8) + table[bVar3]
                            local_14c = bVar8
                            original[orig_pos] = bVar8
                    else:
                        bVar8 = 0xff
                        local_14c = 0xff
                        original[orig_pos] = 0xff
                    orig_pos += 1

                if (compressed[iVar12] & 0xf) == 0xf:
                    comp_pos = iVar12 + 1
                    bVar3 = compressed[comp_pos]
                    if bVar3 != 0xff:
                        original[orig_pos] = bVar3
                        local_14c = bVar3
                        orig_pos += 1
                        bVar8 = bVar3
                    else:
                        comp_pos = iVar12 + 2
                        bVar3 = compressed[comp_pos]
                        if bVar3 != 0:
                            original[orig_pos:orig_pos+bVar3] = bVar8
                            orig_pos += int(bVar3)
                            bVar8 = local_14c
                else:
                    comp_pos = iVar12
                    iVar2 = local_14c + table[compressed[comp_pos] & 0xf]
                    if iVar2 < 0x100:
                        if iVar2 < 0:
                            bVar3 = 0
                            local_14c = 0
                            original[orig_pos] = 0
                        else:
                            bVar3 = bVar8 + table[compressed[iVar12] & 0xf]
                            local_14c = bVar3
                            original[orig_pos] = bVar3
                    else:
                        bVar3 = 0xff
                        local_14c = 0xff
                        original[orig_pos] = 0xff
                    orig_pos += 1
                    bVar8 = bVar3
                comp_pos += 1
                if comp_pos >= compressed_len:
                    break

        return header, original

    def play(self, name):
        import miniaudio, time
        header, pcm = self.wav(name)
        with miniaudio.PlaybackDevice(output_format=miniaudio.SampleFormat.UNSIGNED8,
                                      nchannels=1, sample_rate=22050,
                                      buffersize_msec=500) as device:
            stream = miniaudio.stream_raw_pcm_memory(pcm, 1, 1)
            device.start(stream)
            time.sleep((len(pcm) + 1000) / 22050)

    def read_lst(self):
        with open(self._path / 'pinames.lst', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            names = [name.rstrip(b'\x00') for name in struct.unpack("20s" * n, lst.read(20 * n))]
            n, = struct.unpack('<I', lst.read(4))
            wpks = [wpk.rstrip(b'\x00') for wpk in struct.unpack("20s" * n, lst.read(20 * n))]
            return names, wpks

    def read_hix(self):
        with open(self._path / 'pinames.hix', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            digrams = struct.unpack("2s" * n, lst.read(2 * n))
            offsets = struct.unpack(f'<{n}I', lst.read(4 * n))
            counts = struct.unpack(f'<{n}I', lst.read(4 * n))
            total = sum(counts)
            hugmap = defaultdict(list)
            for i in range(total):
                name, hug_offset, hug_length = struct.unpack('<20sII', lst.read(28))
                hugmap[hug_offset, hug_length].append(name.rstrip(b'\x00'))
            return digrams, offsets, counts, hugmap

    @property
    def names(self):
        return sorted(self._positions.keys())

if __name__ == '__main__':
    import click
    default_encoding = "latin1"

    @click.group()
    def barbie():
        pass

    @barbie.command()
    @click.argument("name", type=str, default='zara')
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    def play(name, data, encoding):
        """Play a name using miniaudio"""
        print(name, data)
        name = name.encode(encoding)
        h = HUG(data)
        print('Canonical:', h.canonical(name).decode(encoding))
        print('Homophones:', ', '.join(n.decode(encoding) for n in h.homophones(name)))
        h.play(name)

    @barbie.command()
    @click.argument("name", type=str, default='zara')
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    def lookup(name, data, encoding):
        """Look up a name in the index"""
        print(name, data)
        name = name.encode(encoding)
        h = HUG(data)
        print('Canonical:', h.canonical(name))
        print('Homophones:', h.homophones(name))
        print('Position:', h.position(name))

    @barbie.command()
    @click.argument("outdir", type=click.Path(writable=True), default="out/")
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    @click.option("--wpk", is_flag=True, help="Dump compressed WPK files instead of WAV")
    def dump(outdir, data, encoding, wpk):
        """Dump all names to WAV files (or WPK)"""
        outdir = pathlib.Path(outdir)
        h = HUG(data)
        outdir.mkdir(parents=True, exist_ok=True)
        names = h.names
        try:
            from tqdm import tqdm
            names = tqdm(names)
        except ImportError:
            pass
        for name in names:
            fn = (outdir / name.decode(encoding))
            if wpk:
                fn.with_suffix('.wpk').write_bytes(h.raw_data(name))
            else:
                fn.with_suffix('.wav').write_bytes(b''.join(h.wav(name)))

    @barbie.command(name="list")
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    def _list(data, encoding):
        """List all names in the index"""
        h = HUG(data)
        for name in h.names:
            print(name.decode(encoding), '=', ', '.join(n.decode(encoding) for n in h.homophones(name)))

    barbie()
marksteward commented 2 months ago

That's a lot simpler than I expected! I didn't dig into the two calls the decompression function makes, but it looks like they're both memory allocation functions. I've also had a quick look and can't find any other code around this area, so it's probably not a static library but a custom compression function.

I'd suggest something like the following renames:

bVar3 -> c
bVar8 -> out_c
local_14c -> last_c
iVar2 -> sum_c
table -> codebook/dictionary
iVar12 -> next_pos

Doesn't help that much but they're kind of overloaded. I'd also swap the if < 0x100 logic around, so you have if sum_c >= 0x100: ... elif sum_c < 0: ....
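For illustration, with those renames the clamping step could read something like this (a sketch of the suggested restructure, assuming the surrounding loop still provides last_c, out_c, codebook, c, original and orig_pos; not tested against the game data):

# Compute the delta-decoded sample and clamp it to the 8-bit range.
sum_c = last_c + codebook[c]
if sum_c >= 0x100:            # clamp high
    out_c = last_c = 0xff
elif sum_c < 0:               # clamp low
    out_c = last_c = 0
else:
    out_c = last_c = out_c + codebook[c]
original[orig_pos] = out_c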

BrianEnigma commented 2 months ago

This is all fantastic work! I haven't fully mentally parsed it yet (and likely can't dig in deep until after work this evening), but would love to pull this into the repo.

kevinmarks commented 2 months ago

The final thing left to do is figure out the string encoding for the handful of names that have non-ascii characters.

I'd suggest using ftfy to work that out (it's probably win1252, but ftfy will confirm that)
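For example (a sketch; I believe the relevant call is ftfy.guess_bytes, which returns the decoded text together with the encoding it settled on, but treat that as an assumption):

import ftfy

# Try a couple of the raw names from the HIX (assumption: ftfy.guess_bytes
# exists and returns a (text, encoding) pair).
for raw_name in (b'd\x92laine', b'adoraci\xf3n'):
    text, encoding = ftfy.guess_bytes(raw_name)
    print(raw_name, '->', text, f'({encoding})')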

ali1234 commented 2 months ago

Here are all the non-ASCII names with cp1252 decoding. It looks plausible except for "cŽstorah", which is a version of "castora".

b'd\x92laine' d’laine
b'ni\xedmah' niímah
b'ny\xedmah' nyímah
b'o\xedkeeshiah' oíkeeshiah
b'o\xedkeshia' oíkeshia
b'c\x8estorah' cŽstorah
b'o\xedkeashia' oíkeashia
b'adoraci\xf3n' adoración
b'o\xedqueshiah' oíqueshiah
b'o\xedqueshia' oíqueshia
b'o\xedkeshiah' oíkeshiah
b'd\x92layne' d’layne
b'o\xedkieshia' oíkieshia
b'o\xedkeashiah' oíkeashiah
b'nee\xedmah' neeímah
b'd\x92juana' d’juana
b'o\xedkeeshia' oíkeeshia
b'o\xedkieshiah' oíkieshiah
Canonical: castora
Homophones: castora, castorah, cŽstorah, kastora, kastorah