The HIX file begins with a little-endian int giving a size, followed by three tables of that many entries.
The first table contains two-byte strings which appear to be the first two characters of each name.
The second is a table of increasing little-endian ints which appear to be offsets into the rest of the file.
The third seems to be a count of how many names begin with those two letters. Summing this table gives 50014.
Therefore this is an index/jump table to speed up searching for a name in the rest of the file.
Code:
import struct

hix = open('data/pinames.hix', 'rb')

n, = struct.unpack('<I', hix.read(4))
print(n)

digrams = []
for i in range(n):
    digrams.append(hix.read(2))
print(digrams)

offset = []
for i in range(n):
    offset.append(struct.unpack('<I', hix.read(4))[0])

count = []
for i in range(n):
    count.append(struct.unpack('<I', hix.read(4))[0])

for x, y, z in zip(digrams, offset, count):
    print(x, y, z)

print(sum(count))
Yep, this is the first thing that struck me too. There are files in both formats:
-rw-r--r-- 1 mark mark 18794 May 17 1998 CREAK.WAV
-rw-r--r-- 1 mark mark 6356 Aug 3 1999 CREAK.wpk
-rw-r--r-- 1 mark mark 18794 May 17 1998 CREAK2.WAV
-rw-r--r-- 1 mark mark 2669 Aug 3 1999 CREAK2.wpk
-rw-r--r-- 1 mark mark 12964 May 17 1998 CREAK3.WAV
-rw-r--r-- 1 mark mark 3802 Aug 3 1999 CREAK3.wpk
Bink sounds likely as the format, as there's a library called BINKW32.DLL that exports _BinkAudioDecompress@20.
Unfortunately, bnk-extract won't work because it requires a magic of "r3d2", while we have (e.g.):
00000000: 0001 ff03 fd06 fa0a f60f f115 eb1c e47f ................
00000010: 2c00 0000 ,...
libavcodec supports some Bink stuff too, including one where the last letter of the version is b.
Looking at the MD5 sums of https://github.com/vogonsorg/radgametools/tree/main/binkw32, we have the 0.8a version of binkw32.dll, which is the first public release. The game uses _BinkOpen@8 to play the .bik files in DATA, and these still sort of play with the latest version of RAD Video Tools.
"Bink Audio" is definitely a format I have heard of before.
The remainder of the HIX file contains 28-byte records: a 20-byte null-padded string, followed by 8 bytes of unknown data.
The index table contains 295 entries, each is 2 + 4 + 4 bytes. The first few entries look like:
b"a'" 2954 14
b'aa' 3346 13
b'ab' 3710 102
b'ac' 6566 8
Notice that 2954 = (295 × 10) + 4, so it is the exact offset where those records start. Then 3346 = 2954 + (28 × 14), so the offset of the first record beginning with 'aa'. The count field is redundant since we can calculate it from the next offset and the 28-byte record size. :)
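So a lookup never has to scan the whole name table. A rough sketch of how the index could be used (purely illustrative, not code from the game; it assumes the 28-byte record layout described above):

import struct

def find_record(hix_path, name):
    # Illustrative: jump straight to the block of records for a name's first two letters.
    with open(hix_path, 'rb') as hix:
        n, = struct.unpack('<I', hix.read(4))
        digrams = [hix.read(2) for _ in range(n)]
        offsets = struct.unpack(f'<{n}I', hix.read(4 * n))
        counts = struct.unpack(f'<{n}I', hix.read(4 * n))
        i = digrams.index(name[:2])          # index entry for the first two letters
        hix.seek(offsets[i])                 # absolute offset of that block of 28-byte records
        for _ in range(counts[i]):
            record = hix.read(28)
            if record[:20].rstrip(b'\x00') == name:
                return record[20:]           # the 8 unknown bytes for this name
    return None

# e.g. find_record('data/pinames.hix', b'zara')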
Once again the parser lands exactly on the end of the file with the following records:
(b'zyzie', b'(#r\x04k(\x00\x00')
(b'zyzy', b'(#r\x04k(\x00\x00')
So there is no other data in this file except the 8 unknown bytes per name record.
Interpreting the unknown 8 bytes as a pair of LE ints gives this:
(b"a'leeshan", 0, 9136)
(b"a'leeshanh", 0, 9136)
(b"a'leeshann", 0, 9136)
(b"a'leeshon", 0, 9136)
(b"a'lelia", 9136, 8987)
(b"a'leliah", 9136, 8987)
(b"a'lishan", 0, 9136)
(b"a'lishanh", 0, 9136)
(b"a'lishann", 0, 9136)
(b"a'lishon", 0, 9136)
...
(b'zyporah', 74563130, 7827)
(b'zyra', 74570957, 9422)
(b'zyrah', 74570957, 9422)
(b'zyta', 74670735, 8898)
(b'zytah', 74670735, 8898)
(b'zyvah', 74580379, 8589)
(b'zyzey', 74588968, 10347)
(b'zyzi', 74588968, 10347)
(b'zyzie', 74588968, 10347)
(b'zyzy', 74588968, 10347)
Two things stand out here:
The first value mostly increases, and 74588968 is extremely close to the length of the HUG file, while the second value is always small. This looks like an offset and length, and for example a'leeshon appears to end exactly where a'lelia starts.
Some similar looking names have the same values. So presumably this is where homophones are handled, and not the LST file.
The highest "offset, length" seen in the file does not occur at the end and is (74979189, 7078). 74979189 + 7078 = 74986267, the exact length of the HUG file. Analysis shows there are no gaps - every part of the hug file is referred to by at least one name record.
The number of unique offsets in the file is 9342, the same as the number of wpk records in the LST file. Therefore I would guess that the LST file is an artifact of the tool used to create the HUG and HIX (Hug IndeX?), and the wpk files are the original audio files used to build it. So the wpk records can be considered the canonical spelling for each homophone group.
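In case anyone wants to reproduce the gap analysis, a quick sketch (paths assumed; same record layout as above):

import os
import struct

def check_hug_coverage(hix_path, hug_path):
    # Sketch: confirm the unique (offset, length) regions tile the whole HUG with no gaps.
    with open(hix_path, 'rb') as hix:
        n, = struct.unpack('<I', hix.read(4))
        hix.seek(4 + n * 10)                  # skip the digram/offset/count tables (2+4+4 bytes each)
        regions = set()
        while True:
            rec = hix.read(28)
            if len(rec) < 28:
                break
            regions.add(struct.unpack('<II', rec[20:]))
    pos = 0
    for off, length in sorted(regions):
        if off != pos:
            print('gap or overlap at', off)
        pos = off + length
    print(len(regions), 'unique regions, ending at', pos,
          'of', os.path.getsize(hug_path), 'bytes')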
Matching all the data extracted so far gives us a list of (hug offset, length) -> (canonical name, [homophones, ...]) entries.
This is the beginning and end of it:
(0, 9136) (b"a'lishan.wpk", [b"a'leeshan", b"a'leeshanh", b"a'leeshann", b"a'leeshon", b"a'lishan", b"a'lishanh", b"a'lishann", b"a'lishon", b"a'lyshan", b"a'lyshanh", b"a'lyshann", b"a'lyshon", b'aleeshan', b'aleeshanh', b'aleeshann', b'aleeshon', b'alishan', b'alishanh', b'alishann', b'alishon', b'alyshan', b'alyshanh', b'alyshann', b'alyshon'])
(9136, 8987) (b"a'lelia.wpk", [b"a'lelia", b"a'leliah", b'alelia', b'aleliah'])
(18123, 7354) (b'aliyah.wpk', [b'aaleeyah', b'aaleyah', b'aaliyah', b'aleeah', b'aleeya', b'aleeyah', b'alia', b'aliah', b'alieah', b'aliya', b'aliyah', b'alyyah'])
(25477, 7007) (b'aarica.wpk', [b'aarica', b'aarika', b'aaryca', b'erica', b'ericah', b'ericka', b'erickah', b'erika', b'erikah', b'errica', b'eryca', b'erycah', b'erycka', b'eryckah', b'eryka', b'erykah'])
...
(74952752, 8845) (b'zuri.wpk', [b'zurey', b'zuri', b'zurie', b'zury'])
(74961597, 11118) (b'zuzia.wpk', [b'zuzia', b'zuziah'])
(74972715, 6474) (b'zykia.wpk', [b'zykia', b'zykiah'])
(74979189, 7078) (b'zylpha.wpk', [b'zylpha', b'zylphah'])
The following code parses LST and HIX, then fetches the raw data from the HUG and plays it. You may optionally supply a name on the command line - otherwise the default is zara. It does not decode the audio, so it sounds terrible (volume warning). It does sound like it is fetching the right names, so all that remains is to fill in the pcm function with a correct decoding.
import pathlib
import struct
from collections import defaultdict


class HUG:
    def __init__(self, path):
        self._path = pathlib.Path(path)
        _, wpks = self.read_lst()
        _, _, _, hugmap = self.read_hix()
        self._positions = {}
        reverse_tmp = {}
        for (offset, length), canonical_name in zip(sorted(hugmap.keys()), wpks):
            self._positions[canonical_name[:-4]] = (offset, length)
            reverse_tmp[(offset, length)] = canonical_name[:-4]
        self._names = {}
        self._homophones = defaultdict(list)
        for (offset, length), names in hugmap.items():
            for name in names:
                self._names[name] = reverse_tmp[offset, length]
                self._homophones[reverse_tmp[offset, length]].append(name)
        self._hug_data = open(self._path / 'pinames.hug', 'rb').read()

    def canonical(self, name):
        return self._names[name]

    def position(self, name):
        return self._positions[self.canonical(name)]

    def homophones(self, name):
        return self._homophones[self.canonical(name)]

    def raw_data(self, name):
        offset, length = self.position(name)
        return self._hug_data[offset:offset+length]

    def pcm(self, name):
        # TODO: decode it!
        return self.raw_data(name)

    def play(self, name):
        import miniaudio, time
        with miniaudio.PlaybackDevice(output_format=miniaudio.SampleFormat.UNSIGNED8,
                                      nchannels=1, sample_rate=22050,
                                      buffersize_msec=500) as device:
            pcm = self.pcm(name)
            stream = miniaudio.stream_raw_pcm_memory(pcm, 1, 1)
            device.start(stream)
            time.sleep((len(pcm) + 1000) / 22050)

    def read_lst(self):
        with open(self._path / 'pinames.lst', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            names = [name.rstrip(b'\x00') for name in struct.unpack("20s" * n, lst.read(20 * n))]
            n, = struct.unpack('<I', lst.read(4))
            wpks = [wpk.rstrip(b'\x00') for wpk in struct.unpack("20s" * n, lst.read(20 * n))]
        return names, wpks

    def read_hix(self):
        with open(self._path / 'pinames.hix', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            # we don't care about this data
            digrams = struct.unpack("2s" * n, lst.read(2 * n))
            offsets = struct.unpack(f'<{n}I', lst.read(4 * n))
            counts = struct.unpack(f'<{n}I', lst.read(4 * n))
            total = sum(counts)
            hugmap = defaultdict(list)
            for i in range(total):
                name, hug_offset, hug_length = struct.unpack('<20sII', lst.read(28))
                hugmap[hug_offset, hug_length].append(name.rstrip(b'\x00'))
        return digrams, offsets, counts, hugmap

    def dump_all(self):
        for name, (offset, length) in sorted(self._positions.items(), key=lambda x: x[1]):
            print(offset, length, name, self.homophones(name))


if __name__ == '__main__':
    import sys
    h = HUG('data/')
    try:
        name = sys.argv[1].encode('latin1')  # what is the true encoding?
    except IndexError:
        name = b'zara'
    h.play(name)
    print(h.canonical(name))
    print(h.homophones(name))
    print(h.position(name))
Inspecting the actual binary blobs, every one begins the same way. The RIFF header is not the beginning of the data; there are 20 bytes before it, like this:
00000000 00 01 ff 03 fd 06 fa 0a f6 0f f1 15 eb 1c e4 7f |................|
00000010 2c 00 00 00 52 49 46 46 58 40 00 00 57 41 56 45 |,...RIFFX@..WAVE|
This looks like 8 LE halves (16-bit values), increasing monotonically, and then an int, which would decode as:
(256, 1023, 1789, 2810, 4086, 5617, 7403, 32740, 44)
These values are identical for every file. This looks like a compression wrapper to me.
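For what it's worth, that reading unpacks cleanly with a plain struct call (the 20-byte prefix taken from the hex dump above):

import struct

prefix = bytes.fromhex('0001ff03fd06fa0af60ff115eb1ce47f2c000000')
print(struct.unpack('<8hI', prefix))   # -> (256, 1023, 1789, 2810, 4086, 5617, 7403, 32740, 44)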
Yep, I figured the same. It looks like they might actually include the RIFF header in theirs, making it 0x42 bytes or so before the compressed data starts.
The decompression function is at 0x475460 in d2.exe, and looks to be char* DecompressAudio(char* in, uint32_t* outlen). I don't have time to test that tonight but anyone else is welcome to.
I wonder if this is some sort of early internally-shared static library from RAD Game Tools, earlier than any public tooling supports.
Okay I grabbed the full ISO and I can confirm that the blobs in the HUG are wpk format, same as the ones in the AUDIO subdirectory. First 20 bytes are identical on all files. We have compressed and uncompressed versions of the same files so figuring out the compression shouldn't be too hard...
I suspect it will be; it's a proprietary format. It might ultimately look similar to the Bink audio format supported by libavcodec but it's a bunch of work to confirm that.
It'll be much easier to just call the code in d2 directly. The info above should be enough to give that a go.
Here's what I discovered so far. The first 16 bytes of each blob, read as signed bytes, are a table:
(0, 1, -1, 3, -3, 6, -6, 10, -10, 15, -15, 21, -21, 28, -28, 127)
The last value (127) is not used I think, because nibble 0xf is an escape.

I have the decoder working now. It is partly based on a decompile so it is a bit rough. It seems like WPK is actually lossy and rounds the samples to 7 bits. Decoding CREAK.WPK does not produce exactly the original WAV, but it is very close. The names all sound good anyway.
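In other words, the scheme is basically nibble-wise delta coding against that table, with 0xf reserved as an escape for literal bytes and run-lengths. A stripped-down sketch of the non-escape step only (illustrative; the full decoder below handles the escape cases):

def decode_nibbles(nibbles, table, start=0x80):
    # Sketch of the non-escape path only: each 4-bit code is a delta index,
    # added to the previous sample and clamped to the 0..255 range.
    out, prev = [], start
    for nib in nibbles:
        assert nib != 0xf                     # 0xf is the escape code, handled elsewhere
        prev = min(max(prev + table[nib], 0), 0xff)
        out.append(prev)
    return out

table = (0, 1, -1, 3, -3, 6, -6, 10, -10, 15, -15, 21, -21, 28, -28, 127)
print(decode_nibbles([1, 3, 5, 2, 4], table))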
The final thing left to do is figure out the string encoding for the handful of names that have non-ascii characters.
Here is a complete tool to list, extract, or play any name. Depends on click, numpy, and optionally miniaudio, tqdm.
Run barbie.py --help
#!/usr/bin/env python3
import pathlib
import struct
from collections import defaultdict

import numpy as np


class HUG:
    def __init__(self, path):
        self._path = pathlib.Path(path)
        _, wpks = self.read_lst()
        _, _, _, hugmap = self.read_hix()
        self._positions = {}
        reverse_tmp = {}
        for (offset, length), canonical_name in zip(sorted(hugmap.keys()), wpks):
            self._positions[canonical_name[:-4]] = (offset, length)
            reverse_tmp[(offset, length)] = canonical_name[:-4]
        self._names = {}
        self._homophones = defaultdict(list)
        for (offset, length), names in hugmap.items():
            for name in names:
                self._names[name] = reverse_tmp[offset, length]
                self._homophones[reverse_tmp[offset, length]].append(name)
        self._hug_data = open(self._path / 'pinames.hug', 'rb').read()

    def canonical(self, name):
        return self._names[name]

    def position(self, name):
        return self._positions[self.canonical(name)]

    def homophones(self, name):
        return self._homophones[self.canonical(name)]

    def raw_data(self, name):
        offset, length = self.position(name)
        return self._hug_data[offset:offset+length]
    def wav(self, name):
        raw = self.raw_data(name)
        table = struct.unpack("<16b", raw[:16])
        header_len, = struct.unpack("<I", raw[16:20])
        header = raw[20:20+header_len]
        assert len(header) == 44
        orig_len, = struct.unpack('<I', header[-4:])
        compressed_len, = struct.unpack('<I', raw[20 + header_len:20 + header_len + 4])
        compressed = raw[20 + header_len + 4:]
        original = np.empty((orig_len + 0x80,), dtype=np.uint8)
        original[:] = 0x80  # fill with silence
        assert len(compressed) == compressed_len
        bVar8 = 0x80
        local_14c = 0x80
        orig_pos = 0
        comp_pos = 0
        if compressed_len > 0:
            while True:
                if orig_pos >= orig_len:
                    break
                bVar3 = int(compressed[comp_pos] >> 4)  # high nibble of the current byte
                if bVar3 == 0xf:  # escape code
                    bVar3 = (compressed[comp_pos] << 4) & 0xff
                    iVar12 = comp_pos + 1
                    if (bVar3 != 0xf0) or ((compressed[iVar12] & 0xf0) != 0xf0):
                        # literal sample split across two nibbles
                        bVar8 = bVar3 | compressed[iVar12] >> 4
                        original[orig_pos] = bVar8
                        local_14c = bVar8
                        orig_pos += 1
                    else:
                        # double escape: run of repeated samples
                        bVar3 = ((compressed[iVar12] << 4) & 0xff) | (compressed[comp_pos + 2] >> 4)
                        iVar12 = comp_pos + 2
                        if bVar3 != 0:
                            original[orig_pos:orig_pos+bVar3] = bVar8
                            orig_pos += bVar3
                            bVar8 = local_14c
                else:
                    # delta-code the high nibble against the table, clamped to 0..255
                    iVar2 = int(local_14c) + table[bVar3]
                    iVar12 = comp_pos
                    if iVar2 < 0x100:
                        if iVar2 < 0:
                            bVar8 = 0
                            local_14c = 0
                            original[orig_pos] = 0
                        else:
                            bVar8 = int(bVar8) + table[bVar3]
                            local_14c = bVar8
                            original[orig_pos] = bVar8
                    else:
                        bVar8 = 0xff
                        local_14c = 0xff
                        original[orig_pos] = 0xff
                    orig_pos += 1
                if (compressed[iVar12] & 0xf) == 0xf:  # low nibble is an escape
                    comp_pos = iVar12 + 1
                    bVar3 = compressed[comp_pos]
                    if bVar3 != 0xff:
                        # literal sample byte
                        original[orig_pos] = bVar3
                        local_14c = bVar3
                        orig_pos += 1
                        bVar8 = bVar3
                    else:
                        # run of repeated samples
                        comp_pos = iVar12 + 2
                        bVar3 = compressed[comp_pos]
                        if bVar3 != 0:
                            original[orig_pos:orig_pos+bVar3] = bVar8
                            orig_pos += int(bVar3)
                            bVar8 = local_14c
                else:
                    # delta-code the low nibble, same clamping as above
                    comp_pos = iVar12
                    iVar2 = local_14c + table[compressed[comp_pos] & 0xf]
                    if iVar2 < 0x100:
                        if iVar2 < 0:
                            bVar3 = 0
                            local_14c = 0
                            original[orig_pos] = 0
                        else:
                            bVar3 = bVar8 + table[compressed[iVar12] & 0xf]
                            local_14c = bVar3
                            original[orig_pos] = bVar3
                    else:
                        bVar3 = 0xff
                        local_14c = 0xff
                        original[orig_pos] = 0xff
                    orig_pos += 1
                    bVar8 = bVar3
                comp_pos += 1
                if comp_pos >= compressed_len:
                    break
        return header, original
    def play(self, name):
        import miniaudio, time
        header, pcm = self.wav(name)
        with miniaudio.PlaybackDevice(output_format=miniaudio.SampleFormat.UNSIGNED8,
                                      nchannels=1, sample_rate=22050,
                                      buffersize_msec=500) as device:
            stream = miniaudio.stream_raw_pcm_memory(pcm, 1, 1)
            device.start(stream)
            time.sleep((len(pcm) + 1000) / 22050)

    def read_lst(self):
        with open(self._path / 'pinames.lst', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            names = [name.rstrip(b'\x00') for name in struct.unpack("20s" * n, lst.read(20 * n))]
            n, = struct.unpack('<I', lst.read(4))
            wpks = [wpk.rstrip(b'\x00') for wpk in struct.unpack("20s" * n, lst.read(20 * n))]
        return names, wpks

    def read_hix(self):
        with open(self._path / 'pinames.hix', 'rb') as lst:
            n, = struct.unpack('<I', lst.read(4))
            digrams = struct.unpack("2s" * n, lst.read(2 * n))
            offsets = struct.unpack(f'<{n}I', lst.read(4 * n))
            counts = struct.unpack(f'<{n}I', lst.read(4 * n))
            total = sum(counts)
            hugmap = defaultdict(list)
            for i in range(total):
                name, hug_offset, hug_length = struct.unpack('<20sII', lst.read(28))
                hugmap[hug_offset, hug_length].append(name.rstrip(b'\x00'))
        return digrams, offsets, counts, hugmap

    @property
    def names(self):
        return sorted(self._positions.keys())
if __name__ == '__main__':
    import click

    default_encoding = "latin1"

    @click.group()
    def barbie():
        pass

    @barbie.command()
    @click.argument("name", type=str, default='zara')
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    def play(name, data, encoding):
        """Play a name using miniaudio"""
        print(name, data)
        name = name.encode(encoding)
        h = HUG(data)
        print('Canonical:', h.canonical(name).decode(encoding))
        print('Homophones:', ', '.join(n.decode(encoding) for n in h.homophones(name)))
        h.play(name)

    @barbie.command()
    @click.argument("name", type=str, default='zara')
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    def lookup(name, data, encoding):
        """Look up a name in the index"""
        print(name, data)
        name = name.encode(encoding)
        h = HUG(data)
        print('Canonical:', h.canonical(name))
        print('Homophones:', h.homophones(name))
        print('Position:', h.position(name))

    @barbie.command()
    @click.argument("outdir", type=click.Path(writable=True), default="out/")
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    @click.option("--wpk", is_flag=True, help="Dump compressed WPK files instead of WAV")
    def dump(outdir, data, encoding, wpk):
        """Dump all names to WAV files (or WPK)"""
        outdir = pathlib.Path(outdir)
        h = HUG(data)
        outdir.mkdir(parents=True, exist_ok=True)
        names = h.names
        try:
            from tqdm import tqdm
            names = tqdm(names)
        except ImportError:
            pass
        for name in names:
            fn = (outdir / name.decode(encoding))
            if wpk:
                fn.with_suffix('.wpk').write_bytes(h.raw_data(name))
            else:
                fn.with_suffix('.wav').write_bytes(b''.join(h.wav(name)))

    @barbie.command(name="list")
    @click.option("-d", "--data", type=click.Path(exists=True), default="data/")
    @click.option("-e", "--encoding", type=str, default=default_encoding)
    def _list(data, encoding):
        """List all names in the index"""
        h = HUG(data)
        for name in h.names:
            print(name.decode(encoding), '=', ', '.join(n.decode(encoding) for n in h.homophones(name)))

    barbie()
That's a lot simpler than I expected! I didn't dig into the two calls the decompress function makes, but it looks like they're both memory allocation functions. I've also had a quick look and can't find any other code around this area, so it's probably not a static library but a custom compression function.
I'd suggest something like the following renames:
bVar3 -> c
bVar8 -> out_c
local_14c -> last_c
iVar2 -> sum_c
table -> codebook/dictionary
iVar12 -> next_pos
Doesn't help that much but they're kind of overloaded. I'd also swap the if < 0x100 logic around, so you have if sum_c >= 0x100: ... elif sum_c < 0: ...
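Something like this for the clamping branch, with those names applied (just an illustration of the suggested restructuring, not drop-in code):

def decode_step(out_c, last_c, nibble, codebook):
    # Illustrative restructuring of the clamp logic with the suggested renames.
    sum_c = last_c + codebook[nibble]
    if sum_c >= 0x100:            # overflow: clamp to the max 8-bit sample
        out_c = last_c = 0xff
    elif sum_c < 0:               # underflow: clamp to zero
        out_c = last_c = 0
    else:
        out_c = last_c = out_c + codebook[nibble]
    return out_c, last_c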
This is all fantastic work! I haven't fully mentally parsed it yet (and likely can't dig in deep until after work this evening), but would love to pull this into the repo.
The final thing left to do is figure out the string encoding for the handful of names that have non-ascii characters.
I'd suggest using ftfy to work that out (it's probably win1252, but ftfy will confirm that).
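Something along these lines, reusing the HUG class from above (a sketch; ftfy.guess_bytes does the encoding guess):

import ftfy

h = HUG('data/')
for name in h._names:                        # every spelling in the index
    if any(b > 0x7f for b in name):          # only names with non-ASCII bytes
        text, encoding = ftfy.guess_bytes(name)
        print(name, name.decode('cp1252'), encoding)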
Here's all the non-ascii names with cp1252 decoding. It looks plausible except for "cŽstorah", which is a version of "castora":
b'd\x92laine' d’laine
b'ni\xedmah' niímah
b'ny\xedmah' nyímah
b'o\xedkeeshiah' oíkeeshiah
b'o\xedkeshia' oíkeshia
b'c\x8estorah' cŽstorah
b'o\xedkeashia' oíkeashia
b'adoraci\xf3n' adoración
b'o\xedqueshiah' oíqueshiah
b'o\xedqueshia' oíqueshia
b'o\xedkeshiah' oíkeshiah
b'd\x92layne' d’layne
b'o\xedkieshia' oíkieshia
b'o\xedkeashiah' oíkeashiah
b'nee\xedmah' neeímah
b'd\x92juana' d’juana
b'o\xedkeeshia' oíkeeshia
b'o\xedkieshiah' oíkieshiah
Canonical: castora
Homophones: castora, castorah, cŽstorah, kastora, kastorah
Edit: complete decoder is here
The LST file contains two distinct sections. Each is a table of 20-byte string records, null padded. Each table is preceded by a 32-bit little-endian integer indicating the number of records. Thus the LST may be parsed with the following code:
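import struct

lst = open('data/pinames.lst', 'rb')

n, = struct.unpack('<I', lst.read(4))
names = [name.rstrip(b'\x00') for name in struct.unpack("20s" * n, lst.read(20 * n))]

n, = struct.unpack('<I', lst.read(4))
wpks = [wpk.rstrip(b'\x00') for wpk in struct.unpack("20s" * n, lst.read(20 * n))]

print(len(names), len(wpks))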
The first table contains 50014 entries and seems to be a list of every name recognized by the program.
The second table contains 9342 entries. It is a shorter list of names, but each has the suffix .wpk. I suspect this is the list of names for which audio files exist.

There is no data about how names from the first table get mapped to the second table - there is literally no other data contained in this file. The above parser never seeks - it reads the indicated number of records, and ends up exactly at the end of the file.
Googling "wpk audio format" surfaces the following piece of software: https://github.com/Morilli/bnk-extract - this looks like exactly the right type of thing. It is a tool for unpacking game audio packs...