eagleflo / mpyq

Python library for reading MPQ archives.
BSD 2-Clause "Simplified" License

Invalid hash_table_offset? #19

Closed GraylinKim closed 11 years ago

GraylinKim commented 11 years ago

This file has a length of 65892 bytes but archive.header has the following offset values:

As a result, the table data is read as an empty string, and unpacking the table entries raises a struct.error:

  File "/home/graylinkim/projects/sc2reader/env/bin/mpyq", line 8, in <module>
    load_entry_point('mpyq==0.2.0', 'console_scripts', 'mpyq')()
  File "/home/graylinkim/projects/mpyq/mpyq.py", line 392, in main
    archive = MPQArchive(args.file)
  File "/home/graylinkim/projects/mpyq/mpyq.py", line 99, in __init__
    self.hash_table = self.read_table('hash')
  File "/home/graylinkim/projects/mpyq/mpyq.py", line 170, in read_table
    return [unpack_entry(i) for i in range(table_entries)]
  File "/home/graylinkim/projects/mpyq/mpyq.py", line 168, in unpack_entry
    struct.unpack(entry_class.struct_format, entry_data))
struct.error: unpack requires a string argument of length 16
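
The error at the bottom of the traceback can be reproduced in isolation. This is a minimal sketch, assuming a 16-byte hash table entry made of two 32-bit, two 16-bit and one 32-bit unsigned integers; `HASH_ENTRY_FORMAT` and `try_unpack` are illustrative names, not part of mpyq. On Python 3 the wording of the message differs from the Python 2 message shown above.

```python
import struct

# Assumed 16-byte entry layout: two uint32, two uint16, one uint32.
HASH_ENTRY_FORMAT = '<2I2HI'


def try_unpack(entry_data):
    """Return the unpacked tuple, or the struct.error message on failure."""
    try:
        return struct.unpack(HASH_ENTRY_FORMAT, entry_data)
    except struct.error as exc:
        return str(exc)


# A full 16-byte entry unpacks; an empty read (offset past end of file) fails.
print(try_unpack(b'\x00' * 16))
print(try_unpack(b''))
```

An empty slice is what a file read returns once the requested offset lies beyond the end of the file, which is why the failure surfaces in `struct.unpack` rather than in the read itself.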

I'm inclined to say that the MPQ file is corrupt and that there is nothing we can do, but the person who submitted the files (see GraylinKim/sc2reader#100) says that it came directly from his SCII client, where it opens just fine. @dsjoerg says that he has received several more files with similar issues and can provide them.

Is there some way we can prove this one way or another?

eagleflo commented 11 years ago

Ok, a couple of things are at play here.

The first thing I noticed when looking at the file is that the MPQ file header is off by one byte: it starts at byte 1025 instead of the usual 1024. The file's user data header still says that the file header starts at byte 1024. I have no idea why it's lying. Compare these two:

$ hexdump -C test.SC2Replay
00000000  4d 50 51 1b 00 02 00 00  00 04 00 00 3c 00 00 00  |MPQ.........<...|
00000010  05 08 00 02 2c 53 74 61  72 43 72 61 66 74 20 49  |....,StarCraft I|
00000020  49 20 72 65 70 6c 61 79  1b 31 31 02 05 0c 00 09  |I replay.11.....|
00000030  02 02 09 02 04 09 00 06  09 02 08 09 86 fd 01 0a  |................|
00000040  09 da f0 01 04 09 04 06  09 88 a3 01 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400  4d 50 51 1a 2c 00 00 00  47 45 00 00 01 00 03 00  |MPQ.,...GE......|
00000410  a7 43 00 00 a7 44 00 00  10 00 00 00 0a 00 00 00  |.C...D..........|
00000420  00 00 00 00 00 00 00 00  00 00 00 00 05 1c 00 04  |................|
00000430  01 00 04 05 12 00 02 10  48 45 49 44 45 47 45 52  |........HEIDEGER|
00000440  02 05 08 00 09 04 02 07  00 00 53 32 04 09 02 08  |..........S2....|
00000450  09 b2 c6 16 04 02 0c 54  65 72 72 61 6e 06 05 08  |.......Terran...|
00000460  00 09 fe 03 02 09 e8 02  04 09 28 06 09 3c 08 09  |..........(..<..|
00000470  04 0a 09 02 0c 09 c8 01  0e 09 00 10 09 04 05 12  |................|
00000480  00 02 08 61 72 6b 78 02  05 08 00 09 04 02 07 00  |...arkx.........|
00000490  00 53 32 04 09 02 08 09  86 c4 56 04 02 0e 50 72  |.S2.......V...Pr|
000004a0  6f 74 6f 73 73 06 05 08  00 09 fe 03 02 09 00 04  |otoss...........|
000004b0  09 84 01 06 09 fe 03 08  09 00 0a 09 00 0c 09 00  |................|
000004c0  0e 09 00 10 09 02 02 02  1a 53 63 72 61 70 20 53  |.........Scrap S|
000004d0  74 61 74 69 6f 6e 04 02  00 06 05 02 00 02 16 4d  |tation.........M|
000004e0  69 6e 69 6d 61 70 2e 74  67 61 08 06 01 0a 09 c2  |inimap.tga......|
$ hexdump -C new_format.SC2Replay
00000000  4d 50 51 1b 00 02 00 00  00 04 00 00 3c 00 00 00  |MPQ.........<...|
00000010  05 08 00 02 2c 53 74 61  72 43 72 61 66 74 20 49  |....,StarCraft I|
00000020  49 20 72 65 70 6c 61 79  1b 31 31 02 05 0c 00 09  |I replay.11.....|
00000030  02 02 09 04 04 09 00 06  09 08 08 09 e0 85 03 0d  |................|
00000040  0a 09 e0 85 03 04 09 04  06 09 92 8b 02 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400  00 4d 50 51 1a d0 00 00  00 53 fc 00 00 03 00 05  |.MPQ.....S......|
00000410  00 93 fa 00 00 93 fb 00  00 10 00 00 00 0c 00 00  |................|
00000420  00 00 00 00 00 00 00 00  00 00 00 00 00 53 fc 00  |.............S..|
00000430  00 00 00 00 00 88 f9 00  00 00 00 00 00 34 f9 00  |.............4..|
00000440  00 00 00 00 00 00 01 00  00 00 00 00 00 c0 00 00  |................|
00000450  00 00 00 00 00 00 00 00  00 00 00 00 00 44 00 00  |.............D..|
00000460  00 00 00 00 00 fb 00 00  00 00 00 00 00 00 40 00  |..............@.|
00000470  00 46 7c df 51 54 a5 9f  d4 82 4d dc c0 8a 12 2f  |.F|.QT....M..../|
00000480  99 2b 29 fe 11 70 51 54  82 fc 81 84 4d b4 70 ea  |.+)..pQT....M.p.|
00000490  cb 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000004a0  00 d7 e0 31 1f f2 66 73  3d f6 2f 91 87 e5 bc 6e  |...1..fs=./....n|
000004b0  a6 3b 8d 27 20 60 3b 38  62 52 17 63 d3 ea 64 4a  |.;.' `;8bR.c..dJ|
000004c0  da f2 0b d1 be 35 44 82  13 a2 60 27 43 bf b2 20  |.....5D...`'C.. |
000004d0  4a 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |J...............|
000004e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Notice the one extra "." (a 0x00 byte at offset 0x400) before the MPQ\x1a signature. I haven't seen MPQs with invalid header offsets before, so mpyq does not contain any workarounds for them.
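
The mismatch can be checked programmatically. The sketch below assumes the user data header is four little-endian fields (signature, user data size, MPQ header offset, user data header size); `declared_header_offset` and `actual_header_offset` are hypothetical helpers, not mpyq functions. The sample bytes are the first 16 bytes of the dumps above, zero padding up to 0x400, and the shifted signature from new_format.SC2Replay.

```python
import struct

MPQ_HEADER_MAGIC = b'MPQ\x1a'
USER_DATA_MAGIC = b'MPQ\x1b'


def declared_header_offset(data):
    """Read the MPQ header offset stored in the user data header at byte 0."""
    magic, _size, mpq_header_offset, _header_size = struct.unpack(
        '<4sIII', data[:16])
    if magic != USER_DATA_MAGIC:
        raise ValueError('no user data header at byte 0')
    return mpq_header_offset


def actual_header_offset(data, declared, window=16):
    """Search up to `window` bytes past `declared` for the signature.

    Returns the position found, or -1 if the signature is absent.
    """
    end = declared + window + len(MPQ_HEADER_MAGIC)
    return data.find(MPQ_HEADER_MAGIC, declared, end)


user_data_header = bytes.fromhex('4d50511b 00020000 00040000 3c000000')
sample = user_data_header + bytes(1024 - 16) + b'\x00MPQ\x1a'

declared = declared_header_offset(sample)
actual = actual_header_offset(sample, declared)
print(declared, actual)
```

Scanning a small window is only one possible workaround; whether a reader should tolerate a shifted header at all is a separate question.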

Taking that into account, the file header says that the MPQ format is now 0x03, meaning that this is a newer MPQ format than what this library is used to. Back in 2010 when I originally developed this library the MPQ version used was 0x01. It seems like the reference I used back then at http://www.zezula.net/en/mpq/mpqformat.html has been updated to cover formats 3 and 4. There are new HET and BET tables present in these archives.
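
To make the header fields concrete, here is a sketch that decodes the bytes at 0x400 of new_format.SC2Replay, copied from the dump above. It assumes the first 32 bytes of the file header are laid out as signature, two 32-bit sizes, two 16-bit fields and four 32-bit table fields; the field names and the `parse_header` helper are illustrative. The same bytes are parsed twice: once starting at the signature, and once starting one byte early, as would happen if a reader trusted the declared offset of 1024.

```python
import struct
from collections import namedtuple

MPQFileHeader = namedtuple('MPQFileHeader', [
    'magic', 'header_size', 'archive_size', 'format_version',
    'sector_size_shift', 'hash_table_offset', 'block_table_offset',
    'hash_table_entries', 'block_table_entries'])

HEADER_FORMAT = '<4s2I2H4I'  # 32 bytes, little-endian


def parse_header(raw):
    """Decode the first 32 bytes of `raw` as an MPQ file header."""
    return MPQFileHeader._make(struct.unpack(HEADER_FORMAT, raw[:32]))


# Bytes 0x400 through 0x420 of new_format.SC2Replay.
dump = bytes.fromhex(
    '00 4d 50 51 1a d0 00 00 00 53 fc 00 00 03 00 05'
    '00 93 fa 00 00 93 fb 00 00 10 00 00 00 0c 00 00'
    '00')

aligned = parse_header(dump[1:])   # starts at the real signature (byte 1025)
misaligned = parse_header(dump)    # starts at the declared offset (byte 1024)

print(aligned.format_version, aligned.hash_table_offset)
print(misaligned.magic, misaligned.hash_table_offset)
```

Read from the signature, the header reports a version field of 3 and a hash table offset of 64147. Read one byte early, the signature no longer matches and the hash table offset comes out at roughly 16 million, far beyond the 64595-byte archive size the aligned header reports.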

It seems like I need to roll up my sleeves a little and start supporting MPQ formats 3 and 4. I was planning to test mpyq with Diablo III files back when it came out, but I lost interest in that game too quickly. Now that Heart of the Swarm is coming soon, it would make sense to test this library with the latest versions of the game.

In my mind the two issues listed above are separate. I can't see any mention of headers now being off by one in the reference, so I'm also tempted to call this a corrupted MPQ file, albeit one of a newer version. I'm closing this issue and opening a new one for MPQ formats 3 and 4. If invalid header offsets turn out to be a recurring phenomenon, let's open a new issue for them. In any case, mpyq should raise an error and fail gracefully when it can't find the header at the correct offset.
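
One way the graceful failure could look is sketched below. `MPQHeaderNotFound` and `check_header_magic` are hypothetical names chosen for illustration; mpyq does not define them, and this is not a description of how the library ended up handling the case.

```python
MPQ_HEADER_MAGIC = b'MPQ\x1a'


class MPQHeaderNotFound(ValueError):
    """Hypothetical error raised when the header signature is missing."""


def check_header_magic(data, offset):
    """Return `offset` if the MPQ signature starts there, else raise."""
    found = data[offset:offset + 4]
    if found != MPQ_HEADER_MAGIC:
        raise MPQHeaderNotFound(
            'expected MPQ header signature at byte %d, found %r'
            % (offset, found))
    return offset


# The shifted header from this issue fails the check with a readable message.
try:
    check_header_magic(b'\x00MPQ\x1a', 0)
except MPQHeaderNotFound as exc:
    message = str(exc)
    print(message)
```

Checking the signature before reading any table offsets means the caller sees a message about the header rather than a struct.error from deep inside table parsing.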