eagleflo / mpyq

Python library for reading MPQ archives.
BSD 2-Clause "Simplified" License
100 stars 23 forks source link

MPQ Readfile Error #7

Closed GraylinKim closed 13 years ago

GraylinKim commented 13 years ago

Hi, I've been using your library for sometime now as a dependency to my sc2reader library. One of my users recently reported an issue that stems from your read_file function; in particular:

 File "XXXX/mpyq/mpyq.py", Line 203, in read_file:
     for i in range(len(positions) - (2 if crc else 1)):

  UnboundLocalError: local variable `crc` referenced before assignment

Its a pretty simple issue but have trying to resolve it myself, I don't know the MPQ format well enough to devise the correct fix here. I have a file that generates this error, how should I send it to you?

eagleflo commented 13 years ago

I can guess where the error comes from, and I'll devise a potential fix for the problem. I'd be very interested in the file anyway, so if you could upload it somewhere and post a link into this issue I would be very grateful.

eagleflo commented 13 years ago

I've uploaded a new point release that should contain a fix for the UnboundLocalError at least. I have no access to a MPQ file that doesn't have the MPQ_FILE_SECTOR_CRC flag set, however, so I'm still very interested in the file that triggered this error.

alexhanh commented 13 years ago

It's this replay http://drop.sc/61/d

eagleflo commented 13 years ago

Ok, either the file has completely messed up block table or my library has a bug in it and somehow reads it wrong. This could happen if changes are written to an MPQ file but its block table is not updated accordingly. I'll take a closer look and see if I can somehow get my library to work with this file. Thanks for reporting the issue!

alexhanh commented 13 years ago

I'm currently using a library called phpsc2replay to parse the replays. It seems to parse everything correctly, except for the messages. I opened this replay with a program called Sc2gears and it showed the following messages as output:

http://i.min.us/ilPplo.png

Those "sc2.replays.net" comments hit my eye.. they could be very well added afterwards. Could this be causing the problems?

eagleflo commented 13 years ago

They've definitely been added afterwards, and that is what seems to be causing trouble. It might still be a problem in the library that only these altered files trigger. I need to investigate more.

alexhanh commented 13 years ago

You'll find several more replays from that same site here: http://sc2.replays.net/ (scroll to the bottom).. They all seem to have those messages added. Examples:

http://sc2.replays.net/replay/Sc2Replay.aspx?ReplayID=3639 http://sc2.replays.net/replay/Sc2Replay.aspx?ReplayID=3638 http://sc2.replays.net/replay/Sc2Replay.aspx?ReplayID=3637

(There's the "Download REP" -link)

alexhanh commented 13 years ago

Is the problem with the CRC/MD5 that wasn't recomputed after adding those messages? Could this be somehow detected by the lib and possibly recomputed?

I was able to open this replay with Ladik's MPQ editor, which is open source. Perhaps it would be helpful to look there? http://www.zezula.net/en/mpq/download.html#StormLib

I'd also be happy with it just reporting about a "corrupted" replay.

Kiitos!

alexhanh commented 13 years ago

Further investigating the replay using Ladik's MPQ editor I discovered a couple of things:

eagleflo commented 13 years ago

The exact part of the code that is supposed to deal with MPQs that have multiple sectors starts from https://github.com/arkx/mpyq/blob/master/mpyq.py#L191

Untouched StarCraft II replay files always have the MPQ_FILE_SINGLE_UNIT flag set, which is why I didn't see this problem earlier. I think I have a pretty good clue about what's wrong with my library's current code, let's see if I can solve the problem.

Big thanks for the help so far and don't hesitate to fork and create a pull request if you can fix it by yourself before I do!

eagleflo commented 13 years ago

Specifically, the local variable positions seems to be a random garbage value instead of something meaningful (https://github.com/arkx/mpyq/blob/master/mpyq.py#L201). This makes me think it's a bug in the library and specifically in how it handles MPQ files without the MPQ_FILE_SINGLE_UNIT flag.

eagleflo commented 13 years ago

phpsc2replay doesn't even attempt to read the file '(listfile)', which is exactly the troublesome file in the replay. It goes straight for the replay files instead. While that's ok for a project that's 100% focused on StarCraft II replays, this library is supposed to work with all kinds of MPQ files.

alexhanh commented 13 years ago

phpsc2replay still fails to parse the messages correctly (try uploading the replay here http://kuukkeli.ath.cx/ooaemkee/SC2Replay/upload_file.php and see the messages section). I've pulled a couple of hairs out trying to figure out why it fails. It seems that "unknown compression" is triggered for the replay.message.events file.

eagleflo commented 13 years ago

From my tests it seems that it's '(listfile)' that causes the error—of course replay.message.events might be corrupted as well. The problem seems to be that the file has been tampered with and currently mpyq lacks support for this kind of archives.

$ ./mpyq.py -Hb buggy.SC2Replay 
MPQ archive hash table
----------------------
 Hash A   Hash B  Locl Plat BlockIdx
D38437CB 07DFEAEC 0000 0000 00000009
AAC2A54B F4762B95 0000 0000 00000002
FFFFFFFF FFFFFFFF FFFF FFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFF FFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFF FFFF FFFFFFFF
C9E5B770 3B18F6B6 0000 0000 00000005
343C087B 278E3682 0000 0000 00000004
3B2B1EA0 B72EF057 0000 0000 00000006
5A7E8BDC FF253F5C 0000 0000 00000001
FD657910 4E9B98A7 0000 0000 00000008
D383C29C EF402E92 0000 0000 00000000
FFFFFFFF FFFFFFFF FFFF FFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFF FFFF FFFFFFFF
FFFFFFFF FFFFFFFF FFFF FFFF FFFFFFFF
1DA8B0CF A2CEFF28 0000 0000 00000007
31952289 6A5FFAA3 0000 0000 00000003

MPQ archive block table
-----------------------
 Offset  ArchSize RealSize  Flags
0000002C      475      475 81000200
00000207      660     1195 81000200
0000049B    11182    38694 81000200
000036EB      133      133 80000200
000030A5       96       96 81000200
00003105      355      450 81000200
00003268      513      787 81000200
00003469      245      528 81000200
00003770      112      190 80010200  <-- (listfile), error
000037E0      218      288 80000200

Traceback (most recent call last):
  File "./mpyq.py", line 363, in <module>
    main()
  File "./mpyq.py", line 349, in main
    archive = MPQArchive(args.file)
  File "./mpyq.py", line 95, in __init__
    self.files = self.read_file('(listfile)').splitlines()
  File "./mpyq.py", line 208, in read_file
    sector = decompress(sector)
  File "./mpyq.py", line 174, in decompress
    compression_type = ord(data[0])
IndexError: string index out of range

The "unknown compression" error you see is likely the result of corruption: typically with compressed files inside a MPQ archive the first byte tells which compression type was used with the file. In the case of StarCraft II replay files only DEFLATE and bzip2 are ever used.

eagleflo commented 13 years ago

I'd be really interested in how some other MPQ libraries deal with files like this. I'm currently taking a look at StormLib.

Blizzard itself for sure doesn't supply any files like this—I've tried looking into all sorts of MPQ files that come with StarCraft II and World of Warcraft and I have never seen this error come up.

eagleflo commented 13 years ago

Ok, the (listfile) has been encrypted for some reason after the tampering. That's why it chokes on seemingly random garbage. mpyq does not support encrypted archives yet, but it has been on my "to do" list, see issue #3.

GraylinKim commented 13 years ago

Thanks for the quick follow up Aku.

Will you be looking to resolve the encryption issue in the near future?

eagleflo commented 13 years ago

Reading through the source code of StormLib, it appears that there is a special flag for cases when the (listfile) is not valid (MPQ_FLAG_LISTFILE_VALID).

I read a little about the encryption used and it appears to be very simple. However, I could still not get a correct listfile out of the tampered replay. I need to find a normal encrypted file to test with before I can continue implementing support for encryption.

alexhanh commented 13 years ago

Perhaps this one? http://drop.sc/67

Not sure if it's been tampered with tho!

GraylinKim commented 13 years ago

Would you be able to write up a script that would be able to tell if a file has "normal encryption" so that we could help you find a couple examples?

I'm not sure how one would go about finding encrypted replay files.

alexhanh commented 13 years ago

@GraylinKim,

There's a program called Ladik's MPQ editor (based on StormLib) that tells which flags the files contained in MPQ have. If you want to try it out, you can download the editor here http://www.zezula.net/en/mpq/download.html .

alexhanh commented 13 years ago

Not sure if we can find replays with non-tampered encrypted list file since Blizzard doesn't encrypt SC2Replays.

eagleflo commented 13 years ago

Any encrypted file within a MPQ archive would do. I haven't had time to browse around but I'd figure a lot of the assets within StarCraft II and World of Warcraft are encrypted.

But indeed, StarCraft II replays normally only have unencrypted single unit files. Whichever library that replay site is using to inject their ad into the replay should do a better job.

GraylinKim commented 13 years ago

How difficult would it be to fork off a SCII specific branch of this project that works around these problems for sc2replay files specifically?

If its not too much adaption effort, that might be the best route to go down in the short term until we can find (somehow) some encrypted replays.

eagleflo commented 13 years ago

I'm not going to do it, but you're welcome to do so if you wish.

The problem is actually triggered because during the creation of a MPQArchive object mpyq tries to read the MPQ's listfile to get a full listing of files inside the archive. In this replay's case it can't do so because the listfile is garbage -- it might be just encrypted, but I couldn't decrypt it using the methods described in the MPQ references listed in the README.

I could add some sort of argument to the constructor of MPQArchive that would skip reading the listfile, thus avoiding this particular problem. If you know which files you want to read from the archive in advance (like you do with StarCraft II replays), you can just read them from the archive yourself using the read_file() method.

Here is an imagined example:

>>> archive = mpyq.MPQArchive('buggy.SC2Replay', no_listfile=True)
# The 'files' attribute is unavailable now.
>>> archive.files
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'MPQArchive' object has no attribute 'files'
>>> archive.read_file('replay.details')
'\x05\x1c\x00\x04\x01\x00\x04\x05...'

Does this sound like an acceptable workaround? I actually already tried this and it seems to work. You could at least read all the untampered files this way, possibly even the tampered one.

GraylinKim commented 13 years ago

I'll probably look into this then, thank you. I think I will implement a constructor option to supply a list of files instead of extracting the list from the MPQ. I'll keep you posted on that.

eagleflo commented 13 years ago

I've now added the option to skip reading the listfile at arkx/mpyq@0f6c6b784. Give it a try and tell me if you still have trouble.

GraylinKim commented 13 years ago

Thanks for the patch arkx!

Its working great now, would you be able to bump the version and push this to pypi so we don't need to instruct people to get a repository clone?

Thanks again, you've been very helpful with this.

eagleflo commented 13 years ago

Sure, done.

alexhanh commented 13 years ago

We have a problem with the 0.1.6 release. The tampered file's (replay.message.events) compressed size isn't apparently updated after it has been decompressed, modified and compressed again by the villains.

MPQ archive block table
-----------------------
 Offset  ArchSize RealSize  Flags
0000002C      475      475 81000200
00000207      660     1195 81000200
0000049B    11182    38694 81000200
000036EB      133      133 80000200
000030A5       96       96 81000200
00003105      355      450 81000200
00003268      513      787 81000200
00003469      245      528 81000200
00003770      112      190 80010200  <-- (listfile), error
000037E0      218      288 80000200

(replay.message.events is the 000036EB, 133, 133)

0.1.6 adds an if clause block_entry.size > block_entry.archived_size, so the decompression isn't run and replay.message.events appears as gibberish.

As this perhaps isn't something mpyq should be handling (tampered replay files with incorrect syntax), what would you recommend doing here? (It was working with 0.1.5 as the decompression was ran)

How important is the block_entry.size > block_entry.archived_size condition for un-compressed multi-unit blocks? And could it be block_entry.size >= block_entry.archived_size instead?

eagleflo commented 13 years ago

It's important: StarCraft II map files for example need that.

For now, depend on 0.1.5 until a better solution can be found.