Encoding Error for certain replays

RobertTheNerd commented 9 years ago

Hi Guys,

Thanks for the great work on sc2reader! As the developer of SC2Geeks, I'm using this awesome library to parse replays. It's been doing a great job, very much appreciated!

Lately, while processing the WCS 2014 S3 replay pack, I ran into errors that prevented some of the replays (download here) from being parsed. The error message is:

ERROR:root:Traceback (most recent call last):
  File "sc2parser.py", line 247, in parse_replay_dict
    replay = sc2reader.load_replay(replayFile, load_level=2 if is_summary else 4, load_map=load_map)
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/factories/sc2factory.py", line 85, in load_replay
    return self.load(Replay, source, options, **new_options)
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/factories/sc2factory.py", line 137, in load
    return self._load(cls, resource, filename=filename, options=options)
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/factories/sc2factory.py", line 146, in _load
    obj = cls(resource, filename=filename, factory=self, **options)
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/resources.py", line 262, in __init__
    self._read_data(data_file, self._get_reader(data_file))
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/resources.py", line 592, in _read_data
    self.raw_data[data_file] = reader(data, self)
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/readers.py", line 102, in __call__
    ) for i in range(data.read_bits(5))],
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/decoders.py", line 252, in read_aligned_string
    return self._buffer.read_string(count, encoding)
  File "/opt/python/python2.7/local/lib/python2.7/site-packages/sc2reader/decoders.py", line 108, in read_string
    return self.read_bytes(count).decode(encoding)
  File "/opt/python/python2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 23: invalid start byte

The byte and position vary for different replays. The calling script is essentially invoking sc2reader to parse a given replay file.
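
For reference, the last frame of the traceback can be reproduced without sc2reader. The snippet below is only a minimal sketch: the sample bytes and the `decode_lenient` helper are made up for illustration and are not part of sc2reader, and falling back to `errors="replace"` is just one possible way to tolerate bad bytes, not necessarily how the library handles them.

```python
# Hypothetical sample: 0x81 is a UTF-8 continuation byte, so it can never
# start a character. This mirrors the "invalid start byte" in the traceback.
SAMPLE = b"player \x81 name"


def decode_lenient(data, encoding="utf8"):
    """Decode strictly first; on failure, substitute U+FFFD for bad bytes.

    Illustration only (hypothetical helper, not sc2reader's API).
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    try:
        SAMPLE.decode("utf8")
    except UnicodeDecodeError as exc:
        print("strict decode failed:", exc)
    print("lenient decode:", repr(decode_lenient(SAMPLE)))
```

A strict decode raises the same kind of exception as in the traceback, while the lenient helper keeps the valid text and marks the bad byte with a replacement character.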

This error can be reproduced on both CentOS 6.5 and Ubuntu 14.04, with both Python 2.7 and Python 3.4. Since it's encoding-related, I checked the locale settings; below is the output of locale:

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

What's confusing is that there is no error when running on a Mac; the parse just runs through. Below is the output of locale on the Mac:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

As GGTracker and Spawning Tools can both parse the replays that fail for me, I'm not sure whether this is a bug in sc2reader or something that can be resolved by tweaking the Python environment. With my limited knowledge of Python, I tried one recommended approach, setting the default encoding to something other than UTF-8 for Python 2, but it didn't help.

PYTHONIOENCODING="ascii" python sc2parser.py /tmp/G1.SC2Replay
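
A note on why this is unlikely to help: PYTHONIOENCODING only controls the encoding of stdin, stdout and stderr, whereas the failing call in the traceback passes an explicit encoding to `.decode()`. The sketch below checks this by running a child interpreter under different PYTHONIOENCODING values; `run_with_io_encoding` and `SNIPPET` are hypothetical names used only for this illustration.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: an explicit UTF-8 decode of an invalid
# byte, the same kind of call as the last sc2reader frame in the traceback.
SNIPPET = r"""
try:
    b'\x81'.decode('utf8')
    print('decoded')
except UnicodeDecodeError:
    print('UnicodeDecodeError')
"""


def run_with_io_encoding(io_encoding):
    """Run SNIPPET in a child Python with PYTHONIOENCODING set.

    Hypothetical helper for illustration; returns what the child printed.
    """
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    return out.decode("ascii").strip()


if __name__ == "__main__":
    for enc in ("ascii", "utf-8"):
        print(enc, "->", run_with_io_encoding(enc))
```

Both settings give the same result, because the explicit decode never consults the I/O encoding.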

I appreciate you taking the time to look into this. Thanks!

Regards, Robert

StoicLoofah commented 9 years ago

So I don't remember doing anything special on spawningtool to get these replays to work. If it helps, here's what I have on my system:

locale -a
C
C.UTF-8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX

locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

StoicLoofah commented 9 years ago

Also, since this is more a discussion of usage than an sc2reader bug, I think it's better to post these sorts of issues to the mailing list http://groups.google.com/group/sc2reader

RobertTheNerd commented 9 years ago

Hi Kevin,

Thanks for your help and suggestion! My output of locale -a is a lot longer than yours (over 700 lines). Let me do a quick check and see whether this somehow interferes with the replay parsing. Afterwards, I'll post to the mailing list if the issue persists.

aa_DJ
aa_DJ.iso88591
aa_DJ.utf8
aa_ER
...
zu_ZA.iso88591
zu_ZA.utf8

Robert

RobertTheNerd commented 9 years ago

I tried removing the unneeded locales, but it didn't help. I also fired up a CentOS 7 machine and still have the problem.

I'll post the issue to the mailing list as Kevin suggested.

Thanks, Robert

RobertTheNerd commented 9 years ago

After trying on different virtual machines, I finally got this to work. The trick is to download the latest source code from GitHub and set PYTHONPATH to the downloaded folder instead of using the copy installed in the Python directory.
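
When switching between a pip-installed copy and a source checkout like this, it can be worth confirming which copy Python will actually import. The following is a minimal sketch using only the standard library (`importlib.util.find_spec`, available from Python 3.4); `locate_module` is a hypothetical helper name, not something sc2reader provides.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be imported from, or None.

    Hypothetical helper for illustration. The answer reflects sys.path,
    which PYTHONPATH entries are placed ahead of site-packages on.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Prints a path inside the checkout if PYTHONPATH is taking effect,
    # a site-packages path otherwise, or None if sc2reader is not found.
    print(locate_module("sc2reader"))
```

Running it once with PYTHONPATH set and once without should show the path change from the checkout to site-packages.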

I then did a folder comparison and found that even though the version on GitHub is 0.6.4, which is behind the one installed using pip (0.6.5), readers.py has an extra line that seems to have resolved the issue I'm having.

I guess I'll just use the latest source code from GitHub instead of installing into Python's lib folder.

Thanks everyone!

Robert

GraylinKim / sc2reader

Encoding Error for certain replays #182