Encoding of raw data in datasets?

From the tutorial it would seem that all CoNLL16st data, including the raw text files, is encoded in UTF-8. However, three raw files in conll16st-en-01-12-16-train fail to decode as UTF-8 (other files and languages seem to be OK):

./conll16st-en-train/raw/wsj_1069: 'utf8' codec can't decode byte 0xd5 in position 923: invalid continuation byte
./conll16st-en-train/raw/wsj_1870: 'utf8' codec can't decode byte 0xd5 in position 8010: invalid continuation byte
./conll16st-en-train/raw/wsj_2055: 'utf8' codec can't decode byte 0xd5 in position 2512: invalid continuation byte

>>> import codecs
>>> f = codecs.open("./conll16st-en-train/raw/wsj_1069", 'r', encoding='utf8')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gw/Projects/StudentPhD/conll16st-multi-classifier-keras/venv/lib/python2.7/codecs.py", line 668, in read
    return self.reader.read(size)
  File "/home/gw/Projects/StudentPhD/conll16st-multi-classifier-keras/venv/lib/python2.7/codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd5 in position 923: invalid continuation byte

The tutorial should state which encoding is used for which files, or at least explain how such decoding errors were handled by the tokenizer.
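In the meantime, here is a minimal sketch of how the offending bytes could be located and worked around. The helper names (`first_invalid_utf8`, `decode_raw`, `read_raw`) are hypothetical and not part of the shared task scripts. The `mac_roman` fallback is only a guess: 0xd5 is the right single quotation mark in Mac OS Roman, which would fit an apostrophe in English text, but I have not confirmed that this is how the files were produced.

```python
# -*- coding: utf-8 -*-
import io


def first_invalid_utf8(data):
    """Return (offset, surrounding bytes) for the first byte that is not valid UTF-8, else None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        lo = max(0, err.start - 20)
        return err.start, data[lo:err.start + 20]
    return None


def decode_raw(data, fallback="mac_roman"):
    """Decode as UTF-8; on failure, decode the whole buffer with `fallback`.

    The fallback encoding is an assumption, not something documented for the dataset.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


def read_raw(path):
    # Read bytes first so that a bad byte cannot abort the read itself.
    with io.open(path, "rb") as f:
        return decode_raw(f.read())
```

With this, `read_raw("./conll16st-en-train/raw/wsj_1069")` should return text instead of raising, although whether the resulting character offsets still match the annotations would need to be checked separately.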

