attapol / conll16st

CoNLL 2016 Shared Task in English and Chinese Shallow Discourse Parsing

Encoding of raw data in datasets? #4

Closed gw0 closed 8 years ago

gw0 commented 8 years ago

According to the tutorial, all CoNLL16st data is encoded in UTF-8, including the raw text files. However, three files in conll16st-en-01-12-16-train fail to decode (the other files and languages seem to be fine):

./conll16st-en-train/raw/wsj_1069: 'utf8' codec can't decode byte 0xd5 in position 923: invalid continuation byte
./conll16st-en-train/raw/wsj_1870: 'utf8' codec can't decode byte 0xd5 in position 8010: invalid continuation byte
./conll16st-en-train/raw/wsj_2055: 'utf8' codec can't decode byte 0xd5 in position 2512: invalid continuation byte

>>> import codecs
>>> f = codecs.open("./conll16st-en-train/raw/wsj_1069", 'r', encoding='utf8')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gw/Projects/StudentPhD/conll16st-multi-classifier-keras/venv/lib/python2.7/codecs.py", line 668, in read
    return self.reader.read(size)
  File "/home/gw/Projects/StudentPhD/conll16st-multi-classifier-keras/venv/lib/python2.7/codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd5 in position 923: invalid continuation byte

The tutorial should state which encoding is used for which files, or at least how such decoding errors were handled by the tokenizer.
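
To list every affected file in one pass instead of opening files one at a time, a minimal sketch along these lines could be used. It is written for Python 3 (the session above is Python 2.7), and `find_non_utf8_files` is a hypothetical helper name, not part of the shared task code.

```python
import os


def find_non_utf8_files(raw_dir):
    """Return (path, offending_byte, position) for each file in raw_dir
    that cannot be decoded as UTF-8. Hypothetical helper for illustration."""
    problems = []
    for name in sorted(os.listdir(raw_dir)):
        path = os.path.join(raw_dir, name)
        if not os.path.isfile(path):
            continue
        # Read bytes so that decoding is an explicit, checkable step.
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the byte offset of the first undecodable byte.
            problems.append((path, data[err.start], err.start))
    return problems


# Example call (the directory is the one from the error messages above):
# for path, byte, pos in find_non_utf8_files("./conll16st-en-train/raw"):
#     print("%s: byte 0x%02x at position %d" % (path, byte, pos))
```

Only the first undecodable byte per file is reported, which matches what the `UnicodeDecodeError` messages show.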

attapol commented 8 years ago

The raw files are taken directly from the Penn Treebank (PTB). I compared the gold-standard parse with the raw text file from the PTB, and it seems to be simply an error in the raw file. I am inclined not to change it, since it comes from the PTB.
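
Since the raw files stay as they are, readers who hit the error need to handle it when loading. One possible workaround, sketched here for Python 3, is to decode as UTF-8 with the standard `errors="replace"` handler. The helper name `read_raw_text` is hypothetical, and this is not necessarily how the shared task tokenizer dealt with these bytes.

```python
import io


def read_raw_text(path, errors="replace"):
    """Read a raw text file as UTF-8 without raising on invalid bytes.

    Hypothetical helper: with errors="replace", each undecodable byte
    becomes the Unicode replacement character U+FFFD.
    """
    with io.open(path, "r", encoding="utf-8", errors=errors) as f:
        return f.read()


# Example call (path taken from the error messages in this thread):
# text = read_raw_text("./conll16st-en-train/raw/wsj_1069")
# print(text.count(u"\ufffd"), "replaced byte(s)")
```

The `"replace"` handler is used here rather than `"ignore"` because it keeps one character in place of each bad byte, whereas `"ignore"` drops the byte and shifts every later character position. That matters only if downstream annotations index into the raw text by character offset, which is an assumption about how the data is consumed.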