gw0 closed 8 years ago
The raw files are taken directly from the Penn Treebank (PTB). I compared the gold-standard parse with the raw text file from the PTB, and it appears to be simply an error in the raw file. I am inclined not to change it, since it comes from the PTB.
From the tutorial it would seem that all data for CoNLL16st is encoded in utf8, including raw text files. Unfortunately there are three encoding problems in conll16st-en-01-12-16-train (other files and languages seem to be ok):

The tutorial should include information on which encoding is used where, or at least on how such errors were handled in the tokenizer.
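For anyone who wants to reproduce the check, here is a minimal sketch of how such problems could be located. It assumes the raw text files sit together in one directory; the function names `find_utf8_errors` and `scan_raw_dir` are made up for this example and are not part of the shared task tooling.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def find_utf8_errors(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (byte offset, offending bytes) for each invalid UTF-8 sequence."""
    errors = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises on the first invalid sequence it meets.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice, so shift by pos.
            start = pos + exc.start
            end = pos + exc.end
            errors.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad bytes
    return errors


def scan_raw_dir(raw_dir: str) -> Dict[str, List[Tuple[int, bytes]]]:
    """Map each file name in raw_dir (hypothetical layout) to its UTF-8 errors."""
    report = {}
    for path in sorted(Path(raw_dir).iterdir()):
        if path.is_file():
            errors = find_utf8_errors(path.read_bytes())
            if errors:
                report[path.name] = errors
    return report
```

Calling `scan_raw_dir` on the directory holding the raw files (the exact path depends on how the dataset was unpacked) returns only the files that fail strict UTF-8 decoding, together with the byte offsets involved. It does not guess what the intended encoding was.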