facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Spacy support for unicode #789

Closed haow85 closed 5 years ago

haow85 commented 6 years ago

My original motivation was to make ParlAI support Chinese, so I dug up some information about spaCy. In the ParlAI code there is a line spacy.load("en"), so I was looking for something like spacy.load("ch"). It turns out that spaCy's "Models" page lists no Chinese model, only a multi-language model. The multi-language model appears to be trained on Wikipedia, so I guess it should support Chinese?

So I set self.NLP = spacy.load('xx_ent_wiki_sm') in parlai/core/dict.py. When I input a UTF-8 Chinese question, I got the following error:

Enter Your Message: 蓝色是颜色. \n 蓝色是什么?
Traceback (most recent call last):
  File "interactive.py", line 56, in <module>
    interactive(setup_args().parse_args())
  File "interactive.py", line 45, in interactive
    world.parley()
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/worlds.py", line 240, in parley
    acts[1] = agents[1].act()
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 177, in act
    ex = self._build_ex(self.observation)
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 259, in _build_ex
    inputs['document'], doc_spans = self.word_dict.span_tokenize(document)
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 324, in span_tokenize
    return self.spacy_span_tokenize(text)
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 296, in spacy_span_tokenize
    tokens = self.NLP.tokenizer(text)
  File "tokenizer.pyx", line 100, in spacy.tokenizer.Tokenizer.__call__
  File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-14: surrogates not allowed
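A note on where "surrogates not allowed" usually comes from (my reading of the traceback, not something ParlAI's code confirms): when the terminal locale is ASCII/POSIX, Python 3 decodes stdin with the surrogateescape error handler, so each raw UTF-8 byte of the Chinese input becomes a lone surrogate such as '\udce8'. spaCy then tries to hash the string as UTF-8 and fails. A minimal stdlib-only repro of that scenario:

```python
# Hypothetical repro: simulate stdin arriving under an ASCII locale.
raw = "蓝色是颜色".encode("utf-8")               # the bytes actually typed
text = raw.decode("ascii", "surrogateescape")    # what the agent would receive

try:
    text.encode("utf-8")                         # what spaCy's hash does internally
except UnicodeEncodeError as err:
    print(err.reason)                            # prints: surrogates not allowed
```

If this is the cause, the string reaching spacy_span_tokenize was already corrupt, and no amount of re-encoding inside the tokenizer call can help without undoing the surrogateescape step first.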

I tried encode('utf-8') and encode('utf-8').decode('utf-8'); neither worked. So I switched to encode('gbk') as follows, and the program still fails:

Enter Your Message: 蓝色是颜色. \n 蓝色是什么?
Traceback (most recent call last):
  File "interactive.py", line 56, in <module>
    interactive(setup_args().parse_args())
  File "interactive.py", line 45, in interactive
    world.parley()
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/worlds.py", line 240, in parley
    acts[1] = agents[1].act()
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 177, in act
    ex = self._build_ex(self.observation)
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 259, in _build_ex
    inputs['document'], doc_spans = self.word_dict.span_tokenize(document)
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 324, in span_tokenize
    return self.spacy_span_tokenize(text)
  File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 296, in spacy_span_tokenize
    tokens = self.NLP.tokenizer(text.encode('gbk').decode('utf-8'))
UnicodeEncodeError: 'gbk' codec can't encode character '\udce8' in position 0: illegal multibyte sequence
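This second traceback is consistent with the surrogateescape theory: '\udce8' at position 0 is the escaped byte 0xE8, the first UTF-8 byte of 蓝, and gbk cannot encode lone surrogates any more than utf-8 can. If that diagnosis is right, a sketch of a fix (an assumption on my part, not ParlAI's documented behavior) is to round-trip the string through surrogateescape to recover the raw bytes, then decode them as UTF-8:

```python
# Sketch, assuming the input was mis-decoded with surrogateescape:
# re-encode with surrogateescape to recover the original bytes, then
# decode those bytes as the UTF-8 they really are.
text = "蓝色是颜色".encode("utf-8").decode("ascii", "surrogateescape")  # corrupt input
fixed = text.encode("ascii", "surrogateescape").decode("utf-8")
print(fixed)  # prints: 蓝色是颜色
```

Alternatively, preventing the bad decode in the first place (e.g. running with PYTHONIOENCODING=utf-8 or a UTF-8 locale) may avoid the error without touching ParlAI's code at all.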

alexholdenmiller commented 6 years ago

Marking as "Help Wanted" as I'm not sure how to fix this myself.

stephenroller commented 5 years ago

Closing due to staleness. spaCy is several versions ahead now, and no one has worked on this. Please reopen if you still need this.