My original motivation was to make ParlAI support Chinese, so I dug up some information about spaCy. In the ParlAI code there is a line spacy.load("en"), so I was looking for something like spacy.load("zh"). It turns out spaCy's "Models" page lists no Chinese model, only a multi-language model. That model appears to be trained on Wikipedia, so I guessed it should support Chinese?
So I did self.NLP = spacy.load('xx_ent_wiki_sm') in parlai/core/dict.py. When I input a UTF-8 Chinese question, I got the following error:
Enter Your Message: 蓝色是颜色. \n 蓝色是什么?
Traceback (most recent call last):
File "interactive.py", line 56, in <module>
interactive(setup_args().parse_args())
File "interactive.py", line 45, in interactive
world.parley()
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/worlds.py", line 240, in parley
acts[1] = agents[1].act()
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 177, in act
ex = self._build_ex(self.observation)
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 259, in _build_ex
inputs['document'], doc_spans = self.word_dict.span_tokenize(document)
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 324, in span_tokenize
return self.spacy_span_tokenize(text)
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 296, in spacy_span_tokenize
tokens = self.NLP.tokenizer(text)
File "tokenizer.pyx", line 100, in spacy.tokenizer.Tokenizer.call
File "strings.pyx", line 21, in spacy.strings.hash_string
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-14: surrogates not allowed
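The "surrogates not allowed" message suggests the input string already contains lone surrogate code points before it reaches the tokenizer. My guess at the cause: somewhere upstream the raw UTF-8 bytes were decoded with a narrow codec using errors='surrogateescape', which smuggles each undecodable byte in as a surrogate like '\udce8' (the exact character named in the second traceback below). A minimal reproduction of that situation, assuming this is indeed what happened:

```python
# Simulate UTF-8 bytes mis-decoded with errors='surrogateescape'.
raw = "蓝色".encode("utf-8")
bad = raw.decode("ascii", errors="surrogateescape")
print(repr(bad))  # lone surrogates such as '\udce8'

# This is exactly the error spaCy's hash_string raises:
try:
    bad.encode("utf-8")
except UnicodeEncodeError as e:
    print(e.reason)  # 'surrogates not allowed'

# Round-tripping through surrogateescape recovers the original bytes,
# which then decode cleanly as UTF-8:
fixed = bad.encode("ascii", errors="surrogateescape").decode("utf-8")
assert fixed == "蓝色"
```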
I tried encode('utf-8') and encode('utf-8').decode('utf-8'); neither worked. So I switched to encode('gbk') as follows, and the program still fails:
Enter Your Message: 蓝色是颜色. \n 蓝色是什么?
Traceback (most recent call last):
File "interactive.py", line 56, in <module>
interactive(setup_args().parse_args())
File "interactive.py", line 45, in interactive
world.parley()
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/worlds.py", line 240, in parley
acts[1] = agents[1].act()
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 177, in act
ex = self._build_ex(self.observation)
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/agents/drqa/drqa.py", line 259, in _build_ex
inputs['document'], doc_spans = self.word_dict.span_tokenize(document)
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 324, in span_tokenize
return self.spacy_span_tokenize(text)
File "/usr/local/miniconda/envs/py36/lib/python3.6/site-packages/parlai-0.1.0-py3.6.egg/parlai/core/dict.py", line 296, in spacy_span_tokenize
tokens = self.NLP.tokenizer(text.encode('gbk').decode('utf-8'))
UnicodeEncodeError: 'gbk' codec can't encode character '\udce8' in position 0: illegal multibyte sequence
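The gbk attempt fails for the same underlying reason: '\udce8' is a lone surrogate, and no real codec (gbk included) can encode it. If my assumption above is right and the surrogates come from errors='surrogateescape' somewhere in the input path, then the fix is not to change codecs but to recover the original bytes and decode them as UTF-8 before tokenizing. A hypothetical helper (not part of ParlAI or spaCy) sketching that repair:

```python
def fix_surrogates(text: str) -> str:
    """Turn lone surrogates back into real characters by recovering
    the raw bytes via surrogateescape and decoding them as UTF-8."""
    return text.encode("utf-8", errors="surrogateescape").decode("utf-8")

# Example: a string polluted with surrogate escapes...
polluted = "蓝色是什么?".encode("utf-8").decode("ascii", errors="surrogateescape")
# ...is restored to clean Unicode that spaCy can hash:
print(fix_surrogates(polluted))  # 蓝色是什么?
```

One could apply this to the text in spacy_span_tokenize, but a cleaner long-term fix would be to find where the input is first decoded (e.g. the terminal/locale settings around stdin) and make that read UTF-8 directly.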