AmoseKang / DrQA_cn


corenlp tokenizer doesn't work, why? #5

Open hoogang opened 6 years ago

hoogang commented 6 years ago
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f35015e5828>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
before (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 12672
child_fd: 21
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"
^CProcess ForkPoolWorker-34:
Process ForkPoolWorker-33:
Process ForkPoolWorker-25:
Process ForkPoolWorker-35:
Process ForkPoolWorker-31:
hoogang commented 6 years ago

I followed the setup: "install CoreNLP with the Chinese package according to the official CoreNLP instructions; you may specify the classpath in the environment or in the file drqa/tokenizers/Zh_tokenizer.py. Then you may download vectors and training sets to start your work." But it doesn't work.
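For reference, a minimal sketch of the environment-variable route mentioned in that setup step; the path below is only an example location, not a requirement of the project:

```shell
# Hedged example: point Java at the CoreNLP jars via CLASSPATH. Adjust the
# path to wherever the CoreNLP Chinese package was actually unpacked.
export CLASSPATH="/home/hugang/DrQA/data/corenlp/*"
echo "$CLASSPATH"
```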

hoogang commented 6 years ago

I tried:

from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()

and ran this script:

"python scripts/reader/preprocess.py data/datasets data/datasets --split SQuAD-v1.1-train --tokenizer corenlp"

Everything was OK.

I also tested CoreNLPTokenizer on Chinese word segmentation:

>>> from drqa.tokenizers import CoreNLPTokenizer
>>> tok = CoreNLPTokenizer()
[init tokenizer done]
>>> tok.tokenize('hello world 湖北省武汉市公共交通系统').words()
['hello', 'world', '湖北省', '武汉市', '公共', '交通', '系统']

The command line is also OK, as follows:

hugang@server-white:~$ java   -mx3g  -cp    "/home/hugang/DrQA/data/corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP    -annotators tokenize,ssplit,pos,lemma,ner -props StanfordCoreNLP-chinese.properties
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary -   edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [12.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [2.9 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [3.8 sec].

Entering interactive shell. Type q RETURN or EOF to quit.

NLP> 湖北省武安市 今天天气很不错 可以出去郊游
Sentence #1 (9 tokens):
湖北省武安市 今天天气很不错 可以出去郊游
[Text=湖北省 CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NR Lemma=湖北省 NamedEntityTag=GPE]
[Text=武安市 CharacterOffsetBegin=3 CharacterOffsetEnd=6 PartOfSpeech=NR Lemma=武安市 NamedEntityTag=GPE]
[Text=今天 CharacterOffsetBegin=7 CharacterOffsetEnd=9 PartOfSpeech=NT Lemma=今天 NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-XX-XX]
[Text=天气 CharacterOffsetBegin=9 CharacterOffsetEnd=11 PartOfSpeech=NN Lemma=天气 NamedEntityTag=O]
[Text=很 CharacterOffsetBegin=11 CharacterOffsetEnd=12 PartOfSpeech=AD Lemma=很 NamedEntityTag=O]
[Text=不错 CharacterOffsetBegin=12 CharacterOffsetEnd=14 PartOfSpeech=VA Lemma=不错 NamedEntityTag=O]
[Text=可以 CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=VV Lemma=可以 NamedEntityTag=O]
[Text=出去 CharacterOffsetBegin=17 CharacterOffsetEnd=19 PartOfSpeech=VV Lemma=出去 NamedEntityTag=O]
[Text=郊游 CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=VV Lemma=郊游 NamedEntityTag=O]

But when I run this script:

"python scripts/reader/preprocess.py data/datasets data/datasets --split webqa-test --tokenizer corenlp"

("webqa-test" is the test set for Chinese reading comprehension), it fails:

Traceback (most recent call last):
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "scripts/reader/preprocess.py", line 29, in init
    TOK = tokenizer_class(**options)
  File "/home/hugang/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 37, in __init__
    self._launch()
  File "/home/hugang/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 68, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f19c6d072b0>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
before (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 15458
child_fd: 21
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"

The error is the same as last time. I have tried lots of methods and still can't solve this problem, which leaves me confused. Please help me.

AmoseKang commented 6 years ago

First, you need to set the CoreNLP path directly in my code; you should find a FIXME in the ZhTokenizer class. Second, as an old issue reported, pexpect 4.4 may have unwanted behavior, so make sure you have the latest build. If you still hit the problem, try printing the actual command directly and see whether it works.
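A minimal sketch of that last debugging step: rebuild the launch command and print it, then paste the printed line into a shell and watch whether the "NLP>" prompt ever appears. The classpath and annotator list below are assumptions taken from the logs in this thread, not from the tokenizer source.

```python
# Hedged sketch: reconstruct the CoreNLP launch command so it can be tested
# directly in a shell, outside of pexpect. Paths are assumptions from the
# logs above; adjust them to your install.
classpath = '/home/hugang/DrQA/data/corenlp/*'
annotators = 'tokenize,ssplit,pos,lemma,ner'
cmd = ('java -mx3g -cp "%s" edu.stanford.nlp.pipeline.StanfordCoreNLP '
       '-annotators %s -props StanfordCoreNLP-chinese.properties'
       % (classpath, annotators))
print(cmd)  # paste the printed line into a terminal and wait for "NLP>"
```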

CCNUdhj commented 5 years ago

Hello, I'm a colleague from CCNU. I've run into this problem too and it has bothered me for a long time. How did you solve it? It's so frustrating.

CCNUdhj commented 5 years ago

> (quotes hoogang's comment above in full)

Hello, I'm a colleague from CCNU across the street. I've run into this problem too and it has bothered me for a long time. How did you solve it? It's so frustrating.

hoogang commented 5 years ago

> Hello, I'm a colleague from CCNU. I've run into this problem too... how did you solve it?

I solved the bug as follows:

python scripts/reader/preprocess.py data/datasets data/datasets --split webqa-test --tokenizer corenlp --workers 1

i.e., add the worker parameter (--workers 1).

But I have hit another problem: following DrQA, I used the processed Chinese Wikipedia data to build the TF-IDF model data.

hugang@server-white:~/DrQA$ python ./scripts/retriever/build_tfidf.py data/wikipedia/wiki_zhs.db  data/wikipedia --ngram 4 --hash-size 2 --tokenizer corenlp
12/27/2018 09:07:12 PM: [ Counting words... ]
12/27/2018 09:07:14 PM: [ Mapping... ]
12/27/2018 09:07:14 PM: [ -------------------------Batch 1/11------------------------- ]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
Process ForkPoolWorker-5:
Traceback (most recent call last):
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 111, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/pty_spawn.py", line 482, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.
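Why `--workers 1` can help: the tracebacks above show that the tokenizer is created inside a `multiprocessing.Pool` initializer, so N workers launch N CoreNLP JVMs at the same time. Each JVM loads large Chinese models; several starting at once can plausibly exceed pexpect's 60-second wait for the "NLP>" prompt. A minimal sketch of that structure (a plain object stands in for the tokenizer so this runs without Java):

```python
# Sketch mirroring scripts/reader/preprocess.py: one tokenizer per pool
# worker, created in the initializer. With processes=1 only one "JVM"
# starts, so initialization no longer races the pexpect timeout.
from multiprocessing import Pool

TOK = None  # per-worker global, as in preprocess.py

def init():
    global TOK
    # Real script: TOK = CoreNLPTokenizer()  -- one CoreNLP JVM per worker.
    TOK = object()  # stand-in so the sketch runs without Java

def tokenize(text):
    # Real script: return TOK.tokenize(text).words()
    return text.split()  # stand-in segmentation

if __name__ == '__main__':
    with Pool(processes=1, initializer=init) as pool:
        print(pool.map(tokenize, ['hello world', '湖北省 武汉市']))
```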
CCNUdhj commented 5 years ago

> (quotes hoogang's reply above in full)

Thanks for the answer. Could you leave some contact information? I've been working on this recently too, and we could discuss problems together.

hoogang commented 5 years ago

> (quoting the request above)

Contact me on QQ: 349359883

hoogang commented 5 years ago

> (quotes my original report above in full)

Replying to the question above: this bug is solved. On Linux, verified through repeated testing: upgrade to Java 11 and pexpect 4.6.0, everything at the latest versions. On macOS you don't need to worry.

AmoseKang commented 5 years ago

There could be many causes; analyze them against the specific error. First, I have received reports that certain versions of pexpect are problematic, so consider switching to a different pexpect version. Second, these errors may come from the tokenization command itself failing: try printing the exact command being executed and check whether it runs on the command line. Finally, I suggest moving to the official Facebook repository; my code has not been maintained for a long time, though the datasets and so on can still serve as a reference.

hoogang commented 5 years ago

> (quoting the advice above)

Thanks, everything was solved today.

lvs071103 commented 4 years ago

I ran into the same problem: the damned script added background-color escape strings to its output, so pexpect could not match the prompt.
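A sketch of one way around that, assuming the offending output is standard CSI color codes: strip ANSI escape sequences from the child's output before matching the prompt.

```python
import re

# CSI sequences such as "\x1b[40m" (background color) or "\x1b[0m" (reset).
CSI_RE = re.compile(r'\x1b\[[0-9;]*[A-Za-z]')

def strip_ansi(raw: str) -> str:
    """Remove terminal escape sequences so plain-text prompt matching works."""
    return CSI_RE.sub('', raw)

# If color codes are interleaved with the prompt characters, an exact match
# on "NLP>" never fires and pexpect times out:
colored = '\x1b[40mN\x1b[47mL\x1b[40mP\x1b[47m>\x1b[0m'
assert 'NLP>' not in colored
assert 'NLP>' in strip_ansi(colored)
```

Alternatively, spawning the child with `env={'TERM': 'dumb'}` (a standard `pexpect.spawn` keyword argument) usually stops programs from emitting color codes in the first place.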