AmoseKang / DrQA_cn


corenlp tokenizer doesn't work, why? #5

Open hoogang opened 6 years ago

hoogang commented 6 years ago
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f35015e5828>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
before (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 12672
child_fd: 21
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"
^CProcess ForkPoolWorker-34:
Process ForkPoolWorker-33:
Process ForkPoolWorker-25:
Process ForkPoolWorker-35:
Process ForkPoolWorker-31:
hoogang commented 6 years ago

I followed the setup: "install CoreNLP with the Chinese package according to the official CoreNLP instructions; you may specify the classpath in the environment or in the file drqa/tokenizers/Zh_tokenizer.py. Then you may download vectors and training sets to start your work." But it doesn't work.
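For reference, a minimal sketch of the environment-variable route mentioned in that setup step; the path below is only an example location, not a requirement of the project:

```shell
# Hedged example: point Java at the CoreNLP jars via CLASSPATH. Adjust the
# path to wherever the CoreNLP Chinese package was actually unpacked.
export CLASSPATH="/home/hugang/DrQA/data/corenlp/*"
echo "$CLASSPATH"
```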

hoogang commented 6 years ago

I tried:

from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()

and ran this script:

"python scripts/reader/preprocess.py data/datasets data/datasets --split SQuAD-v1.1-train --tokenizer corenlp"

Everything was OK.

I also tested CoreNLPTokenizer on Chinese word segmentation:

>>> from drqa.tokenizers import CoreNLPTokenizer
>>> tok = CoreNLPTokenizer()
[init tokenizer done]
>>> tok.tokenize('hello world 湖北省武汉市公共交通系统').words()
['hello', 'world', '湖北省', '武汉市', '公共', '交通', '系统']

The command line is also OK, as follows:

hugang@server-white:~$ java   -mx3g  -cp    "/home/hugang/DrQA/data/corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP    -annotators tokenize,ssplit,pos,lemma,ner -props StanfordCoreNLP-chinese.properties
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary -   edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [12.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [2.9 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [3.8 sec].

Entering interactive shell. Type q RETURN or EOF to quit.

NLP> 湖北省武安市 今天天气很不错 可以出去郊游
Sentence #1 (9 tokens):
湖北省武安市 今天天气很不错 可以出去郊游
[Text=湖北省 CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NR Lemma=湖北省 NamedEntityTag=GPE]
[Text=武安市 CharacterOffsetBegin=3 CharacterOffsetEnd=6 PartOfSpeech=NR Lemma=武安市 NamedEntityTag=GPE]
[Text=今天 CharacterOffsetBegin=7 CharacterOffsetEnd=9 PartOfSpeech=NT Lemma=今天 NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-XX-XX]
[Text=天气 CharacterOffsetBegin=9 CharacterOffsetEnd=11 PartOfSpeech=NN Lemma=天气 NamedEntityTag=O]
[Text=很 CharacterOffsetBegin=11 CharacterOffsetEnd=12 PartOfSpeech=AD Lemma=很 NamedEntityTag=O]
[Text=不错 CharacterOffsetBegin=12 CharacterOffsetEnd=14 PartOfSpeech=VA Lemma=不错 NamedEntityTag=O]
[Text=可以 CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=VV Lemma=可以 NamedEntityTag=O]
[Text=出去 CharacterOffsetBegin=17 CharacterOffsetEnd=19 PartOfSpeech=VV Lemma=出去 NamedEntityTag=O]
[Text=郊游 CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=VV Lemma=郊游 NamedEntityTag=O]

But when I run this script:

"python scripts/reader/preprocess.py data/datasets data/datasets --split webqa-test --tokenizer corenlp"

("webqa-test" is the test set for Chinese reading comprehension), it fails:

Traceback (most recent call last):
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "scripts/reader/preprocess.py", line 29, in init
    TOK = tokenizer_class(**options)
  File "/home/hugang/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 37, in __init__
    self._launch()
  File "/home/hugang/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 68, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f19c6d072b0>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
before (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 15458
child_fd: 21
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"

The error is the same as last time. I have tried lots of methods and still can't solve this problem, which leaves me confused. Please help me.

AmoseKang commented 6 years ago

First, you need to set the CoreNLP path directly in my code; you should find a FIXME in the ZhTokenizer class. Second, as an old issue reported, pexpect 4.4 may have unwanted behavior, so make sure you have the latest build. If you still hit the problem, try printing the actual command directly and see whether it works.
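A minimal sketch of that last debugging step: rebuild the launch command and print it, then paste the printed line into a shell and watch whether the "NLP>" prompt ever appears. The classpath and annotator list below are assumptions taken from the logs in this thread, not from the tokenizer source.

```python
# Hedged sketch: reconstruct the CoreNLP launch command so it can be tested
# directly in a shell, outside of pexpect. Paths are assumptions from the
# logs above; adjust them to your install.
classpath = '/home/hugang/DrQA/data/corenlp/*'
annotators = 'tokenize,ssplit,pos,lemma,ner'
cmd = ('java -mx3g -cp "%s" edu.stanford.nlp.pipeline.StanfordCoreNLP '
       '-annotators %s -props StanfordCoreNLP-chinese.properties'
       % (classpath, annotators))
print(cmd)  # paste the printed line into a terminal and wait for "NLP>"
```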

CCNUdhj commented 5 years ago

Hello, I'm a colleague from CCNU. I've run into this problem too and it has bothered me for a long time. How did you solve it? It's so frustrating.

CCNUdhj commented 5 years ago

> (quotes hoogang's comment above in full)

Hello, I'm a colleague from CCNU across the street. I've run into this problem too and it has bothered me for a long time. How did you solve it? It's so frustrating.

hoogang commented 5 years ago

> Hello, I'm a colleague from CCNU. I've run into this problem too... how did you solve it?

I solved the bug as follows:

python scripts/reader/preprocess.py data/datasets data/datasets --split webqa-test --tokenizer corenlp --workers 1

i.e., add the worker parameter (--workers 1).

But I have hit another problem: following DrQA, I used the processed Chinese Wikipedia data to build the TF-IDF model data.

hugang@server-white:~/DrQA$ python ./scripts/retriever/build_tfidf.py data/wikipedia/wiki_zhs.db  data/wikipedia --ngram 4 --hash-size 2 --tokenizer corenlp
12/27/2018 09:07:12 PM: [ Counting words... ]
12/27/2018 09:07:14 PM: [ Mapping... ]
12/27/2018 09:07:14 PM: [ -------------------------Batch 1/11------------------------- ]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
Process ForkPoolWorker-5:
Traceback (most recent call last):
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 111, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/pty_spawn.py", line 482, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.
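Why `--workers 1` can help: the tracebacks above show that the tokenizer is created inside a `multiprocessing.Pool` initializer, so N workers launch N CoreNLP JVMs at the same time. Each JVM loads large Chinese models; several starting at once can plausibly exceed pexpect's 60-second wait for the "NLP>" prompt. A minimal sketch of that structure (a plain object stands in for the tokenizer so this runs without Java):

```python
# Sketch mirroring scripts/reader/preprocess.py: one tokenizer per pool
# worker, created in the initializer. With processes=1 only one "JVM"
# starts, so initialization no longer races the pexpect timeout.
from multiprocessing import Pool

TOK = None  # per-worker global, as in preprocess.py

def init():
    global TOK
    # Real script: TOK = CoreNLPTokenizer()  -- one CoreNLP JVM per worker.
    TOK = object()  # stand-in so the sketch runs without Java

def tokenize(text):
    # Real script: return TOK.tokenize(text).words()
    return text.split()  # stand-in segmentation

if __name__ == '__main__':
    with Pool(processes=1, initializer=init) as pool:
        print(pool.map(tokenize, ['hello world', '湖北省 武汉市']))
```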
CCNUdhj commented 5 years ago

> (quotes hoogang's reply above in full)

Thanks for the answer. Could you leave some contact information? I've been working on this recently too, and we could discuss problems together.

hoogang commented 5 years ago

> (quoting the request above)

Contact me on QQ: 349359883

hoogang commented 5 years ago

> (quotes my original report above in full)

Replying to the question above: this bug is solved. On Linux, verified through repeated testing: upgrade to Java 11 and pexpect 4.6.0, everything at the latest versions. On macOS you don't need to worry.

AmoseKang commented 5 years ago

There could be many causes; analyze them against the specific error. First, I have received reports that certain versions of pexpect are problematic, so consider switching to a different pexpect version. Second, these errors may come from the tokenization command itself failing: try printing the exact command being executed and check whether it runs on the command line. Finally, I suggest moving to the official Facebook repository; my code has not been maintained for a long time, though the datasets and so on can still serve as a reference.

hoogang commented 5 years ago

> (quoting the advice above)

Thanks, everything was solved today.

lvs071103 commented 4 years ago

I ran into the same problem: the damned script added background-color escape strings to its output, so pexpect could not match the prompt.
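A sketch of one way around that, assuming the offending output is standard CSI color codes: strip ANSI escape sequences from the child's output before matching the prompt.

```python
import re

# CSI sequences such as "\x1b[40m" (background color) or "\x1b[0m" (reset).
CSI_RE = re.compile(r'\x1b\[[0-9;]*[A-Za-z]')

def strip_ansi(raw: str) -> str:
    """Remove terminal escape sequences so plain-text prompt matching works."""
    return CSI_RE.sub('', raw)

# If color codes are interleaved with the prompt characters, an exact match
# on "NLP>" never fires and pexpect times out:
colored = '\x1b[40mN\x1b[47mL\x1b[40mP\x1b[47m>\x1b[0m'
assert 'NLP>' not in colored
assert 'NLP>' in strip_ansi(colored)
```

Alternatively, spawning the child with `env={'TERM': 'dumb'}` (a standard `pexpect.spawn` keyword argument) usually stops programs from emitting color codes in the first place.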