CoreNLP tokenizer doesn't work, why? #5

Open hoogang opened 6 years ago

hoogang commented 6 years ago
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f35015e5828>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
before (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 12672
child_fd: 21
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"
^CProcess ForkPoolWorker-34:
Process ForkPoolWorker-33:
Process ForkPoolWorker-25:
Process ForkPoolWorker-35:
Process ForkPoolWorker-31:
hoogang commented 6 years ago

I followed the setup instructions: "Install CoreNLP with the Chinese package according to the official CoreNLP instructions; you may specify the classpath in the environment or in the file drqa\tokenizers\. Then you may download the vectors and training sets to start your work." But it doesn't work.

hoogang commented 6 years ago

I tried:

from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()

Running this script

"python scripts/reader/ data/datasets data/datasets --split SQuAD-v1.1-train --tokenizer corenlp"

works fine.

I also tested CoreNLPTokenizer on Chinese word segmentation:

>>> from drqa.tokenizers import CoreNLPTokenizer
>>> tok = CoreNLPTokenizer()
[init tokenizer done]
>>> tok.tokenize('hello world 湖北省武汉市公共交通系统').words()
['hello', 'world', '湖北省', '武汉市', '公共', '交通', '系统']

Running CoreNLP from the command line also works, as shown below:

hugang@server-white:~$ java   -mx3g  -cp    "/home/hugang/DrQA/data/corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP    -annotators tokenize,ssplit,pos,lemma,ner -props
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary -   edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
[main] INFO - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [12.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [2.9 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [3.8 sec].

Entering interactive shell. Type q RETURN or EOF to quit.

NLP> 湖北省武安市 今天天气很不错 可以出去郊游
Sentence #1 (9 tokens):
湖北省武安市 今天天气很不错 可以出去郊游
[Text=湖北省 CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NR Lemma=湖北省 NamedEntityTag=GPE]
[Text=武安市 CharacterOffsetBegin=3 CharacterOffsetEnd=6 PartOfSpeech=NR Lemma=武安市 NamedEntityTag=GPE]
[Text=今天 CharacterOffsetBegin=7 CharacterOffsetEnd=9 PartOfSpeech=NT Lemma=今天 NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-XX-XX]
[Text=天气 CharacterOffsetBegin=9 CharacterOffsetEnd=11 PartOfSpeech=NN Lemma=天气 NamedEntityTag=O]
[Text=很 CharacterOffsetBegin=11 CharacterOffsetEnd=12 PartOfSpeech=AD Lemma=很 NamedEntityTag=O]
[Text=不错 CharacterOffsetBegin=12 CharacterOffsetEnd=14 PartOfSpeech=VA Lemma=不错 NamedEntityTag=O]
[Text=可以 CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=VV Lemma=可以 NamedEntityTag=O]
[Text=出去 CharacterOffsetBegin=17 CharacterOffsetEnd=19 PartOfSpeech=VV Lemma=出去 NamedEntityTag=O]
[Text=郊游 CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=VV Lemma=郊游 NamedEntityTag=O]

But when I run this script

"python scripts/reader/ data/datasets data/datasets --split webqa-test --tokenizer corenlp"

("webqa-test" is a test set for Chinese reading comprehension)

the error is the same as last time. I have tried many approaches and still can't solve this problem, which confuses me. Please help me.

AmoseKang commented 6 years ago

First, you need to change the CoreNLP path directly in my code; you should find a FIXME in the ZhTokenizer class. Second, as reported in an old issue, pexpect version 4.4 may behave unexpectedly, so make sure you have the latest build. If you still hit the problem, try printing the actual command and running it directly to see whether it works.
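One way to act on the last suggestion is to build the command as a list, print it in a copy-pasteable form, and run it by hand in a terminal. The sketch below is only an illustration: `build_corenlp_command` is a hypothetical helper, not part of DrQA, and the classpath, memory setting, and annotator list are taken from the command shown earlier in this thread. The `-props` argument from that command is left out because its value is not shown here.

```python
import shlex


def build_corenlp_command(classpath, annotators, memory="3g"):
    """Return the java argv for an interactive CoreNLP shell (hypothetical helper)."""
    return [
        "java",
        f"-mx{memory}",
        "-cp", classpath,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", ",".join(annotators),
    ]


cmd = build_corenlp_command(
    "/home/hugang/DrQA/data/corenlp/*",
    ["tokenize", "ssplit", "pos", "lemma", "ner"],
)
# shlex.join quotes the classpath so the shell does not expand the '*'.
print(shlex.join(cmd))
```

If the printed command reaches the `NLP>` prompt when pasted into a terminal but the Python run still times out, the problem is more likely in how the output is read than in the CoreNLP installation itself.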

CCNUdhj commented 5 years ago

The error is the same as last time. I have tried many approaches and still can't solve this problem, which confuses me. Please help me.


hoogang commented 5 years ago

"Hello, I'm a peer from 华师. I ran into this problem too and it has troubled me for a long time. How did you solve it? It's really frustrating."

I solved the bug as follows:

python scripts/reader/ data/datasets data/datasets --split webqa-test --tokenizer corenlp --workers 1

That is, add the thread parameter (--workers 1).

But I ran into another problem. Following DrQA, I used the processed Chinese Wikipedia data to build the TF-IDF model data:

hugang@server-white:~/DrQA$ python ./scripts/retriever/ data/wikipedia/wiki_zhs.db  data/wikipedia --ngram 4 --hash-size 2 --tokenizer corenlp
12/27/2018 09:07:12 PM: [ Counting words... ]
12/27/2018 09:07:14 PM: [ Mapping... ]
12/27/2018 09:07:14 PM: [ -------------------------Batch 1/11------------------------- ]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
[init tokenizer done]
Process ForkPoolWorker-5:
Traceback (most recent call last):
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/", line 111, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/", line 482, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.
hoogang commented 5 years ago

Contact: QQ 349359883

hoogang commented 5 years ago

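"Hello, I'm a peer from 华师 across the way. I ran into this problem too and it has troubled me for a long time. How did you solve it? It's really frustrating."

This bug is solved. On Linux, after repeated verification, the fix is to upgrade to Java 11 and pexpect 4.6.0, so that everything is on the latest version. On macOS there is nothing to worry about.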
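To compare a local setup against the versions mentioned above, something like the following could be used. It is a minimal sketch, not part of DrQA: `parse_java_major` and `installed_pexpect_version` are hypothetical helper names, and the parsing assumes the usual `java -version` banner format (for example `openjdk version "11.0.2"` or `java version "1.8.0_191"`).

```python
import re
import shutil
import subprocess
from importlib import metadata


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as "1.8", "1.7", ...
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major


def installed_pexpect_version():
    """Return the installed pexpect version string, or None if it is missing."""
    try:
        return metadata.version("pexpect")
    except metadata.PackageNotFoundError:
        return None


if shutil.which("java"):
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("Java major version:", parse_java_major(result.stderr))
else:
    print("java not found on PATH")
print("pexpect version:", installed_pexpect_version())
```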

AmoseKang commented 5 years ago


hoogang commented 5 years ago


Thanks, everything got solved today.

lvs071103 commented 4 years ago

I ran into the same problem. The damn script adds background-color strings to its output, which prevents pexpect from matching.
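The thread does not show how this was fixed, but the failure mode can be illustrated. The sketch below assumes the color strings are ANSI escape sequences and uses a made-up output string in which a color reset lands inside the `NLP>` prompt; `strip_ansi` is a hypothetical helper, and plain `str` is used instead of the bytes that pexpect reads, for simplicity.

```python
import re

# Matches ANSI CSI escape sequences such as "\x1b[44m" (set background)
# or "\x1b[0m" (reset attributes).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text):
    """Remove ANSI CSI escape sequences from text."""
    return ANSI_CSI.sub("", text)


# Hypothetical output in which a color reset lands inside the prompt.
raw = "\x1b[44mNLP\x1b[0m> "
print("NLP>" in raw)              # False: the escape code splits the prompt
print("NLP>" in strip_ansi(raw))  # True: the prompt is intact after stripping
```

A literal search for `NLP>` fails on the raw output because the escape code sits between `NLP` and `>`. Turning off colored output in the shell or script that wraps the Java process would avoid the problem at its source.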