Open · xingzhoupy opened this issue 6 years ago
Have you verified that the CoreNLPTokenizer is properly setup and works independently?
Yeah, I tried: from drqa.tokenizers import CoreNLPTokenizer; tok = CoreNLPTokenizer(). The error is the same as last time.
Sounds like it's a problem with your setup then -- have you followed the instructions for setting up CoreNLP?
Also see related discussion in #61 and #42
Yeah, I set that up, and using English works great, but I want to change the language, so I downloaded the Chinese CoreNLP models and moved them into the corenlp directory.
Hm. The tokenizers were only developed/tested to work with English, so you might have to do some digging. I'd start by working directly with the CoreNLP command-line interface (java) and seeing if the errors are thrown on that end. The tokenizer in DrQA is just wrapping that command line with pexpect; unfortunately, if something crashes on the java side, pexpect will time out.
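For reference, here is a minimal sketch of that wrapping (a simplification of what drqa/tokenizers/corenlp_tokenizer.py does; the classpath is a placeholder you would point at your own corenlp directory):

# Spawn a shell, start the CoreNLP CLI in it, and wait for the prompt.
import pexpect

cmd = ' '.join(['java', '-mx2g', '-cp', '"/path/to/data/corenlp/*"',
                'edu.stanford.nlp.pipeline.StanfordCoreNLP',
                '-annotators', 'tokenize,ssplit'])
child = pexpect.spawn('/bin/bash', maxread=100000, timeout=60)
child.sendline(cmd)
# If java crashes or stalls while loading models, the 'NLP>' prompt never
# appears and this call raises pexpect.exceptions.TIMEOUT.
child.expect_exact('NLP>', searchwindowsize=100)
child.sendline('hello world')
child.expect_exact('NLP>', searchwindowsize=100)
print(child.before.decode('utf-8'))  # the raw annotation output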
Oh, thanks, I'll try it again.
Hi, I solved that problem: it was an error in my CoreNLP classpath, which I have now corrected. But when I run:

python3 generate.py /tmp/yuyide/DrQA/data/formatA/qa.txt /tmp/yuyide/DrQA/data/formatA/

I get this error:

01/15/2018 05:01:15 PM: [ Processing 36181 question answer pairs... ]
01/15/2018 05:01:15 PM: [ Will save to /tmp/yuyide/DrQA/data/formatA/qa.dstrain and /tmp/yuyide/DrQA/data/formatA/qa.dsdev ]
01/15/2018 05:01:15 PM: [ Loading /tmp/yuyide/DrQA/data/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
01/15/2018 05:01:16 PM: [ Ranking documents (top 5 per question)... ]
01/15/2018 05:01:56 PM: [ Pre-tokenizing questions... ]
01/15/2018 05:02:42 PM: [ Searching documents... ]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/innovate/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "generate.py", line 169, in search_docs
    for j, paragraph in enumerate(re.split(r'\n+', fetch_text(doc_id))):
  File "/home/innovate/anaconda3/lib/python3.6/site-packages/regex-2017.12.12-py3.6-linux-x86_64.egg/regex.py", line 319, in split
    return _compile(pattern, flags, kwargs).split(string, maxsplit, concurrent)
TypeError: expected string or buffer
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "generate.py", line 312, in

And I successfully created sample.db and the TF-IDF file, so what is the problem, and how can I solve it? Thanks.
You're doing Chinese, correct? Again, I'm not familiar with the extent of the incompatibilities. The regex is failing, and it looks like you might have one of the following problems:
See related issue in #77
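For what it's worth, that particular TypeError is what regex.split raises when its input is None rather than a string, so my guess would be that fetch_text(doc_id) found no text for the ranked doc_id (e.g. the doc db and the tfidf index were built from different documents). A hypothetical guard around the failing line in generate.py makes the failure explicit:

# Sketch of a guard for the line that crashes in search_docs;
# fetch_text stands in for the document-db lookup generate.py uses.
import regex as re

def split_paragraphs(doc_id, fetch_text):
    text = fetch_text(doc_id)
    if text is None:
        # Likely cause: the tfidf index and the doc db are out of sync.
        raise KeyError('no text stored for doc_id %r' % doc_id)
    return re.split(r'\n+', text)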
Hi, I used CoreNLP to generate a new TF-IDF file, but I don't know why the distant-supervision script now just stalls; there is no error:

[innovate@xiaoi-gy-93 distant]$ python3 generate.py /tmp/yuyide/DrQA/data/formatA/qa1.txt /tmp/yuyide/DrQA/data/formatA/
01/17/2018 10:15:00 AM: [ Processing 36181 question answer pairs... ]
01/17/2018 10:15:00 AM: [ Will save to /tmp/yuyide/DrQA/data/formatA/qa1.dstrain and /tmp/yuyide/DrQA/data/formatA/qa1.dsdev ]
01/17/2018 10:15:00 AM: [ Loading /tmp/yuyide/DrQA/data/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]

Thanks
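If it helps narrow down where it is stuck, a general Python trick (not DrQA-specific) is to register a faulthandler near the top of generate.py and then signal the stalled process to dump every thread's stack:

# Unix only. While the script appears stalled, run `kill -USR1 <pid>`
# from another shell; the current stack of every thread is printed to
# stderr, which shows exactly which call is hanging.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)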
I tried:

from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()

and ran this script:

python scripts/reader/preprocess.py data/datasets data/datasets --split SQuAD-v1.1-train --tokenizer corenlp

I also tested the CoreNLPTokenizer on Chinese word segmentation:
>>> from drqa.tokenizers import CoreNLPTokenizer
>>> tok = CoreNLPTokenizer()
[init tokenizer done]
>>> tok.tokenize('hello world 湖北省武汉市公共交通系统').words()
['hello', 'world', '湖北省', '武汉市', '公共', '交通', '系统']
The command line also works, as follows:
hugang@server-white:~$ java -mx3g -cp "/home/hugang/DrQA/data/corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -props StanfordCoreNLP-chinese.properties
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [12.3 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [2.9 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [3.8 sec].
Entering interactive shell. Type q RETURN or EOF to quit.
NLP> 湖北省武安市 今天天气很不错 可以出去郊游
Sentence #1 (9 tokens):
湖北省武安市 今天天气很不错 可以出去郊游
[Text=湖北省 CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=NR Lemma=湖北省 NamedEntityTag=GPE]
[Text=武安市 CharacterOffsetBegin=3 CharacterOffsetEnd=6 PartOfSpeech=NR Lemma=武安市 NamedEntityTag=GPE]
[Text=今天 CharacterOffsetBegin=7 CharacterOffsetEnd=9 PartOfSpeech=NT Lemma=今天 NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-XX-XX]
[Text=天气 CharacterOffsetBegin=9 CharacterOffsetEnd=11 PartOfSpeech=NN Lemma=天气 NamedEntityTag=O]
[Text=很 CharacterOffsetBegin=11 CharacterOffsetEnd=12 PartOfSpeech=AD Lemma=很 NamedEntityTag=O]
[Text=不错 CharacterOffsetBegin=12 CharacterOffsetEnd=14 PartOfSpeech=VA Lemma=不错 NamedEntityTag=O]
[Text=可以 CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=VV Lemma=可以 NamedEntityTag=O]
[Text=出去 CharacterOffsetBegin=17 CharacterOffsetEnd=19 PartOfSpeech=VV Lemma=出去 NamedEntityTag=O]
[Text=郊游 CharacterOffsetBegin=19 CharacterOffsetEnd=21 PartOfSpeech=VV Lemma=郊游 NamedEntityTag=O]
But when I run this script ("webqa-test" is a test set for Chinese reading comprehension):

python scripts/reader/preprocess.py data/datasets data/datasets --split webqa-test --tokenizer corenlp

I get:
Traceback (most recent call last):
File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/multiprocessing/pool.py", line 103, in worker
initializer(*initargs)
File "scripts/reader/preprocess.py", line 29, in init
TOK = tokenizer_class(**options)
File "/home/hugang/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 37, in __init__
self._launch()
File "/home/hugang/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 68, in _launch
self.corenlp.expect_exact('NLP>', searchwindowsize=100)
File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/spawnbase.py", line 390, in expect_exact
return exp.expect_loop(timeout)
File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 107, in expect_loop
return self.timeout(e)
File "/home/hugang/.pyenv/versions/3.6.0/lib/python3.6/site-packages/pexpect/expect.py", line 70, in timeout
raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f19c6d072b0>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
before (last 100 chars): b'er-white:~/DrQA$ [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize\r\n'
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 15458
child_fd: 21
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
0: "b'NLP>'"
The error is the same as last time. I have tried lots of methods and still can't solve this problem, which leaves me confused. Please help.
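For reference, this standalone launch test works for me outside the worker pool (same jars as in the java command above), which is what confuses me; the timeout only appears under preprocess.py:

import drqa.tokenizers
from drqa.tokenizers import CoreNLPTokenizer

# Same classpath as the working command-line invocation.
drqa.tokenizers.set_default('corenlp_classpath', '/home/hugang/DrQA/data/corenlp/*')

tok = CoreNLPTokenizer()
print(tok.tokenize('湖北省武汉市公共交通系统').words())

# Unverified suspicion: preprocess.py starts one CoreNLP JVM per worker,
# and several JVMs loading the Chinese models at once may take longer
# than the 60-second timeout used in corenlp_tokenizer.py.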
@hoogang I have the same problem as you. Have you solved it?
What should I do in order to generate long or lengthy answers?
> @hoogang I have the same problem as you. Have you solved it?

You can find the solution here: https://github.com/AmoseKang/DrQA_cn/issues/5
Same issue here, but with the DrQA setup from Golden Retriever.
I also faced the timeout issue; however, in my case it was because I hadn't set the correct CLASSPATH.
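For anyone else hitting this, a minimal sketch of setting the classpath from Python (the path is a placeholder, and note that CLASSPATH has to be set before drqa.tokenizers is imported, since the default classpath is read from the environment at import time):

import os

# Placeholder path: the directory holding the CoreNLP jars.
os.environ['CLASSPATH'] = '/path/to/DrQA/data/corenlp/*'

# Import only after CLASSPATH is set.
from drqa.tokenizers import CoreNLPTokenizer

tok = CoreNLPTokenizer()
print(tok.tokenize('hello world').words())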
Hi, I loaded this file and successfully generated the TF-IDF file, and now I want to call generate.py on my QA data, which is in formatA:

python generate.py /path/to/dataset/dir dataset /path/to/output/dir

The following error occurred:
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/innovate/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/innovate/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/innovate/anaconda3/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "generate.py", line 48, in init
    PROCESS_TOK = tokenizer_class(**tokenizer_opts)
  File "/tmp/yuyide/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/tmp/yuyide/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/home/innovate/anaconda3/lib/python3.6/site-packages/pexpect/spawnbase.py", line 390, in expect_exact
    return exp.expect_loop(timeout)
  File "/home/innovate/anaconda3/lib/python3.6/site-packages/pexpect/expect.py", line 107, in expect_loop
    return self.timeout(e)
  File "/home/innovate/anaconda3/lib/python3.6/site-packages/pexpect/expect.py", line 70, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f80a43510f0>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b'innovate@xiaoi-gy-93:/tmp/yuyide/DrQA/scripts/distant\x07[innovate@xiaoi-gy-93]'
before (last 100 chars): b'innovate@xiaoi-gy-93:/tmp/yuyide/DrQA/scripts/distant\x07[innovate@xiaoi-gy-93 distant]$ '
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 13589
child_fd: 39
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"
Process ForkPoolWorker-32:
Traceback (most recent call last):
  File "/home/innovate/anaconda3/lib/python3.6/site-packages/pexpect/expect.py", line 99, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/home/innovate/anaconda3/lib/python3.6/site-packages/pexpect/pty_spawn.py", line 462, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.
What is the problem, and how can I solve it?
Thanks!