facebookresearch / DrQA

Reading Wikipedia to Answer Open-Domain Questions

I meet some problems in preprocess.py -> preprocess_dataset and train.py #42

Closed oooozhizhi closed 6 years ago

oooozhizhi commented 7 years ago

I want to train the DocReader. I followed the instructions in 'DrQA/scripts/reader/', but when I run preprocess.py this happens:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 97, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/usr/lib/python3/dist-packages/pexpect/pty_spawn.py", line 452, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "preprocess.py", line 30, in init
    TOK = tokenizer_class(**options)
  File "/home/dzj/facebook_mc/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in init
    self._launch()
  File "/home/dzj/facebook_mc/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/usr/lib/python3/dist-packages/pexpect/spawnbase.py", line 384, in expect_exact
    return exp.expect_loop(timeout)
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 104, in expect_loop
    return self.timeout(e)
  File "/usr/lib/python3/dist-packages/pexpect/expect.py", line 68, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7fd4215bb588>
command: /bin/bash
args: ['/bin/bash']
searcher: None
buffer (last 100 chars): b'plearning2: ~/facebook_mc/DrQA/scripts/reader\x07root@deeplearning2:~/facebook_mc/DrQA/scripts/reader# '
before (last 100 chars): b'plearning2: ~/facebook_mc/DrQA/scripts/reader\x07root@deeplearning2:~/facebook_mc/DrQA/scripts/reader# '
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 76522
child_fd: 10
closed: False
timeout: 10
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1

However, when I run the following in the DrQA or DrQA/scripts/reader/ directory, this bug does not happen:

from drqa import tokenizers
tok = tokenizers.CoreNLPTokenizer()
tok.tokenize(text).words()

So I rewrote a (not so good) version of preprocess.py, changing these two functions:

def tokenizeSingle(aWorker, text):
    """Tokenize the input text with the given tokenizer instance."""
    tokens = aWorker.tokenize(text)
    return {
        'words': tokens.words(),
        'offsets': tokens.offsets(),
        'pos': tokens.pos(),
        'lemma': tokens.lemmas(),
        'ner': tokens.entities(),
    }


def process_dataset_test(data, tokenizer, workers=None):
    """Iterate over the dataset, tokenizing serially in a single process.

    (The tokenizer/workers arguments are kept for signature compatibility
    but unused: a single CoreNLPTokenizer is created here instead.)
    """
    aWorker = tokenizers.CoreNLPTokenizer()
    q_tokens = [tokenizeSingle(aWorker, q) for q in data['questions']]
    c_tokens = [tokenizeSingle(aWorker, c) for c in data['contexts']]
    for idx in range(len(data['qids'])):
        question = q_tokens[idx]['words']
        qlemma = q_tokens[idx]['lemma']
        # Look up the tokenized context paired with this question.
        context = c_tokens[data['qid2cid'][idx]]
        document = context['words']
        offsets = context['offsets']
        lemma = context['lemma']
        pos = context['pos']
        ner = context['ner']
        ans_tokens = []
        if len(data['answers']) > 0:
            for ans in data['answers'][idx]:
                found = find_answer(offsets,
                                    ans['answer_start'],
                                    ans['answer_start'] + len(ans['text']))
                if found:
                    ans_tokens.append(found)
        yield {
            'id': data['qids'][idx],
            'question': question,
            'document': document,
            'offsets': offsets,
            'answers': ans_tokens,
            'qlemma': qlemma,
            'lemma': lemma,
            'pos': pos,
            'ner': ner,
        }

But I found that in the output, the values of 'pos' and 'ner' are null. I don't know whether I have written process_dataset() correctly.
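To make the expected per-token fields concrete, here is a minimal sketch. StubTokens is a hypothetical stand-in for the object returned by CoreNLPTokenizer().tokenize(text); only the method names (words, offsets, pos, lemmas, entities) are taken from the code above, the rest is assumption:

```python
# Sketch of the per-text record that tokenizeSingle builds. With a working
# CoreNLP setup, 'pos' and 'ner' should be non-empty tag lists, not null.

class StubTokens:
    """Hypothetical stand-in for CoreNLP tokenize() output."""
    def __init__(self, text):
        self._words = text.split()

    def words(self):
        return list(self._words)

    def offsets(self):
        # (start, end) character spans, recomputed from the split words
        out, pos = [], 0
        for w in self._words:
            out.append((pos, pos + len(w)))
            pos += len(w) + 1
        return out

    def pos(self):
        return ['UNK'] * len(self._words)   # a real tagger returns e.g. 'NNP'

    def lemmas(self):
        return [w.lower() for w in self._words]

    def entities(self):
        return ['O'] * len(self._words)     # a real NER returns e.g. 'PERSON'

tokens = StubTokens("Where is Paris ?")
record = {
    'words': tokens.words(),
    'offsets': tokens.offsets(),
    'pos': tokens.pos(),
    'lemma': tokens.lemmas(),
    'ner': tokens.entities(),
}
print(record['words'])    # ['Where', 'is', 'Paris', '?']
print(record['offsets'])  # [(0, 5), (6, 8), (9, 14), (15, 16)]
```

If 'pos' and 'ner' come back empty with the real tokenizer, the tokenization itself worked but the tagging annotators did not run.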

Then when I run train.py, I get this error:

09/30/2017 03:15:53 PM: [ Starting training... ]
THCudaCheck FAIL file=/b/wheel/pytorch-src/torch/lib/THC/THCCachingHostAllocator.cpp line=258 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "train.py", line 562, in <module>
    main(args)
  File "train.py", line 500, in main
    train(args, train_loader, model, stats)
  File "train.py", line 218, in train
    train_loss.update(model.update(ex))
  File "/home/dzj/facebook_mc/DrQA/drqa/reader/model.py", line 218, in update
    score_s, score_e = self.network(inputs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dzj/facebook_mc/DrQA/drqa/reader/rnn_reader.py", line 110, in forward
    training=self.training)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 366, in dropout
    return _functions.dropout.Dropout(p, training, inplace)(input)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/functions/dropout.py", line 29, in forward
    self.noise.bernoulli(1 - self.p).div_(1 - self.p)
RuntimeError: Creating MTGP constants failed. at /b/wheel/pytorch-src/torch/lib/THC/THCTensorRandom.cu:33

1. Can you give me a sample of the output of preprocess.py?
2. Why does this TIMEOUT('Timeout exceeded.') happen? My manual test passed...
3. What is the last bug?

ajfisch commented 7 years ago

The TIMEOUT is because there is some problem with your CoreNLP setup. Did you download all the model jars? If it works with the default but not with the ner, pos, and lemma tags, then you likely only downloaded the tokenizer jars.
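One quick way to sanity-check this is sketched below (not part of DrQA; the jar name patterns are assumptions based on the standard CoreNLP distribution, e.g. stanford-corenlp-3.8.0.jar plus stanford-corenlp-3.8.0-models.jar):

```python
# Given the jar file names on the CoreNLP classpath, check that a models
# jar is present in addition to the main code jar. The pos/lemma/ner
# annotators live in the models jar; without it, only tokenization works.

import re

def check_corenlp_jars(jar_names):
    """Return a list of human-readable problems; empty if all looks fine."""
    problems = []
    has_code = any(re.match(r'stanford-corenlp-[\d.]+\.jar$', j)
                   for j in jar_names)
    has_models = any(re.match(r'stanford-corenlp-[\d.]+-models\.jar$', j)
                     for j in jar_names)
    if not has_code:
        problems.append('main stanford-corenlp jar not found')
    if not has_models:
        problems.append('models jar not found -- pos/lemma/ner will fail')
    return problems

print(check_corenlp_jars(['stanford-corenlp-3.8.0.jar']))
# -> ['models jar not found -- pos/lemma/ner will fail']
```

You could feed it os.listdir() of your data/corenlp directory to see at a glance whether the models jar was downloaded.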

Does

java -cp "/Users/afisch/github/DrQA/data/corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner

work? (obviously change the classpath).

ajfisch commented 6 years ago

Closing due to lack of response. Feel free to reopen if you still have problems!

haozheji commented 6 years ago

Running the above command gives:

Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/stanford/nlp/pipeline/StanfordCoreNLP : Unsupported major.minor version 52.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:803)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:442)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:64)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:354)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:348)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:347)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:312)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)

Can you give an example of setting the classpath?

Depro20 commented 6 years ago

You might want to install Java 8 and try again for the timeout problem. I faced issues while running this from the terminal:

from drqa import tokenizers
tok = tokenizers.CoreNLPTokenizer()

This worked fine after I installed Java 8.
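A small sketch for checking which Java version is on the PATH (not part of DrQA; the parsing is a best-effort assumption about common 'java -version' output formats such as 'java version "1.8.0_181"' for Java 8 or 'openjdk version "9.0.4"' for Java 9):

```python
# Parse the major Java version out of the first line of 'java -version'.
# Pre-9 releases report themselves as 1.x (e.g. "1.8.0_181" means Java 8);
# later releases report the major version directly.

import re
import subprocess

def parse_java_major(version_line):
    """Extract the major Java version from a 'java -version' line."""
    m = re.search(r'version "(\d+)(?:\.(\d+))?', version_line)
    if not m:
        return None
    first = int(m.group(1))
    return int(m.group(2)) if first == 1 and m.group(2) else first

def java_major():
    """Run 'java -version' (it prints to stderr) and parse the result."""
    out = subprocess.run(['java', '-version'],
                         capture_output=True, text=True).stderr
    return parse_java_major(out.splitlines()[0])

print(parse_java_major('java version "1.8.0_181"'))   # -> 8
print(parse_java_major('openjdk version "9.0.4"'))    # -> 9
```

Calling java_major() on the machine running DrQA should report 8 for a CoreNLP 3.x setup; 7 or lower explains the UnsupportedClassVersionError above, and 9+ can trigger the JAXB error below.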

murphp15 commented 6 years ago

I am also getting a timeout on the preprocess step. When I try to run the java command you specified, I get a missing class exception, but it has nothing to do with Stanford CoreNLP. The stack trace is below. Do you know why I might be getting this issue?

$ java -cp "/Users/paul.murphy/PycharmProjects/DrQA/data/corenlp/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [3.8 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [11.2 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [4.1 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [5.2 sec].
[main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Exception in thread "main" edu.stanford.nlp.util.ReflectionLoading$ReflectionLoadingException: Error creating edu.stanford.nlp.time.TimeExpressionExtractorImpl
        at edu.stanford.nlp.util.ReflectionLoading.loadByReflection(ReflectionLoading.java:40)
        at edu.stanford.nlp.time.TimeExpressionExtractorFactory.create(TimeExpressionExtractorFactory.java:57)
        at edu.stanford.nlp.time.TimeExpressionExtractorFactory.createExtractor(TimeExpressionExtractorFactory.java:38)
        at edu.stanford.nlp.ie.regexp.NumberSequenceClassifier.<init>(NumberSequenceClassifier.java:86)
        at edu.stanford.nlp.ie.NERClassifierCombiner.<init>(NERClassifierCombiner.java:136)
        at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:91)
        at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:70)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$44(StanfordCoreNLP.java:498)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getDefaultAnnotatorPool$65(StanfordCoreNLP.java:533)
        at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:118)
        at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
        at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:146)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:447)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:150)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:146)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:133)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1344)
Caused by: edu.stanford.nlp.util.MetaClass$ClassCreationException: MetaClass couldn't create public edu.stanford.nlp.time.TimeExpressionExtractorImpl(java.lang.String,java.util.Properties) with args [sutime, {}]
        at edu.stanford.nlp.util.MetaClass$ClassFactory.createInstance(MetaClass.java:237)
        at edu.stanford.nlp.util.MetaClass.createInstance(MetaClass.java:382)
        at edu.stanford.nlp.util.ReflectionLoading.loadByReflection(ReflectionLoading.java:38)
        ... 16 more
Caused by: java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
        at edu.stanford.nlp.util.MetaClass$ClassFactory.createInstance(MetaClass.java:233)
        ... 18 more
Caused by: java.lang.NoClassDefFoundError: javax/xml/bind/JAXBException
        at de.jollyday.util.CalendarUtil.<init>(CalendarUtil.java:42)
        at de.jollyday.HolidayManager.<init>(HolidayManager.java:66)
        at de.jollyday.impl.DefaultHolidayManager.<init>(DefaultHolidayManager.java:46)
        at edu.stanford.nlp.time.JollyDayHolidays$MyXMLManager.<init>(JollyDayHolidays.java:148)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
        at java.base/java.lang.Class.newInstance(Class.java:558)
        at de.jollyday.caching.HolidayManagerValueHandler.instantiateManagerImpl(HolidayManagerValueHandler.java:60)
        at de.jollyday.caching.HolidayManagerValueHandler.createValue(HolidayManagerValueHandler.java:41)
        at de.jollyday.caching.HolidayManagerValueHandler.createValue(HolidayManagerValueHandler.java:13)
        at de.jollyday.util.Cache.get(Cache.java:51)
        at de.jollyday.HolidayManager.createManager(HolidayManager.java:168)
        at de.jollyday.HolidayManager.getInstance(HolidayManager.java:148)
        at edu.stanford.nlp.time.JollyDayHolidays.init(JollyDayHolidays.java:57)
        at edu.stanford.nlp.time.Options.<init>(Options.java:90)
        at edu.stanford.nlp.time.TimeExpressionExtractorImpl.init(TimeExpressionExtractorImpl.java:44)
        at edu.stanford.nlp.time.TimeExpressionExtractorImpl.<init>(TimeExpressionExtractorImpl.java:39)
        ... 23 more
Caused by: java.lang.ClassNotFoundException: javax.xml.bind.JAXBException
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:582)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:185)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:496)
        ... 42 more

shubham15098 commented 6 years ago

from drqa import tokenizers
tok = tokenizers.CoreNLPTokenizer()

These commands work fine in the terminal, but I am still getting the same error. Can anyone help?

masais1205 commented 5 years ago

from drqa import tokenizers
tok = tokenizers.CoreNLPTokenizer()

These commands work fine in the terminal, but I am still getting the same error. Can anyone help?

I have the same issue: "tok = tokenizers.CoreNLPTokenizer()" works, but I still cannot preprocess my own data.