facebookresearch / DrQA

Reading Wikipedia to Answer Open-Domain Questions

Can't generate datasets for distant supervision #94

Closed ironflood closed 6 years ago

ironflood commented 6 years ago

Hello,

I can't manage to generate the distant supervision datasets, no matter which tokenizer I use. With '--tokenizer spacy' the script never gets past line 197 of generate.py: q_tokens = workers.map(tokenize_text, questions)

02/15/2018 04:34:31 PM: [ Processing 3778 question answer pairs... ]
02/15/2018 04:34:31 PM: [ Will save to data/ds/WebQuestions-train.dstrain and data/ds/WebQuestions-train.dsdev ]
02/15/2018 04:34:31 PM: [ Loading data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
02/15/2018 04:39:45 PM: [ Ranking documents (top 5 per question)... ]
02/15/2018 04:42:42 PM: [ Pre-tokenizing questions... ]

When using another tokenizer, like '--tokenizer simple', I get the following error:

02/16/2018 12:17:18 PM: [ Processing 3778 question answer pairs... ]
02/16/2018 12:17:18 PM: [ Will save to data/ds/WebQuestions-train.dstrain and data/ds/WebQuestions-train.dsdev ]
02/16/2018 12:17:18 PM: [ Loading data/wikipedia/docs-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz ]
02/16/2018 12:22:33 PM: [ Ranking documents (top 5 per question)... ]
02/16/2018 12:25:26 PM: [ Pre-tokenizing questions... ]
02/16/2018 12:25:26 PM: [ SimpleTokenizer only tokenizes! Skipping annotators: {'ner'} ]
(the line above is repeated once per worker)
02/16/2018 12:25:36 PM: [ Searching documents... ]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "scripts/distant/generate.py", line 170, in search_docs
    found = find_answer(paragraph, q_tokens, answer, opts)
  File "scripts/distant/generate.py", line 109, in find_answer
    for ne in q_tokens.entity_groups():
TypeError: 'NoneType' object is not iterable
"""

Any idea of what might be happening would be greatly appreciated :)
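The traceback suggests that entity_groups() can return None when the tokenizer skipped the 'ner' annotator, and find_answer then iterates over it. A guard like the following (a hypothetical sketch, not the actual generate.py code) would avoid the crash:

```python
def entity_groups_or_empty(q_tokens):
    """Return the tokens' NER entity groups, or an empty list when the
    tokenizer skipped the 'ner' annotator (e.g. SimpleTokenizer)."""
    groups = q_tokens.entity_groups()
    return groups if groups is not None else []

# find_answer could then iterate safely:
# for ne in entity_groups_or_empty(q_tokens): ...
```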

ajfisch commented 6 years ago

Does htop show activity, or is something stuck?

ironflood commented 6 years ago

When it reaches the "Pre-tokenizing questions" step, python3 CPU usage drops from ~270% to 0.1%. So yes, it seems stuck.
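To check whether the hang is in the tokenizer itself rather than the multiprocessing pool, one can first run the mapped function on a single question in the parent process (a generic sketch; tokenize_text below is a stand-in, not the real DrQA tokenizer call):

```python
import multiprocessing

def tokenize_text(text):
    # Stand-in for the real tokenizer call in generate.py.
    return text.split()

if __name__ == '__main__':
    sample = ["who wrote the iliad"]
    # 1) Call directly in the parent process; if this hangs too,
    #    the tokenizer (not the pool) is at fault.
    direct = [tokenize_text(q) for q in sample]
    # 2) Call through a worker pool, as generate.py does.
    with multiprocessing.Pool(2) as workers:
        pooled = workers.map(tokenize_text, sample)
    assert direct == pooled
```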

wasiahmad commented 6 years ago

Is it possible to share the distantly supervised data? I have tried several times but failed to generate the dataset: I either hit exceptions or the process runs forever. It would be a big help if you could share the data directly.

ajfisch commented 6 years ago

What version of CoreNLP are you using? The latest versions appear to load large NER lists, which causes some errors. Using CoreNLP 3.8.0 (the one specified in install_corenlp.sh) works for me.

wasiahmad commented 6 years ago

Yes, I tried the version from install_corenlp.sh and then the most recent version; neither worked. I installed CoreNLP and verified that it works on its own, but whenever I run the distant supervision generate.py script, it raises the same exceptions as mentioned here. Isn't it possible to host this distantly supervised data somewhere? It would be very convenient for us.

ajfisch commented 6 years ago

What version of pexpect are you using? 4.2.1?

wasiahmad commented 6 years ago

I am not sure about the version. Here is the log:

Traceback (most recent call last):
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/site-packages/pexpect/expect.py", line 96, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/site-packages/pexpect/pty_spawn.py", line 466, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/multiprocessing/pool.py", line 103, in worker
    initializer(*initargs)
  File "generate.py", line 48, in init
    PROCESS_TOK = tokenizer_class(**tokenizer_opts)
  File "/net/if5/wua4nw/open_domain_qa/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 33, in __init__
    self._launch()
  File "/net/if5/wua4nw/open_domain_qa/DrQA/drqa/tokenizers/corenlp_tokenizer.py", line 61, in _launch
    self.corenlp.expect_exact('NLP>', searchwindowsize=100)
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/site-packages/pexpect/spawnbase.py", line 404, in expect_exact
    return exp.expect_loop(timeout)
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/site-packages/pexpect/expect.py", line 104, in expect_loop
    return self.timeout(e)
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/site-packages/pexpect/expect.py", line 68, in timeout
    raise TIMEOUT(msg)
pexpect.exceptions.TIMEOUT: Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f863384c908>
command: /bin/bash
args: ['/bin/bash']
buffer (last 100 chars): b' class edu.stanford.nlp.pipeline.StanfordCoreNLP\r\nwua4nw@nlp:~/open_domain_qa/DrQA/scripts/distant$ '
before (last 100 chars): b' class edu.stanford.nlp.pipeline.StanfordCoreNLP\r\nwua4nw@nlp:~/open_domain_qa/DrQA/scripts/distant$ '
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 10807
child_fd: 18
closed: False
timeout: 60
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 100000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_string:
    0: "b'NLP>'"
Process ForkPoolWorker-10:
Traceback (most recent call last):
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/site-packages/pexpect/expect.py", line 96, in expect_loop
    incoming = spawn.read_nonblocking(spawn.maxread, timeout)
  File "/if5/wua4nw/anaconda3.6/lib/python3.6/site-packages/pexpect/pty_spawn.py", line 466, in read_nonblocking
    raise TIMEOUT('Timeout exceeded.')
pexpect.exceptions.TIMEOUT: Timeout exceeded.

Such messages keep coming one after another.

ajfisch commented 6 years ago

Please check it by running:

import pexpect
pexpect.__version__

I am generating the files on my own for you, but I would like to try to resolve this error.

wasiahmad commented 6 years ago

Thanks, I checked. It is 4.3.1.

ajfisch commented 6 years ago

Please downgrade to 4.2.1 by running pip install pexpect==4.2.1 and let me know if the error persists.

wasiahmad commented 6 years ago

I tried to install version 4.2.1 but am getting this message:

Cannot uninstall 'pexpect'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

I looked for a solution on the web but couldn't find anything reasonable.

ajfisch commented 6 years ago

In the meantime, I am hosting a generated dataset at http://people.csail.mit.edu/fisch/assets/data/drqa/distant.tar.gz.
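As for the pip error, a common workaround for distutils-installed packages is to install over the existing copy with pip's --ignore-installed flag (or to use a fresh virtualenv):

```shell
pip install --ignore-installed pexpect==4.2.1
```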