Closed by elbamos 7 years ago
Running
pip install spacy && python -m spacy download en
will also satisfy the requirements for the spaCy tokenizer, which you can use instead. It's a bit faster than the regexp one.
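If you want a script to choose a fallback automatically, here is a minimal sketch. `fallback_tokenizer` is a hypothetical helper, not part of DrQA, and it assumes `spacy` is accepted as a value for `--tokenizer` in the same way `regexp` is.

```python
import importlib.util


def fallback_tokenizer(spacy_installed=None):
    """Pick a tokenizer name to use when CoreNLP is unavailable.

    Hypothetical helper for illustration only. It checks that the spaCy
    package is importable; it does NOT verify that the English model
    has been downloaded.
    """
    if spacy_installed is None:
        spacy_installed = importlib.util.find_spec("spacy") is not None
    return "spacy" if spacy_installed else "regexp"


if __name__ == "__main__":
    # The result is meant to be passed along as: --tokenizer <name>
    print("--tokenizer", fallback_tokenizer())
```

Passing `spacy_installed` explicitly makes the choice easy to override or test without touching the environment.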
Yes, that did it, and spacy's a better tokenizer anyway, thanks.
Very impressive results!
Have you experimented much beyond the training datasets? I wonder how representative they are of the distribution of questions in natural language.
We evaluated on the four datasets reported in the paper: SQuAD, CuratedTREC, WebQuestions, and WikiMovies. Certainly each of these datasets has its own peculiarities and domains, though together I think they cover a fairly broad spectrum of (factoid) natural language questions.
Still, the distribution of possible questions is very large and I'm sure there are parts of it we are not hitting. Multitasking on more domains will likely help (indeed, multitasking on the four reported datasets already helps significantly).
If, like me, you found that the classpath wasn't set because zsh tried to expand the wildcard * into filenames, escape the wildcard character:
export CLASSPATH=$CLASSPATH:/path/to/stanford-corenlp-full-2016-10-31/\*
Needless to say, this fixed the timeouts ⌛️
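If you generate that export line from a script, quoting the entry has the same effect as the backslash. A minimal sketch, where `export_line` is a hypothetical helper and the directory is the placeholder path from above:

```python
import shlex


def export_line(corenlp_dir):
    """Build an export line whose trailing '*' the shell will not expand.

    Hypothetical helper for illustration. shlex.quote wraps the entry in
    single quotes because '*' is a shell metacharacter, so zsh and bash
    both pass the literal wildcard through to the JVM.
    """
    entry = corenlp_dir.rstrip("/") + "/*"
    return "export CLASSPATH=$CLASSPATH:" + shlex.quote(entry)


if __name__ == "__main__":
    print(export_line("/path/to/stanford-corenlp-full-2016-10-31"))
    # prints: export CLASSPATH=$CLASSPATH:'/path/to/stanford-corenlp-full-2016-10-31/*'
```

The JVM itself interprets the trailing `*` as "every jar in this directory", which is why the shell must leave it alone.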
Hi @elbamos, this looks like you are having an issue with the CoreNLPTokenizer.
- Do you have the CoreNLP jars in your CLASSPATH? (echo $CLASSPATH)
- Are you using Java 8? (java -version)
- Does the following work (should tokenize immediately)?
from drqa.tokenizers import CoreNLPTokenizer
tok = CoreNLPTokenizer()
tok.tokenize('hello world')
In the meantime, you can get around this by using the flag --tokenizer regexp. This won't guarantee the exact same performance numbers as reported in the README, but should work just fine.
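The first two checks can also be scripted. This is a rough sketch using only the standard library; `find_corenlp_jars` is a hypothetical helper, not part of DrQA, and it assumes the CoreNLP jar filenames contain "stanford-corenlp".

```python
import glob
import os
import shutil


def find_corenlp_jars(classpath):
    """Return CoreNLP jar files reachable from a CLASSPATH string.

    Hypothetical helper for illustration. It mimics how the JVM treats
    an entry ending in '*': every .jar file in that directory counts.
    """
    jars = []
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        if entry.endswith("*"):
            # 'dir/*' means "all jars in dir" to the JVM.
            directory = glob.escape(os.path.dirname(entry))
            candidates = glob.glob(os.path.join(directory, "*.jar"))
        elif entry.endswith(".jar") and os.path.isfile(entry):
            candidates = [entry]
        else:
            candidates = []
        # Assumption: CoreNLP jars have "stanford-corenlp" in their names.
        jars.extend(
            j for j in candidates if "stanford-corenlp" in os.path.basename(j)
        )
    return sorted(jars)


if __name__ == "__main__":
    print("java on PATH:", shutil.which("java"))
    print("CoreNLP jars:", find_corenlp_jars(os.environ.get("CLASSPATH", "")))
```

An empty jar list usually means the CLASSPATH was never set, or that the shell expanded or rejected the wildcard before the variable was assigned.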
Thank you for the guide. By the way, how did you know that the problem came from the Java classpath or version? I'm really curious about it.
I'm getting timeouts when I try to run the pipeline:
Is this a configuration issue? Any suggestions?