asu-ke-web-services / search-api

Search API for documents, data, research, people, etc
MIT License

NER tagger needs to account for errors #91

Open rraub opened 8 years ago

rraub commented 8 years ago

On Travis the NER tagger is erroring out, but we don't notice because we are not looking at the errors via:

   $tagger->getErrors();

Check out the rraub-ner-tagger-error-catching branch for an explanation.

rraub commented 8 years ago

https://github.com/gios-asu/search-api/blob/rraub-ner-tagger-error-catching/src/providers/ner-tagger.php#L46-L47

iajohns1 commented 8 years ago

I put the code snippet from the rraub-ner-tagger-error-catching branch in the constructor where the tagger is initialized, and it seems to work; however, it treats every use of the tagger as enough of a reason to give an error message, even when there seems to be no actual error.

commit: https://github.com/gios-asu/search-api/commit/388a0756c02832f55f7a8d896a676731f5ea62b8

kenprice commented 8 years ago

$tagger->getErrors() gets the stderr output of the command that the PHP wrapper we're using invokes:

   java -mx300m -cp "/usr/local/bin/stanford-ner-2015-04-20/stanford-ner.jar:" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/phpnlptagtk2StK -encoding utf8

phpnlptagtk2StK is a temp file in which the plaintext to be tagged is stored.

If I force this command to only output stderr in the terminal, I get this:

batman@epicac2:~/workspace/search-api$ java -mx300m -cp "/usr/local/bin/stanford-ner-2015-04-20/stanford-ner.jar:" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/phpnlptagtk2StK -encoding utf8 2>&1 /dev/null
CRFClassifier invoked on Tue Mar 15 15:42:02 MST 2016 with arguments:
   -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/phpnlptagtk2StK -encoding utf8 /dev/null
loadClassifier=/usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz
encoding=utf8
textFile=/tmp/phpnlptagtk2StK
=/dev/null
Loading classifier from /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.9 sec].
term/O term/O term/O 
CRFClassifier tagged 3 words in 1 documents at 43.48 words per second.

So my hunch seems to be correct. stderr is being used to output more than just errors. :P
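
To see exactly which stream each line arrives on, the two streams can be captured separately. In the transcript above, `2>&1 /dev/null` merges stderr into stdout and hands `/dev/null` to the classifier as an extra argument (hence the `=/dev/null` line), so both streams appear together. A minimal sketch, assuming nothing about the wrapper's internals; `run_capturing_streams` is a hypothetical helper name, not part of the PHP Stanford NLP lib:

```php
<?php
// Hypothetical helper: run a shell command and return stdout, stderr and the
// exit code separately, so informational stderr output can be told apart
// from the tagged text on stdout.
function run_capturing_streams(string $command): array
{
    $descriptors = [
        1 => ['pipe', 'w'], // child's stdout
        2 => ['pipe', 'w'], // child's stderr
    ];
    $process = proc_open($command, $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException("Could not start: $command");
    }
    // Reading the pipes one after the other is fine for small outputs like
    // the ones in this thread; a child that writes a very large amount to
    // stderr before closing stdout could block here.
    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    $exitCode = proc_close($process);

    return ['stdout' => $stdout, 'stderr' => $stderr, 'exit' => $exitCode];
}
```

The exit code may be a more direct signal than the text itself: a JVM whose main thread dies from an uncaught exception exits with a nonzero status. Whether the wrapper exposes that status is not something this thread confirms.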

kenprice commented 8 years ago

Here's a potential workaround if you want to detect an error in the subprocess using $tagger->getErrors().

Here's a command, in the same form that the PHP wrapper (the PHP Stanford NLP lib we're using) builds, that causes an error:

batman@epicac2:~/workspace/search-api$ java -mx300m -cp "/usr/local/bin/stanford-ner-2015-04-20/stanford-ner.jar:" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile garbage-in-garbage-out -encoding utf8 2>&1 /dev/null
CRFClassifier invoked on Tue Mar 15 15:51:03 MST 2016 with arguments:
   -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile garbage-in-garbage-out -encoding utf8 /dev/null
loadClassifier=/usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz
encoding=utf8
textFile=garbage-in-garbage-out
=/dev/null
Loading classifier from /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz ... done [3.0 sec].
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: java.io.FileNotFoundException: garbage-in-garbage-out (No such file or directory)
    at edu.stanford.nlp.io.IOUtils.inputStreamFromFile(IOUtils.java:509)
    at edu.stanford.nlp.io.IOUtils.readerFromFile(IOUtils.java:550)
    at edu.stanford.nlp.objectbank.ReaderIteratorFactory$ReaderIterator.setNextObject(ReaderIteratorFactory.java:189)
    at edu.stanford.nlp.objectbank.ReaderIteratorFactory$ReaderIterator.<init>(ReaderIteratorFactory.java:161)
    at edu.stanford.nlp.objectbank.ResettableReaderIteratorFactory.iterator(ResettableReaderIteratorFactory.java:98)
    at edu.stanford.nlp.objectbank.ObjectBank$OBIterator.<init>(ObjectBank.java:414)
    at edu.stanford.nlp.objectbank.ObjectBank.iterator(ObjectBank.java:253)
    at edu.stanford.nlp.sequences.ObjectBankWrapper.iterator(ObjectBankWrapper.java:52)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifyAndWriteAnswers(AbstractSequenceClassifier.java:1160)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifyAndWriteAnswers(AbstractSequenceClassifier.java:1111)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifyAndWriteAnswers(AbstractSequenceClassifier.java:1071)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.classifyAndWriteAnswers(AbstractSequenceClassifier.java:1052)
    at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3056)
Caused by: java.io.FileNotFoundException: garbage-in-garbage-out (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at edu.stanford.nlp.io.IOUtils.inputStreamFromFile(IOUtils.java:503)
    ... 12 more

Sorry for the wall of text. The error above is unlikely for us, but this could happen:

batman@epicac2:~/workspace/search-api$ java -mx300m -cp "/usr/local/bin/stanford-ner-2015-04-20/stanford-ner.jar:" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist -textFile /tmp/zphpnlptagtk2StK -encoding utf8 2>&1 /dev/null
CRFClassifier invoked on Tue Mar 15 15:51:50 MST 2016 with arguments:
   -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist -textFile /tmp/zphpnlptagtk2StK -encoding utf8 /dev/null
loadClassifier=/usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist
encoding=utf8
textFile=/tmp/zphpnlptagtk2StK
=/dev/null
Loading classifier from /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist ... Error deserializing /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist
Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist (No such file or directory)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1572)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1523)
    at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2987)
Caused by: java.io.FileNotFoundException: /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1556)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1569)
    ... 2 more

The error above could be triggered by misconfiguration of our app. For example: config.conf points to a bad directory for our NER jar files/classifiers.
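
A misconfiguration like this could also be caught before the tagger ever runs. A minimal sketch of such a preflight check, assuming the jar and classifier paths have already been read from the config; `find_ner_path_problems` is a hypothetical name and not something that exists in the repo:

```php
<?php
// Hypothetical preflight check: return a list of human-readable problems.
// An empty list means both files exist and are readable.
function find_ner_path_problems(string $jarPath, string $classifierPath): array
{
    $problems = [];
    $paths = ['jar' => $jarPath, 'classifier' => $classifierPath];
    foreach ($paths as $label => $path) {
        if (!is_file($path)) {
            $problems[] = "NER $label not found: $path";
        } elseif (!is_readable($path)) {
            $problems[] = "NER $label is not readable: $path";
        }
    }
    return $problems;
}
```

This only covers missing or unreadable files; a classifier file that exists but is corrupt would still fail inside Java.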

Anyways, I was thinking we just look for the presence of the string "Exception" in $tagger->getErrors(). This should be sufficient to detect an unrecoverable error. You can go further and extract the rest of the text so we get something like:

Exception in thread "main" java.lang.RuntimeException: java.io.FileNotFoundException: /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist (No such file or directory)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1572)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1523)
    at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2987)
Caused by: java.io.FileNotFoundException: /usr/local/bin/stanford-ner-2015-04-20/classifiers/this-classifier-does-not-exist (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1556)
    at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1569)
    ... 2 more

It would be pretty simple to do.
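
A minimal sketch of that check, assuming $tagger->getErrors() returns the subprocess's stderr as a single string; `extract_java_exception` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: return the stack trace portion of the tagger's
// stderr, or null when no line mentions "Exception".
function extract_java_exception(string $stderr): ?string
{
    // Find the first line containing "Exception" and note where it starts.
    $found = preg_match('/^.*Exception.*$/m', $stderr, $match, PREG_OFFSET_CAPTURE);
    if ($found !== 1) {
        return null;
    }
    // Keep everything from the start of that line to the end of the output.
    return rtrim(substr($stderr, $match[0][1]));
}
```

Usage would look something like:

```php
$trace = extract_java_exception($tagger->getErrors());
if ($trace !== null) {
    throw new RuntimeException("NER tagger failed:\n" . $trace);
}
```

The normal "Loading classifier ... done" and "CRFClassifier tagged ..." lines pass through without triggering anything, since neither contains "Exception". A path or file name containing that word would produce a false positive.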

(Edge case warning! Would not work in Chinese)

Edit: Pinging @iajohns1