asu-ke-web-services / search-api

Search API for documents, data, research, people, etc
MIT License

Stanford NER Tagger fails when specifying certain directories. #66

Closed kenprice closed 8 years ago

kenprice commented 8 years ago

I am currently working on making the path to the NER Tagger / classifier configurable, and I ran into an interesting problem. When a StanfordNLP\NERTagger is instantiated with ./lib/stanford-ner-2015-04-20/... as the root of the .jar and classifier files, the tests for our NER Tagger fail. With /usr/local/bin/stanford-ner-2015-04-20/..., the tests pass. I'm looking into the cause.

php-stanford-nlp spawns a child process to run the tagger. I ran the same command manually in a terminal, once for each path, and got this:

batman@epicac2:~/workspace/search-api$ java -mx300m -cp "./lib/stanford-ner-2015-04-20/stanford-ner.jar:" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ./lib/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/phpnlptagnL83bI -encoding utf8
CRFClassifier invoked on Fri Jan 29 18:52:53 MST 2016 with arguments:
   -loadClassifier ./lib/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/phpnlptagnL83bI -encoding utf8
loadClassifier=./lib/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz
encoding=utf8
textFile=/tmp/phpnlptagnL83bI
Loading classifier from ./lib/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz ... done [5.1 sec].
sample/O request/O phrase/O 
CRFClassifier tagged 3 words in 1 documents at 4.10 words per second.
batman@epicac2:~/workspace/search-api$ java -mx300m -cp "/usr/local/bin/stanford-ner-2015-04-20/stanford-ner.jar:" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/phpnlptagcZIiTV -encoding utf8
CRFClassifier invoked on Fri Jan 29 18:53:53 MST 2016 with arguments:
   -loadClassifier /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/phpnlptagcZIiTV -encoding utf8
loadClassifier=/usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz
encoding=utf8
textFile=/tmp/phpnlptagcZIiTV
Loading classifier from /usr/local/bin/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.9 sec].
sample/O request/O phrase/O 
CRFClassifier tagged 3 words in 1 documents at 46.15 words per second.

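Run manually, both commands succeed and tag the sample text identically, so both cases should work fine. The failure seems specific to how php-stanford-nlp launches the process; I suspect it may have to do with the pipes used to communicate between processes.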

kenprice commented 8 years ago

I opened this issue as my "notes", and in case anyone else wanted to chime in, but I figured it out. To make php-stanford-nlp play nice with the Stanford NER Tagger, we need to pass it absolute paths for the .jar and the classifier. Closing.
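
For anyone landing here later, a minimal sketch of the workaround in shell terms: resolve the configured path to an absolute one before it reaches the tagger. The `abs_path` helper and the `NER_DIR` variable are hypothetical names, not part of php-stanford-nlp or search-api, and the sketch assumes the relative path is meant to be resolved against the directory you launch from.

```shell
# abs_path is a hypothetical helper, not part of php-stanford-nlp.
# It returns its argument unchanged if already absolute, otherwise
# prefixes the current working directory (dropping a leading "./").
abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "$(pwd)" "${1#./}" ;;
  esac
}

# Assumption: the configured value is relative to the project root,
# and this is run from the project root.
NER_DIR="$(abs_path ./lib/stanford-ner-2015-04-20)"

# The resolved directory would then be used for both the jar and the classifier:
#   java -mx300m -cp "$NER_DIR/stanford-ner.jar:" edu.stanford.nlp.ie.crf.CRFClassifier \
#     -loadClassifier "$NER_DIR/classifiers/english.all.3class.distsim.crf.ser.gz" ...
```

On the PHP side, the equivalent would presumably be to resolve the configured path (for example with PHP's built-in realpath()) before passing it to the NERTagger constructor.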