chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Can't get NER models to work with tika-server #310

Closed (jjelosua closed 4 years ago)

jjelosua commented 4 years ago

Hi Chris,

Maybe this is a bit off-topic, but I am following the README section to get tika-server working with OpenNLP, and according to the verbose output it is not taking my classpath into account.

I have first downloaded the models:

(faro) ➜ ner git:(ner) ✗ pwd
/Users/juan/code/FARO/models/ner
(faro) ➜ ner git:(ner) ✗ ls -la
total 30360
drwxr-xr-x 5 juan staff 160 May 19 16:28 .
drwxr-xr-x 19 juan staff 608 May 19 16:28 ..
-rw-r--r-- 1 juan staff 5030307 May 19 16:25 ner-date.bin
-rw-r--r-- 1 juan staff 5297172 May 19 16:25 ner-organization.bin
-rw-r--r-- 1 juan staff 5207953 May 19 13:16 ner-person.bin

This is an echo of my tika server launch:

java -Dlog4j.configuration=file:config/tika/log4j.properties -classpath /Users/juan/code/FARO/models/ner/:/tmp/tika-server.jar org.apache.tika.server.TikaServerCli -h 0.0.0.0 &

These are the logs:

[INFO ] 2020-05-19 16:30:30,641 org.apache.tika.server.TikaServerCli - Starting Apache Tika 1.23 server
[INFO ] 2020-05-19 16:30:31,049 org.apache.cxf.endpoint.ServerImpl - Setting the server's publish address to be http://0.0.0.0:9998/
[INFO ] 2020-05-19 16:30:31,175 org.eclipse.jetty.util.log - Logging initialized @1444ms to org.eclipse.jetty.util.log.Slf4jLog
[INFO ] 2020-05-19 16:30:31,259 org.eclipse.jetty.server.Server - jetty-9.4.21.v20190926; built: 2019-09-26T16:41:09.154Z; git: 72970db61a2904371e1218a95a3bef5d79788c33; jvm 12.0.1+12
[INFO ] 2020-05-19 16:30:31,322 org.eclipse.jetty.server.AbstractConnector - Started ServerConnector@1a20270e{HTTP/1.1,[http/1.1]}{0.0.0.0:9998}
[INFO ] 2020-05-19 16:30:31,322 org.eclipse.jetty.server.Server - Started @1595ms
[WARN ] 2020-05-19 16:30:31,328 org.eclipse.jetty.server.handler.ContextHandler - Empty contextPath
[INFO ] 2020-05-19 16:30:31,343 org.eclipse.jetty.server.handler.ContextHandler - Started o.e.j.s.h.ContextHandler@222eb8aa{/,null,AVAILABLE}
[INFO ] 2020-05-19 16:30:31,344 org.apache.tika.server.TikaServerCli - Started Apache Tika server at http://0.0.0.0:9998/
[INFO ] 2020-05-19 16:30:36,319 org.apache.tika.server.resource.RecursiveMetadataResource - rmeta/text (autodetecting type)
[INFO ] 2020-05-19 16:30:36,319 org.apache.tika.server.resource.RecursiveMetadataResource - rmeta/text (autodetecting type)
[INFO ] 2020-05-19 16:30:36,409 org.apache.tika.parser.ner.NamedEntityParser - going to load, instantiate and bind the instance of org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser
[WARN ] 2020-05-19 16:30:36,413 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - Couldn't find model from org/apache/tika/parser/ner/opennlp/ner-location.bin using class loader
[INFO ] 2020-05-19 16:30:36,414 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - LOCATION NER : Available for service ? false
[WARN ] 2020-05-19 16:30:36,414 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - Couldn't find model from org/apache/tika/parser/ner/opennlp/ner-organization.bin using class loader
[INFO ] 2020-05-19 16:30:36,414 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - ORGANIZATION NER : Available for service ? false
[WARN ] 2020-05-19 16:30:36,414 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - Couldn't find model from org/apache/tika/parser/ner/opennlp/ner-date.bin using class loader
[INFO ] 2020-05-19 16:30:36,414 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - DATE NER : Available for service ? false
[WARN ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - Couldn't find model from org/apache/tika/parser/ner/opennlp/ner-money.bin using class loader
[INFO ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - MONEY NER : Available for service ? false
[WARN ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - Couldn't find model from org/apache/tika/parser/ner/opennlp/ner-person.bin using class loader
[INFO ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - PERSON NER : Available for service ? false
[WARN ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - Couldn't find model from org/apache/tika/parser/ner/opennlp/ner-percentage.bin using class loader
[INFO ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - PERCENT NER : Available for service ? false
[WARN ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - Couldn't find model from org/apache/tika/parser/ner/opennlp/ner-time.bin using class loader
[INFO ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.opennlp.OpenNLPNameFinder - TIME NER : Available for service ? false
[INFO ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.NamedEntityParser - org.apache.tika.parser.ner.opennlp.OpenNLPNERecogniser is available ? false
[INFO ] 2020-05-19 16:30:36,415 org.apache.tika.parser.ner.NamedEntityParser - going to load, instantiate and bind the instance of org.apache.tika.parser.ner.regex.RegexNERecogniser
[INFO ] 2020-05-19 16:30:36,452 org.apache.tika.parser.ner.NamedEntityParser - org.apache.tika.parser.ner.regex.RegexNERecogniser is available ? false
[INFO ] 2020-05-19 16:30:36,480 org.apache.tika.parser.ner.NamedEntityParser - Number of NERecognisers in chain 0

Any ideas on what's missing?

Thanks

chrismattmann commented 4 years ago

Hey @jjelosua, what is in the directory /Users/juan/code/FARO/models/ner/? Would it happen to be org/apache/tika/parser/ner/opennlp/*.bin? If not, you need those model files on the classpath somewhere. Where the log says "NER : Available for service ? false", that tells you NER isn't on because it couldn't find the model. Double-check that the models are on the classpath...
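For anyone hitting the same warnings, here is a rough Python sketch of the check being suggested: for each model resource named in the log above, look for it under every classpath entry at the full org/apache/tika/parser/ner/opennlp/ path. The function names `resource_on_classpath` and `report_models` are made up for illustration, and the lookup is only an approximation of how a Java class loader resolves resources (plain directories and jar/zip files, nothing else), not Tika's actual code.

```python
import os
import zipfile

# Resource directory and model file names copied from the server log above.
RESOURCE_DIR = "org/apache/tika/parser/ner/opennlp"
MODEL_NAMES = [
    "ner-location.bin",
    "ner-organization.bin",
    "ner-date.bin",
    "ner-money.bin",
    "ner-person.bin",
    "ner-percentage.bin",
    "ner-time.bin",
]


def resource_on_classpath(resource, classpath):
    """Return the first classpath entry that contains `resource`, or None.

    Hypothetical helper: approximates class-loader lookup by joining the
    resource name onto each directory entry, or searching each jar/zip entry.
    """
    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        if os.path.isdir(entry):
            # The resource name is relative to the classpath root.
            candidate = os.path.join(entry, *resource.split("/"))
            if os.path.isfile(candidate):
                return entry
        elif zipfile.is_zipfile(entry):
            with zipfile.ZipFile(entry) as jar:
                if resource in jar.namelist():
                    return entry
    return None


def report_models(classpath):
    """Map each expected model name to the classpath entry providing it (or None)."""
    return {
        name: resource_on_classpath(f"{RESOURCE_DIR}/{name}", classpath)
        for name in MODEL_NAMES
    }
```

Calling `report_models("/Users/juan/code/FARO/models/ner/:/tmp/tika-server.jar")` with the classpath from the launch command would be expected to map a model to `None` whenever its .bin file sits directly in the models directory rather than under org/apache/tika/parser/ner/opennlp/ inside it.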

jjelosua commented 4 years ago

Hi @chrismattmann, yeah the models are there....

(faro) ➜ ner git:(ner) ✗ pwd
/Users/juan/code/FARO/models/ner
(faro) ➜ ner git:(ner) ✗ ls -la
total 30360
drwxr-xr-x 5 juan staff 160 May 19 16:28 .
drwxr-xr-x 19 juan staff 608 May 19 16:28 ..
-rw-r--r-- 1 juan staff 5030307 May 19 16:25 ner-date.bin
-rw-r--r-- 1 juan staff 5297172 May 19 16:25 ner-organization.bin
-rw-r--r-- 1 juan staff 5207953 May 19 13:16 ner-person.bin

jjelosua commented 4 years ago

Do I need to recreate the org/apache/tika/parser/ner/opennlp/ path inside mine? I do not think so...right?

jjelosua commented 4 years ago

@chrismattmann, Recreating org/apache/tika/parser/ner/opennlp/*.bin inside /Users/juan/code/FARO/models/ner/ worked....who knew ;-)
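A minimal sketch of that fix in Python, in case it helps someone script it: copy every .bin file from a flat models directory into org/apache/tika/parser/ner/opennlp/ under the directory that is passed to -classpath. `install_models` is a hypothetical helper name, and copying (rather than moving or symlinking) is just one choice made here.

```python
import shutil
from pathlib import Path

# Package-style path the server log reports when looking up the models.
RESOURCE_DIR = Path("org/apache/tika/parser/ner/opennlp")


def install_models(flat_dir, classpath_root):
    """Copy each *.bin in flat_dir into classpath_root/org/apache/tika/parser/ner/opennlp/.

    Hypothetical helper: returns the list of destination paths, sorted by name.
    The original files in flat_dir are left in place.
    """
    target = Path(classpath_root) / RESOURCE_DIR
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # glob("*.bin") is non-recursive, so files already under target are not picked up.
    for model in sorted(Path(flat_dir).glob("*.bin")):
        destination = target / model.name
        shutil.copy2(model, destination)
        copied.append(destination)
    return copied
```

With the paths from this thread, `install_models("/Users/juan/code/FARO/models/ner", "/Users/juan/code/FARO/models/ner")` would recreate the layout inside the same directory, so the -classpath value in the launch command does not need to change.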

Thanks, and sorry for the off-topic question

Cheers

chrismattmann commented 4 years ago

no worries at all! It has to do with classpath loading issues in Java :) If you think there should be doc updates in README.md send them my way for a PR! 👍