Add OpenNLP Analysis capabilities as a module [LUCENE-2899] - Githubissues

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.56k stars 1k forks source link

Add OpenNLP Analysis capabilities as a module [LUCENE-2899] #3973

Closed asfimport closed 6 years ago

asfimport commented 13 years ago

Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:

Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
NamedEntity recognition as a TokenFilter

We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.

I'd propose it go under: modules/analysis/opennlp

Migrated from LUCENE-2899 by Grant Ingersoll (@gsingers), 36 votes, resolved Dec 19 2017 Attachments: LUCENE-2899.patch (versions: 6), LUCENE-2899-6.1.0.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, OpenNLPTokenizer.java Linked issues:

SOLR-4793
- SOLR-3623

asfimport commented 13 years ago

Jörn Kottmann (migrated from JIRA)

The first release is now out. I guess you will use maven for dependency management, you can find here how to add the released version as a dependency: http://incubator.apache.org/opennlp/maven-dependency.html

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

This is a patch for the trunk (as of a few days ago) that supplies the OpenNLP Sentence Detector, Tokenizer, Parts-of-Speech, Chunking and Named Entity Recognition tools.

This has nothing to do with the code mentioned above.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

Notes for a Wiki page:

OpenNLP Integration

What is the integration? The first integration is a Tokenizer and three Filters.

The OpenNLPTokenizer uses the OpenNLP SentenceDetector and Tokenizer tools instead of the standard Lucene Tokenizers. This requires statistical model files. One quirk of these is that all punctuation is maintained.
The OpenNLPFilter implements Parts-of-Speech tagging, Chunking (finding noun/verb phrases), and Named Entity Recognition (tagging people, place names etc.). This filter will add all tags as payload attributes to the tokens.
The FilterPayloadsFilter removes tokens by checking the payloads. Given a list of payloads, it will either keep only tokens with one of those payloads, or remove only matching tokens and keep the rest. (This filter maintains position increments correctly.)
The StripPayloadsFilter removes payloads from Tokens.

How do I get going?

pull the latest trunk
apply the patch
download these models to contrib/opennlp/src/test-* files/opennlp/solr/conf/opennlp/
- http://opennlp.sourceforge.net/models-1.5/
- Everything that starts with 'en'
download the OpenNLP distribution from http://opennlp.apache.org/cgi-bin/download.cgi
- Currently it is apache-opennlp-1.5.2-incubating-bin.tar.gz
unpack this and copy the jar files from lib/ to solr/contrib/opennlp/lib

Now, go to trunk-dir/solr and run 'ant test-contrib'. It compiles against the libraries and uses the model files. Next, run 'ant example', cd to the example directory and run 'java -Dsolr.solr.home=opennlp -jar start.jar' You now should start without any Exceptions. At this point, go to the Schema analyzer, pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should get text tokenized with payloads. Unfortunately, the analysis page shows them as bytes instead of text. If you would like this, then go vote on https://issues.apache.org/jira/browse/SOLR-3493.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

About the build-

This should be a Lucene module. I got lost trying to make the build work copying jars around, so it will ended up in Solr/contrib.
Downloading the jars. I don't know how to put together license validation with the OpenNLP Maven build. I think it takes some upgrading in the OpenNLP project.
Why download the models from a separate place? The models are not Apache licensed. They are binaries derived from GNU- and otherwise licensed training data. The OpenNLP people archived them on Sourceforge.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

I consider the code and feature set mostly cooked as a first release. The toolkit as is lets you do two things:

Do named entity recognition and filter out names for an autosuggest dictionary
Pick nouns and verbs out of text and only index those. This gives you a field with a smaller more focused set of terms. MoreLikeThis might work better.

Please review it for design, bugs, code nits, whatever.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

An explanation about the OpenNLPUtil factory class: the statistical models are several megabytes apiece. This class loads them and caches them by file name. It does not reload them across commits.

The models are immutable objects. The factory class creates another object that consults the model. There is one of these for each field analysis.

The models are large enough that if the different unit tests load them all at once, it needs more than the default ram. Therefore, the unit tests unload all models between tests, and only run single-threaded.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

License-ready. Ivy-ready. OpenNLP libraries available through Ivy. You still have to download jwnl-1.3.3 from http://sourceforge.net/projects/jwordnet/files/

And of course download the model files. But this is committable to the Solr side.

asfimport commented 12 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Very cool Lance. The models are indeed tricky and I wonder how we can properly hook them into the tests, if at all. I wonder how hard it would be to create much smaller ones based on training just a few things.

asfimport commented 12 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

I wonder how hard it would be to create much smaller ones based on training just a few things.

there was the idea of using the OpenNLP CorpusServer with some wikinews articles to train them (back to OPENNLP-385)

asfimport commented 12 years ago

Jörn Kottmann (migrated from JIRA)

I am using this mentioned Corpus Server together with the Apache UIMA Cas Editor for labeling projects. If someone wants to set something up to label data we (OpenNLP people) are happy to help with that!

asfimport commented 12 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Cool!

I think if we could just get a very small model that can be checked in and used for testing purposes, that is all that would be needed. We don't really need to test OpenNLP, we just need to test that the code properly interfaces with OpenNLP, so a really small model should be fine.

asfimport commented 12 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

This really should just be a part of the analysis modules (with the exception of the Solr example parts). I don't know exactly how we are handling Solr examples anymore, but I seem to recall the general consensus was to not proliferate them. Can we just expose the functionality in the main one?

I'll update the patch to move this to the module for starters. Not sure on what to do w/ the example part.

asfimport commented 12 years ago

Jörn Kottmann (migrated from JIRA)

For a test you can run OpenNLP just over a piece of training data, even when trained on a tiny amount of data this will give good results. It does not test OpenNLP, but is sufficient for the desired interface testing.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

This really should just be a part of the analysis modules (with the exception of the Solr example parts). I don't know exactly how we are handling Solr examples anymore, but I seem to recall the general consensus was to not proliferate them. Can we just expose the functionality in the main one?

A lot of Solr/Lucene features are only demoed in solrconfig/schema unit test files (DIH for example). That is fine.

The models are indeed tricky and I wonder how we can properly hook them into the tests, if at all.

D'oh! Forgot about that. If we have tagged data in the project, it helps show the other parts of an NLP suite. It's hard to get a full picture of the jigsaw puzzle if you don't know NLP software.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

Wiki page is up! http://wiki.apache.org/solr/OpenNLP

Also, the Solr fancy toolkits had no links from the Solr front page, so I added 'Advanced Tools' with links to UIMA and this.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

The models are indeed tricky and I wonder how we can properly hook them into the tests, if at all.

I have mini training data for sentence detection, tokenization, POS and chunking. The purpose is to make the matching unit tests pass. The data and build script are in a new (unattached) patch.

NER is proving a tougher nut to crack. I tried annotating several hundred lines of Reuters but no go.

How would I make an NER dataset that will make OpenNLP spit out one or two tags? Is there a large NER dataset that is Apache-friendly?

asfimport commented 12 years ago

Jörn Kottmann (migrated from JIRA)

For NER you should try the perceptron and a cutoff of zero. For NER with a cutoff of 5 you need otherwise much more training data.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

For NER you should try the perceptron and a cutoff of zero.

Thanks! This patch generates all models needed by tests, and the tests are rewritten to use the poor quality data from the models. To make the models, go to solr/contrib/opennlp/src/test-files/training and run bin/training.sh. This populates solr/contrib/opennlp/src/test-files/opennlp/conf/opennlp. I don't have windows anymore so I can't make a .bat version.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

General status:

At this point you have to download 1 library (jwnl) and run a script to make the unit tests work.
You have to download several model files from sourceforge to do real work. There is no script to help.
The tokenizer and filter are in solr/ not lucene/

What is missing to make this a full package:

Payload handling
- TokenFilter to parse TAG/term or term_TAG into term/payload.
- Output code in Solr for the reverse.
- Payload query for tags.
- Similarity scoring algorithms for tags.
Tag handling
- There is a universal set of 12 parts-of-speech tags, with mappings for many language tagsets (Treebank etc.) into 12 common tags. Multi-language sites would benefit from this. I persuaded the authors to switch from GNU to Apache licensing.
- A Universal Part-of-Speech Tagset

What NLP apps would be useful for search? Coordinate expansion, for example.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

This is about finished. The Tokenizer and TokenFilters are moved over into lucene/analysis/opennlp. They do not have unit tests in lucene/ because of the difficulty in supplying model data. They are unit-tested by the factories in solr/contrib/opennlp.

The solr/example/opennlp directory is gone, as per request. Possible field types are documented in the solrconfig.xml in the unit test resources.

All jars are downloaded via ivy. The jwnl library is one rev after what this was compiled with. It is only used in collocation, which is not exposed in this release.

To build, test and commit, there is a boostrap sequence. In the top-level directory:

  ant clean compile

This downloads the OpenNLP jars

cd solr/contrib/opennlp/test-files/training
sh bin/training.sh

This creates low-quality model files in solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/opennlp. In the trunk/solr directory, run


ant example test-contrib

You now have committable binary models. They are small, and only there to run the OpenNLP unit tests. They generate results that are objectively bogus, but the unit tests are matched to the results. If you want real models, you have to download them from sourceforge.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

Oops- remove solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/opennlp/.gitignore. This will prevent you from committing the models.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

dev-tools needs updating. I don't have IntelliJ and don't feel comfortable making the right Eclipse files.

This patch works on both trunk and 4.x. I made a few changes in the build files where modules were out of alphabetic order. Also, the reams of copied code in module-build.xml had blocks out of order. I can't easily see where, but it seems like some of them are missing a few lines that others have.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

The Wiki is updated for testing and committing this patch: http://wiki.apache.org/solr/OpenNLP.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

There is a regression in Solr which causes this to not work in a Solr example: https://issues.apache.org/jira/browse/SOLR-3625. Until this is fixed, you have to copy the Lucene opennlp jar, the Solr opennlp jar, and the solr/contrib/opennlp/lib jars into the solr war.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

https://issues.apache.org/jira/browse/SOLR-3623 should give a final answer for how to build contribs and Lucene libraries and external dependencies. I've found it a little confusing.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

New patch for current build system on trunk & 4.x.

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

As it turns out, building is still confused: solr/example/solr-webapps comes and goes.

This build parks the lucene-analyzer-opennlp jar in solr/contrib/opennlp/lucene-libs. example/..../solrconfig.xml includes a reference to ../....../contrib/opennlp/lib and lucene-libs and ../...../dist.

A jar-of-jars or a fully repacked jar in dist/ is the best way to ship this.

Bug status: payloads added by this filter do not get written to the index!

Build-fiddling status: forbidden api checks fail. checksums and licenses validate. rat-sources validate. No dev-tools changes.

If you want this committed, I'm quite happy to do the last mile.

asfimport commented 12 years ago

Alexey Kozhemiakin (migrated from JIRA)

Yes, please, it would be awesome if someone could make this last effort and commit this issue. Many thanks!

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

Committable except for dev-tools/ and production builds. I've updated dev-tools/eclipse, I don't have IntelliJ. These dev-tools build files contain 'uima' and so need parallel work for 'opennlp':

dev-tools/maven/lucene/analysis/pom.xml.template
dev-tools/maven/lucene/analysis/uima/pom.xml.template
dev-tools/maven/pom.xml.template
dev-tools/maven/solr/contrib/pom.xml.template
dev-tools/maven/solr/contrib/uima/pom.xml.template
dev-tools/scripts/SOLR-2452.patch.hack.pl
  - this one seems to be dead

asfimport commented 12 years ago

Lance Norskog (migrated from JIRA)

The latest patch is tested fully and painfully in trunk. I'm sure it works as-is in 4.x, but it is not going into 4.0, so I'm not spending time on that

asfimport commented 11 years ago

Em (migrated from JIRA)

Could you please create a new Patch for the current Trunk? I had some problems on applying it to my working copy...

I am not entirely sure whether its the Trunk or your Code, but it seems like your OpenNLP-code only works for the first request.

As far as I was able to debug, the create()-method of the TokenFilterFactory is only called every now and again (are created TokenFilters reused for longer than one call in Solr?).

If create() of your FilterFactory was called, everything works. However if the TokenFilter is somehow reused, it fails.

Is this a bug of Solr or of your Patch?

asfimport commented 11 years ago

Em (migrated from JIRA)

Some Attributes were not reset (i.e. "first"-Attribute in OpenNLPTokenizer and "indexToken" in OpenNLPFilter) correctly.

Since I had trouble applying your patch, I'd like to provide the working source code. Please, create a patch for the current Trunk.

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Thank you!

This worked when I posted it. There have been many changes in 4.x and trunk since then. For example, all of the tokenizer and filter factories moved to Lucene from Solr. I'm waiting until 4.0 is finished before I redo this patch.

asfimport commented 11 years ago

Phani Vempaty (migrated from JIRA)

Would there be a patch for 4.0 as it is released.

asfimport commented 11 years ago

Patricia Gorla (migrated from JIRA)

Thanks for this patch!

I'm able to get the posTagger working, yet I still have not found a way to incorporate either the Chunker or the NER Models into my Solr project.

Setting posTagger by itself works, but when I add a link to the chunkerModel (or even just the chunkerModel by itself), I obtain only the tokenized text.

<fieldType name="text_opennlp_pos" class="solr.TextField" positionIncrementGap="100">
<analyzer>
   <tokenizer class="solr.OpenNLPTokenizerFactory"
      tokenizerModel="opennlp/en-token.bin" />
   <filter class="solr.OpenNLPFilterFactory" 
      chunkerModel="opennlp/en-chunking.bin"/>
</analyzer>
</fieldType>

I'm new to OpenNLP, so any pointers in the right direction would be greatly appreciated.

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Wow, someone tried it! I apologize for not noticing your question.

I'm able to get the posTagger working, yet I still have not found a way to incorporate either the Chunker or the NER Models into my Solr project.

The schema.xml file includes samples for all of the models:

/lusolr_4x_opennlp/solr/contrib/opennlp/src/test-files/opennlp/solr/collection1/conf/schema.xml

This is for the chunker. The chunker works from parts-of-speech tags, not the original words. The chunker needs a parts-of-speech model as well as a chunker model. This should throw an error if the parts-of-speech model is not there. I will fix this.

 <filter class="solr.OpenNLPFilterFactory" 
          posTaggerModel="opennlp/en-test-pos-maxent.bin"
          chunkerModel="opennlp/en-test-chunker.bin"
        />

Is the NER configuration still not working?

asfimport commented 11 years ago

Kai Gülzau (migrated from JIRA)

The patch seems to be a bit out of date. Applying it to branch_4x or trunk fails (build scripts).

asfimport commented 11 years ago

Kai Gülzau (migrated from JIRA)

End of OpenNLPTokenizer.fillBuffer() should be:

while(length == size) {
  offset += size;
  fullText = Arrays.copyOf(fullText, offset + size);
  length = input.read(fullText, offset, size);
}
if (length == -1) {
  length = 0;
}
fullText = Arrays.copyOf(fullText, offset + length);

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Thank you. Have you tried this on the trunk? The Solr components did not work, they could not find the OpenNLP jars.

asfimport commented 11 years ago

Kai Gülzau (migrated from JIRA)

I have applied the Patch to trunk, modified the build scripts manually (ignoring javadoc tasks) and built the opennlp jars. Jars are running in a vanilla Solr 4.1 environment.

solr_server4.1\solr\lib\opennlp\
- -jwnl-1.4_rc3.jar-
- lucene-analyzers-opennlp-5.0-SNAPSHOT.jar (build with patch)
- opennlp-maxent-3.0.2-incubating.jar
- opennlp-tools-1.5.2-incubating.jar
- solr-opennlp-5.0-SNAPSHOT.jar (build with patch)

with <lib dir="../lib/opennlp" /> in solrconfig.xml

Works for me: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201301.mbox/%3CB65DA877C3F93B4FB39EA49A1A03C95CC27AB1%40email.novomind.com%3E

edit: removed jwnl*.jar as stated by Joern

asfimport commented 11 years ago

Jörn Kottmann (migrated from JIRA)

The jwnl library is only needed if you use the OpenNLP Coreference component, otherwise its safe to exclude it. The 1.4_rc3 version is not tested anyway and its likely that the Coreferencer does not probably run with it.

asfimport commented 11 years ago

Rene Nederhand (migrated from JIRA)

New patch for both trunk and 4.1 stable. Tested on revision 1450998.

ant compile
cd solr/contrib/src/test-files/training
sh bin/trainall.sh
cd ../../../../../../solr
ant example test-contrib

Hope this helps more people in testing OpenNLP integration with Solr.

TODO:

Implementing dev-tools
Include references to javadocs

asfimport commented 11 years ago

Rene Nederhand (migrated from JIRA)

New patch for both trunk and 4.1 stable. Tested on revision 1450998.

ant compile
cd solr/contrib/src/test-files/training
sh bin/trainall.sh
cd ../../../../../../solr
ant example test-contrib

Hope this helps more people in testing OpenNLP integration with Solr.

TODO:

Implementing dev-tools
Include references to javadocs

asfimport commented 11 years ago

Maciej Lizewski (migrated from JIRA)

why don't you prepare this as separate project that produces some jars and config files with instructions on how to add it in solr configuration instead of publishing all changes as patches to solr sources? I am interested in doing some tests with your library but setting all things up seems quite complicated and hard to maintain in future... it is just a thought.

asfimport commented 11 years ago

Zack Zullick (@zzullick) (migrated from JIRA)

Some information for those wanting to try this after fighting it for a day: the latest patch posted, LUCENE-2899-RJN.patch for 4.1 does not have Em's OpenNLPFilter.java and OpenNLPTokenizer.java fixed applied. So after applying the patch, make sure to replace those classes with Em's version or the bug that causes the NLP system to only be utilized on the first request will still be present. I was also able to successfully apply this patch to 4.2.1 with minor modification (mostly to the build/ivy xml files).

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Maciej- This is a good point. This package needs changes in a lot of places and it might be easier to package it the way you say.

Zack- The "churn" in the APIs is a major problem in the Lucene code management. The original patch worked in the 4.x branch and trunk when it was posted. What Em fixed is in an area which is very very basic to Lucene. The API changed with no notice and no change in versions or method names.

Everyone- It's great that this has gained some interest. Please create a new master patch with whatever changes are needed for the current code base.

Lucene grand masters- Please don't say "hey kids, write plugins, they're cool!" and then make subtle incompatible changes in APIs.

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

I'm updating the patches for 4.x and trunk. Kai's fix works. The unit tests did not attempt to analyse text that is longer than the fixed size temp buffer, and thus the code for copying successive buffers was never exercised. Kai's fix handles this problem. I've added a unit test.

Em: the Lucene Tokenizer lifecyle is that the Tokenizer is created with a Reader, and each call to incrementToken() walks the input. When incrementToken() returns false, that is all- the Tokenizer is finished. TokenStream can support a 'stateful' token stream: with OpenNLPFilter, you call incrementToken() until it returns false, and then you can call 'reset' and it will start over from the beginning. The unit tests include a check that reset() works. The changes you made support a feature that is not supported by Lucene. Also, the changes break most of the unit tests. Please create a unit test that shows the bug, and fix the existing unit tests. No unit test = no bug report.

I'm posting a patch for the current 4.x and trunk. It includes some changes for TokenStream/TokenFilter method signatures, some refactoring in the unit tests, a little tightening in the Tokenizer & Filter, and Kai's fix. There are unit tests for the problem Kai found, and also a test that has TokenizerFactory create multiple Tokenizer streams. If there is a bug in this patch, please write a unit test which demonstrates it.

The patch is called LUCENE-2899-current.patch. It is tested against the current 4.x branch and the current trunk.

Thanks for your interest and hard work- I know it is really tedious to understand this code :)

Lance Norskog

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

I found the problem with multiple documents. The API for reusing Tokenizers changed something more sensible, but I only noticed and implemented part of the change. The result was than when you upload multiple documents, it just re-processed the first document.

File LUCENE-2899-x.patch has this fix. It applies against the 4.x branch and the trunk. It does not apply against Lucene 4.0, 4.1, 4.2 or 4.3. For all released Solr versions you want LUCENE-2899.patch from August 27, 2012. There are no new features since that release.

asfimport commented 11 years ago

Jörn Kottmann (migrated from JIRA)

Lance, does the patch gets jwnl form our old SourceForge page? This page is often overloaded and probably makes your build unstable. To solve this issue (see OPENNLP-510) we moved jwnl for 1.5.3 to the central repo. Anyway as long as you don't use the coreference component you can exclude this dependency.

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Yup- upgrading to 1.5.3 is next on the list.

Next