Add OpenNLP Analysis capabilities as a module [LUCENE-2899]

asfimport commented 13 years ago

Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:

Sentence Detection as a Tokenizer (could also be a TokenFilter, although it would have to change slightly to buffer tokens)
NamedEntity recognition as a TokenFilter

We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.

I'd propose it go under: modules/analysis/opennlp

Migrated from LUCENE-2899 by Grant Ingersoll (@gsingers), 36 votes, resolved Dec 19 2017 Attachments: LUCENE-2899.patch (versions: 6), LUCENE-2899-6.1.0.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, OpenNLPTokenizer.java Linked issues:

SOLR-4793
- SOLR-3623

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

I did not make the right changes to OpenNLPFilter.java to handle the API changes. I have attached a fixed version of this to this issue. Please try it and see if it fixes what you see.

A-a-a-a-a-a-n-n-n-n-d chunking is broken. Oy.

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Fixed the Chunker problem. I switched to the new released version of the OpenNLP packages. The MaxEnt implementation (statistical modeling) for chunking changed slightly, and my test data now produces different noun&verb phrase chunks for the sample text.

At this point the only problems I know of are that the licenses are slightly wrong, and so 'ant validate' fails.

These comments only apply to LUCENE-2899-x.patch, which applies to the current 4.x and trunk codelines. LUCENE-2899.patch applies to the release 4.0->4.3 releases. It is not upgraded to the new OpenNLP release.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 11 years ago

Andrew Janowczyk (migrated from JIRA)

A little bit of a shameless plug, but we just wrote a blog post here about using the stanford library for NER as a processor factory / request handler for Solr. It seems applicable to the audience on this ticket, is it worth contributing it to the community via a patch of some sort?

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Yup! Another NER is always helpful. But the big problem with NLP software is not the code but the models- do you have a good source of free models?

asfimport commented 11 years ago

Jörn Kottmann (migrated from JIRA)

Stanford NLP is licensed under GPLv2, this license is not compatible with the AL 2.0 and therefore such a component can't be contributed to an Apache project directly.

asfimport commented 11 years ago

Andrew Janowczyk (migrated from JIRA)

ahhh thanks for the info. i found a relevant link discussing the licenses which clearly explains the details here. oh well, it was worth a try :)

asfimport commented 11 years ago

Jörn Kottmann (migrated from JIRA)

@[\~lancenorskog] we now have support in OpenNLP to train the name finder on a corpus in the Brat [1] data format, that makes it much easier to annotate custom data within a couple of days/weeks.

[1] http://brat.nlplab.org/

asfimport commented 11 years ago

Lance Norskog (migrated from JIRA)

Wow! Brat looks bitchin! Looking forward to using it.

asfimport commented 10 years ago

rashi gandhi (migrated from JIRA)

Hi, I have applied this patch successfully on SOLR latest branch 4.x. But now I am not getting how to perform contextual searches on the data I have. I need to perform search on text field using some NLP process. I am new to NLP so need some help on how do I proceed further. How to train model using this integrated solr ? Do I need to study some thing else before moving ahead with this ?

I designed a analyzer and tried indexing data. But the results are weird and inconsistent. Kindly provide some pointers to move ahead

Thanks in advance.

asfimport commented 10 years ago

rashi gandhi (migrated from JIRA)

Hi,

I designed an analyzer using OpenNLP filters and indexed some data on it.

My problem is:While searching, SOLR sometimes return result and sometimes not ( but documents are there). for example: if i search for Detail_Nvf:brett ,it returns a document and after sometime again if i fire the same query, it returns Zero document Iam not getting why SOLR results are unstable. Please help me on this.

Thanks in Advance

asfimport commented 10 years ago

Zack Zullick (@zzullick) (migrated from JIRA)

I have seen this behavior before (look up to previous comments, especially from user Em and his previous fix) and I am experiencing similar results with the latest patch uploaded (Jun-16-2013) on 4.4/branch_4x. In my case, the OpenNLP system is only working when indexing the first document then no longer working thereafter. It seems you are having a similar issue, with the exception that yours is happening on the query end rather than the indexing. I sent out an email to Lance to see if he has any advice for us.

asfimport commented 10 years ago

rashi gandhi (migrated from JIRA)

Thanks Zack

Waiting for a reply from Lance :)

asfimport commented 10 years ago

simon raphael (migrated from JIRA)

Hi,

I'm new to Solr and Opennlp. I have followed the tutorial to install this patch. I have downloaded the branch_4x, then i download and apply the LUCENE-2899-current.patch. Then i do "ant compile".

Everything works fine, but no opennlp folder in /solr/contrib/ is created.

What I am doing wrong?

Thanks for your help :)

asfimport commented 10 years ago

Lance Norskog (migrated from JIRA)

Hi-

The latest patch is LUCENE-2899-x.patch, pls try that. Also, apply it with: patch -p0 < patchfile

Lance

asfimport commented 10 years ago

Lance Norskog (migrated from JIRA)

This patch includes a fix for the problem where searching twice doesn't work. The file is LUCENE-2899.patch It has been tested with trunk, branch_4x and the 4.5.1 release.

I do not know of any outstanding issues. To avoid confusion, I have removed all old patches.

asfimport commented 10 years ago

simon raphael (migrated from JIRA)

Hi,

I have a problem after installing the patch. I can't launch Solr anymore. I've got the following error :

Plugin init failure for [schema.xml] analyzer/tokenizer: Error loading class 'solr.OpenNLPTokenizerFactory'

Though the opennlp*.jar files are correctly added :

Adding 'file:/var/www/lucene_solr_4_5_1/solr/contrib/opennlp/lib/opennlp-tools-1.5.3.jar' to classloader 5453 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrResourceLoader – Adding 'file:/var/www/lucene_solr_4_5_1/solr/contrib/opennlp/lib/opennlp-maxent-3.0.3.jar' to classloader

Any idea of what I am doing wrong ?

Thank you :)

asfimport commented 10 years ago

Lance Norskog (migrated from JIRA)

The solrconfig.xml file should have these lines in the library set:

<lib dir="../../../contrib/opennlp/lib" regex=".\.jar" /> <lib dir="../../../dist/" regex="solr-opennlp-\d.\.jar" />

Also, you have to copy lucene/build/analysis/opennlp/lucene-analyzers-opennlp*.jar to {{solr/contrib/opennlp/lib/} .

This last problem was a mess. I have not followed these issues: https://issues.apache.org/jira/browse/SOLR-3664, #6313, #6321. I don't know if they handle the problem I described. Shipping this thing as a Lucene/Solr contrib module patch was a mistake- it intersects the build&code structure in too many places.

asfimport commented 10 years ago

Markus Jelsma (migrated from JIRA)

Hi - any change this is going to get committed some day?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Markus: I haven't looked at this patch. I'll review it now and give my thoughts.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just some thoughts:

I think it would be best to split out the different functionality here into subtasks for each piece, and figure out how each should best be integrated.

The current patch does strange things to try to deal with some impedence mismatch due to the design here, such as the tokenfilter which consumes the entire analysis chain and then replays the whole thing back with POS or NER as payloads. Is it really necessary to give this thing more scope than a single setnence? typically such tagging models (at least the ones ive worked with) tend to be trained only within sentence scope.

Also payloads should not be used internally, instead things like TypeAttribute should be used for POSTags, if someone wants to filter out certain POS or maintain certain POS they can use already existing stuff like TypeTokenFilter, if they want to index Type as a payload, they can use TypeAsPayloadTokenFilter, and so on.

While I can see this POS-tagging being useful inside the analysis chain: the NER case is much less clear, I think its more important to e.g. be integrated outside of the analysis chain so that named entities/mentions can be faceted on, added to separate fields for search (likely with a different analysis chain for that), etc. So for lucene that would be an easier way to add these as facets, for solr it probably makes more sense as UpdateProcessor than as analysis chain.

Finally: I'm confused as to what benefit we get from using OpenNLP directly, versus integrating with it via opennlp-uima? Our UIMA integration at various levels (analysis chain/update processor) is already there, so I'm just wondering if thats a much shorter way path.

asfimport commented 10 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

I know of an NER model that looks at the entire text to bias towards consistent tagging of entities in larger units. However, I agree that crocks are bad. Perhaps this is an opportunity to think about how to expand the analysis protocol to support this sort of thing more smoothly?

It would be desirable if this integration were to start with a set of Token Attributes that could be used in any number of analysis components, inside or outside of Lucene, that were in a position to deliver similar items. I suppose I'm late to ask for this, as the UIMA component must pose the same question.

In some languages, NER is very clumsy as a token filter, because entities don't obey token boundaries very well. Also, in my experience, entities aren't useful as additional tokens in the same field as their source text, but rather in their own field (where they can be facetted upon, for example). Is there any appetite to look at Lucene support for a stream that delivers to more than one field? Or is there such a thing and I've missed it?

I agree with Rob about UIMA because I think that Lucene analysis attributes are a weak data model for interconnecting NLP modules and flowing data through them – and one frequently needs to do that.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't think we should expand the analysis protocol: I think its actually already more complicated than it needs to be.

It doesnt need to work across multiple fields or support things like NER.

I know people disagree, but i dont care (typically they dont do a lot of work to maintain this code).

I'll fight it to the death: Lucene's analysis is about doing information retrieval (search and query), and its already overly complex. It should stay per-field, it should stay like a state machine it is.

Stuff like this NER should NOT be in the analysis chain. as i said, its more useful in the "document build" phase anyway.

asfimport commented 10 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

Fair enough. Solr URP's do this very well upstream of analysis. ES doesn't have the concept, perhaps it should. It clarifies the situation nicely to think of Lucene as serial token operations.

asfimport commented 10 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Stuff like this NER should NOT be in the analysis chain. as i said, its more useful in the "document build" phase anyway.

+1

Benson, as far as I understand, ES doesn't have the concept by design.

asfimport commented 10 years ago

Jörn Kottmann (migrated from JIRA)

UIMA based NLP pipelines can use components like Solrcas or Lucas to write their results to an index. This works really well in my experience.

asfimport commented 10 years ago

rashi gandhi (migrated from JIRA)

Hi,

I have successfully applied LUCENE-2899.patch to SOLR-4.5.1 and its working properly. Now , my requirement is to combine OpenNLP with jwnl. Is it possible to combine OpenNLP with jwnl and what are the changes required in SOLR schema.xml for the same? Kindly provide some pointers to move ahead.

Thanks in Advance

asfimport commented 10 years ago

Lance Norskog (migrated from JIRA)

All fair criticisms.

About UIMA: clearly it is much more advanced than this design, but I'm not smart enough to use it :). I've tried to put together something useful (a few times) and each time was completely confused. I learn by example, and the examples are limited. Also there is very little traffic on the mailing lists etc. about UIMA.

About payloads v.s. internal attributes: the examples don't use this feature, but payloads are stored in the index. This supports a question-answering system. Add PERSON payloads with all records, then search for "word X AND 'payload PERSON anywhere'" when someone says "who is X". This does the tagging during indexing, but not searching. A better design would be to add PERSON as a synonym rather than a payload. I also don't see much traffic about payloads.

About doing this in the analysis pipeline v.s. upstream: yes, upstream request processors are the right place for this. In Solr. URPs don't exist in ES or just plain Lucene coding.

asfimport commented 10 years ago

Lance Norskog (migrated from JIRA)

JWNL is WordNet. Lucene has a WordNet parser for use as a synonym filter. http://lucene.apache.org/core/4_0_0/analyzers-common/index.html?org/apache/lucene/analysis/synonym/SynonymMap.html

I don't know how to use this from a Solr filter factory. Please ask this on the Solr mailing list.

asfimport commented 10 years ago

rashi gandhi (migrated from JIRA)

ok, thanks Lance. One more Question I wanted to design an analyzer that can support location containment relationship , For example Europe->France->Paris My requirement is like: when user search for any country , then results must have the documents having that country , as well as the documents having states and cities which comes under that country. But , documents with country name must have high relevancy. It must obeys containment relationship up to 4 levels .i.e. Continent->Country->State->City I wanted to know , is there any way in OpenNLP that can support this type of search. Can location tagger model can be used for this? Please provide me some pointers to move ahead

Thanks in Advance

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport commented 10 years ago

rashi gandhi (migrated from JIRA)

Hi,

I have one running solr core with some data indexed on solr deployed on Tomcat. This core is designed to provide OpenNLP functionalities for indexing and searching. So I have kept following binary models at this location: \apache-tomcat-7.0.53\solr\collection1\conf\opennlp • en-sent.bin • en-token.bin • en-pos-maxent.bin • en-ner-person.bin • en-ner-location.bin

My Problem is: When I unload the running core, and try to delete conf directory from it. It is not allowing me to delete directory with prompt that en-sent.bin and en-token.bin is in use. All other files in conf directory are getting deleted except en-sent.bin and en-token.bin. If I have unloaded core, then why it is not unlocking the connection with core? Is this a known issue with OpenNLP Binaries? How can I release the connection between unloaded core and conf directory. (Specially binary models)

Please provide me some pointers on this. Thanks in Advance

asfimport commented 10 years ago

vivek (migrated from JIRA)

I followed this link to integrate https://wiki.apache.org/solr/OpenNLP to integrate

Installation

For English language testing: Until LUCENE-2899 is committed:

1.pull the latest trunk or 4.0 branch

2.apply the latest LUCENE-2899 patch
3.do 'ant compile'
cd solr/contrib/opennlp/src/test-files/training
.
.
.

i followed first two steps but got the following error while executing 3rd point

common.compile-core: [javac] Compiling 10 source files to /home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/build/analysis/opennlp/classes/java

[javac] warning: [path] bad path element "/home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/analysis/opennlp/lib/jwnl-1.3.3.jar": no such file or directory

[javac] /home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/FilterPayloadsFilter.java:43: error: cannot find symbol

[javac]     super(Version.LUCENE_44, input);

[javac]                  ^
[javac]   symbol:   variable LUCENE_44
[javac]   location: class Version
[javac] /home/biginfolabs/solrtest/solr-lucene-trunk3/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/OpenNLPTokenizer.java:56: error: no suitable constructor found for Tokenizer(Reader)
[javac]     super(input);
[javac]     ^
[javac]     constructor Tokenizer.Tokenizer(AttributeFactory) is not applicable
[javac]       (actual argument Reader cannot be converted to AttributeFactory by method invocation conversion)
[javac]     constructor Tokenizer.Tokenizer() is not applicable
[javac]       (actual and formal argument lists differ in length)
[javac] 2 errors
[javac] 1 warning

Im really stuck how to passthough this step. I wasted my entire day to fix this but couldn't move a bit. Please someone help me..?

asfimport commented 9 years ago

Sameer Maggon (migrated from JIRA)

@vivek you can change the file and replace the super(Version.LUCENE_44, input) with super(input);

asfimport commented 8 years ago

Alex Watson (migrated from JIRA)

I have tried this patch with the current trunk, it patches fine but when it goes to compile there are a lot of errors, some are easily fixed issues with java 8 being more strict but there are a few which are caused by method signatures being different.

Will there be an updated patch to fix these issues?

What is the status of this being fully integrated in to the trunk so a patch is not required?

asfimport commented 8 years ago

Lance Norskog (migrated from JIRA)

I don't work in this area anymore. Somebody else will have to make an up-to-date patch, and you need to find a committer to be a champion for it.

A tech report of a real-life deployment would be a great way to persuade someone.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch. I took Lance Norskog's latest patch and did the following (among other modernization/cleanup stuff):

Upgraded to the latest OpenNLP release version (1.6.0)
Moved the analysis factories, along with their tests and test data, into the Lucene module at lucene/analysis/opennlp/
Removed the Solr contrib, since it only contained the analysis factories
Added SPI files for the analysis factories
Extended BaseTokenStreamTestCase.assertTokenStreamContents() to test payload contents.
Converted analysis tests to use the above method instead of the home-grown ones in tests in the previous patch.
Converted the test model creation shell script into an Ant target named train-test-models. I've run that target and included the binary models it produced in this patch.
Added IntelliJ and Maven config
Added license and checksum files for the two OpenNLP dependencies
Included a dependency on the Lucene opennlp module in the Solr analysis-extras contrib, so that the module's jar and its dependencies be shipped with the distribution.

All the module's tests pass for me.

I manually tested the Solr integration:

built the distribution and unpacked it
cloned the data driven configs and modified solrconfig.xml to add <lib> elements for the two directories containing the necessary jars
downloaded English binary models from OpenNLP's sourceforge site
via the schema api, added a field type that performs sentence splitting, tokenization and POS tagging, and a field using it
added a simple doc with the opennlp-invoking field via curl and the update/json/docs handler
searched using the admin UI
random text pasted into the Admin UI's analysis pane shows payloads with POS tags (as hex bytes...)

Left to do prior to committability, IMHO, in no particular order:

I think OpenNLPFilter should be broken up into a separate component for each of the things it can do.
I agree with @rmuir that the tagging functionality here should be converted to use token type instead of payloads. Then the included FilterPayloadsFilter won't be necessary (since the TypeTokenFilter has the same functionality for token types), and probably the included StripPayloadsFilter won't be necessary either, since populating payloads would likely only be done as a final step in the analysis chain (e.g. via TypeAsPayloadTokenFilter).
Convert the NER support to be a Solr update processor.

Not sure it should prevent the current state from being committed, but: incorporating SegmentingTokenizerBase (extended by ThaiTokenizer and HMMChineseTokenizer) might be a useful improvement to the sentence-breaking/tokenization strategy currently used by OpenNLPTokenizer.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch, adds OpenNLPLemmatizerFilter, based on OpenNLP's SimpleLemmatizer (new in OpenNLP 1.6.0), which can lemmatize given a dictionary mapping surface form/part-of-speech => lemma.

asfimport commented 8 years ago

Ramesh Kumar Thirumalaisamy (migrated from JIRA)

Glad to see some update on this page. Hope this code gets committed soon. I am trying to use this patch to do some information extraction using SOLR and Open NLP. Can this patch be now used with latest version of SOLR .

asfimport commented 8 years ago

Steven Bower (migrated from JIRA)

@sarowe per our conversation yesterday.. Would be interesting to store the PoS and entity information as stacked tokens vs (or in addition to the) payload... such that you could do "bob@person"\~0 or "house@verb"\~0 type queries.. or things like "@person`@ceo"\~10`

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Can this patch be now used with latest version of SOLR .

The patch named LUCENE-2899.patch is against the (unreleased) master branch of Lucene/Solr.

I tried applying it to the 6.1.0 source code (from a git clone: git checkout releases/lucene-solr/6.1.0), and it failed in a couple places.

So I made a 6.1.0-specific patch named LUCENE-2899-6.1.0.patch and am attaching it to the issue. Compilation and tests succeeded after I ran ant train-test-models from lucene/analysis/opennlp/.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Steve Rowe per our conversation yesterday.. Would be interesting to store the PoS and entity information as stacked tokens vs (or in addition to the) payload... such that you could do "bob @person"\~0 or "house @verb"\~0 type queries.. or things like "@person @ceo"\~10

Steven Bower, I agree, that possibility would be nice. I checked for the existence of a token type->synonym filter, and don't see one, but I think it would be fairly easy to add.

Which reminds me: the lemmatization filter I added here should have the ability (like some stemmers, indirectly) to emit lemmas as synonyms - this is possible, as in the PorterStemmer implementaiton, simply by not processing any tokens with the keyword attribute set to true, and preceding with the KeywordRepeatFilter.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Attaching another WIP patch with more progress:

Switched OpenNLPFilter to use TypeAttribute instead of PayloadAttribute to hold annotations from part-of-speech tagging, chunking and NER tagging.
Added a new TypeAsSynonymFilter to the analyzers-common module that adds a token at the same position as a (presumably previously annotated) token, with the value of the TypeAttribute copied into its CharTermAttribute. See Steven Bower's comment above for potential uses.
Removed the now unnecessary FilterPayloadsFilter and StripPayloadFilter that were present in previous iterations of the patch.
Added KeywordAttribute awareness to OpenNLPLemmatizationFilter, so that lemmatization won't be performed on tokens with isKeyword()==true.
Fixed the new payload-aware BaseTokenStreamTestCase.assertTokenStreamContents() to use BytesRef.equals() instead of directly comparing byte arrays and not referencing offset&length.
Added TypeAttribute awareness to CannedTokenStream.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch, only changes are fixes to the opennlp overview javadocs:

refer to TypeAttribute instead of payloads
remove mention of FilterPayloadTokenFilter and StripPayloadsFilter
recommend TypeAsPayloadFilter and TypeAsSynonymFilter to make tags searchable