apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add OpenNLP Analysis capabilities as a module [LUCENE-2899] #3973

Closed asfimport closed 6 years ago

asfimport commented 13 years ago

Now that OpenNLP is an ASF project and has a nice license, it would be nice to have a submodule (under analysis) that exposed capabilities for it. Drew Farris, Tom Morton and I have code that does:

We are also planning a Tokenizer/TokenFilter that can put parts of speech as either payloads (PartOfSpeechAttribute?) on a token or at the same position.

I'd propose it go under: modules/analysis/opennlp


Migrated from LUCENE-2899 by Grant Ingersoll (@gsingers), 36 votes, resolved Dec 19 2017 Attachments: LUCENE-2899.patch (versions: 6), LUCENE-2899-6.1.0.patch, LUCENE-2899-RJN.patch, OpenNLPFilter.java, OpenNLPTokenizer.java Linked issues:

asfimport commented 7 years ago

Lance Norskog (migrated from JIRA)

I don't remember if it's always or just seldom. It was just something I noticed when testing them. I'm not an NLP researcher, and I've been out of the Solr world for years. It sounds like Joern Kottmann knows his way around this stuff.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Patch, lots of changes (see below), I think it's ready to go (precommit and all Lucene/Solr tests pass). My plan is to wait a couple days for review, then commit if there are no objections.

Changes since the last patch:

asfimport commented 6 years ago

Tommaso Teofili (@tteofili) (migrated from JIRA)

looks good to me, thanks Steve!

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit b720e1ee3a524034fb8a8a6188b0b23bf17ff1cb in lucene-solr's branch refs/heads/branch_7x from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b720e1e

LUCENE-2899: Add OpenNLP Analysis capabilities as a module

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 3e2f9e62d772218bf1fcae6d58542fad3ec43742 in lucene-solr's branch refs/heads/master from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3e2f9e6

LUCENE-2899: Add OpenNLP Analysis capabilities as a module

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Committed to master and branch_7x. Thanks everybody!

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 464199293d169d7d2096f0428792d72a722cc927 in lucene-solr's branch refs/heads/branch_7x from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4641992

LUCENE-2899: Fix hyperlink

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit f5c4276163d1211f33dc0f27e947e7dc78aa0444 in lucene-solr's branch refs/heads/master from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f5c4276

LUCENE-2899: Fix hyperlink

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 46fa2e45f72a83bde83e2773a6e24673c73c7505 in lucene-solr's branch refs/heads/branch_7x from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=46fa2e4

LUCENE-2899: Fix hyperlink text

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 565d13c96d89064214f74a81739eaf6b9fb7be18 in lucene-solr's branch refs/heads/master from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=565d13c

LUCENE-2899: Fix hyperlink text

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

My Jenkins found a reproducing seed on master for a TestOpenNLPPOSFilterFactory.testPOS() failure:

  [junit4] Suite: org.apache.lucene.analysis.opennlp.TestOpenNLPPOSFilterFactory
  [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestOpenNLPPOSFilterFactory -Dtests.method=testPOS -Dtests.seed=9CB6DAAD9AB4A5C3 -Dtests.slow=true -Dtests.locale=lv-LV -Dtests.timezone=America/La_Paz -Dtests.asserts=true -Dtests.file.encoding=UTF-8
  [junit4] FAILURE 0.02s J2 | TestOpenNLPPOSFilterFactory.testPOS <<<
  [junit4]    > Throwable #1: org.junit.ComparisonFailure: term 0 expected:<[Sentence]> but was:<[2]>
  [junit4]    >     at __randomizedtesting.SeedInfo.seed([9CB6DAAD9AB4A5C3:E04A51BF1852C7E]:0)
  [junit4]    >     at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:201)
  [junit4]    >     at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:325)
  [junit4]    >     at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:329)
  [junit4]    >     at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:865)
  [junit4]    >     at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:727)
  [junit4]    >     at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertAnalyzesTo(BaseTokenStreamTestCase.java:390)
  [junit4]    >     at org.apache.lucene.analysis.opennlp.TestOpenNLPPOSFilterFactory.testPOS(TestOpenNLPPOSFilterFactory.java:75)
  [junit4]    >     at java.lang.Thread.run(Thread.java:748)
  [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70), sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@4f758cba), locale=lv-LV, timezone=America/La_Paz
  [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_151 (64-bit)/cpus=16,threads=1,free=412573592,total=514850816
  [junit4]   2> NOTE: All tests run in this JVM: [TestOpenNLPPOSFilterFactory]

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 40ed963b362281c67b37efaca51a3dafca00762a in lucene-solr's branch refs/heads/branch_7x from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=40ed963

LUCENE-2899: tests: remove unused constants

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 42239ea51d9b1c34f20d273ab06821e22b421e54 in lucene-solr's branch refs/heads/branch_7x from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=42239ea

LUCENE-2899: OpenNLPPOSFilter: fix reset() to fully reset

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 827595233751d97d8a2408e69be5dbaf004c7d55 in lucene-solr's branch refs/heads/master from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8275952

LUCENE-2899: tests: remove unused constants

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit f8fb13965612142e9ee91631c6ce80a7b255e348 in lucene-solr's branch refs/heads/master from @sarowe https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8fb139

LUCENE-2899: OpenNLPPOSFilter: fix reset() to fully reset

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

My Jenkins found a reproducing seed on master for a TestOpenNLPPOSFilterFactory.testPOS() failure

I committed a fix for this and all other Jenkins failures I could find for this test suite: OpenNLPPOSFilter.reset() wasn't working properly, resulting in state being inappropriately carried over from previous invocations.
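To illustrate the kind of bug described here, the following is a hypothetical Python sketch, not the actual OpenNLPPOSFilter code (which is Java). It assumes a reusable filter that buffers tokens and tracks a read position; if reset() does not clear both, a stream that was abandoned midway leaks tokens into the next one. All class and function names are invented for illustration.

```python
class BufferingFilter:
    """Hypothetical stand-in for a reusable, stateful token filter."""

    def __init__(self, full_reset):
        self.full_reset = full_reset  # True: reset() clears all per-stream state
        self.source = iter(())        # upstream tokens
        self.buffer = []              # tokens buffered from the current stream
        self.pos = 0                  # index of the next buffered token to emit

    def set_input(self, tokens):
        self.source = iter(tokens)

    def reset(self):
        # The incomplete variant leaves buffer and pos untouched.
        if self.full_reset:
            self.buffer = []
            self.pos = 0

    def next_token(self):
        # Refill the buffer from upstream only once it has been consumed.
        if self.pos >= len(self.buffer):
            self.buffer = list(self.source)
            self.pos = 0
            if not self.buffer:
                return None
        token = self.buffer[self.pos]
        self.pos += 1
        return token


def first_token_after_reuse(full_reset):
    """Read one token, abandon the stream, then reuse the filter on new input."""
    f = BufferingFilter(full_reset)
    f.set_input(["1", "2", "3"])
    f.reset()
    f.next_token()                    # consumer stops after the first token
    f.set_input(["Sentence", "one"])
    f.reset()
    return f.next_token()


print(first_token_after_reuse(full_reset=False))  # stale token from the old stream
print(first_token_after_reuse(full_reset=True))   # first token of the new stream
```

With the incomplete reset the reused filter emits a leftover token from the earlier input instead of the first token of the new one, which is the general shape of the "expected Sentence but was 2" failure above; whether the real filter failed in exactly this way is not something this sketch claims.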

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Hi, do you plan to write documentation on how to work with this feature?

I have tried to install this and get it to work on SolrCloud, but I have had no luck.

There are not many answers or much info on Stack Overflow, so it would be nice to have docs to start using this feature!

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Hi, do you plan to write documentation on how to work with this feature? I have tried to install this and get it to work on SolrCloud, but I have had no luck.

This feature has not yet been released - Lucene/Solr 7.3 will include it when it's released, which will (very likely) happen today.

The 7.3 Solr reference guide, which is already online, includes some docs for the language analysis features added under this issue: http://lucene.apache.org/solr/guide/7_3/language-analysis.html#opennlp-integration. Here is the Solr 7.3 javadoc, also already online, for the NER update request processor: https://lucene.apache.org/solr/7_3_0/solr-analysis-extras/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.html

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Thanks, but how can I upload model files to ZooKeeper?

I have asked a similar question on SO, but nobody has answered it: https://stackoverflow.com/questions/49515397/upload-filebinary-into-zookeeper-solrcloud

Without uploading the model files I cannot use OpenNLP; this is a crucial step in the installation.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

how can I upload model files to ZooKeeper?

Have you seen https://lucene.apache.org/solr/guide/7_3/using-zookeeper-to-manage-configuration-files.html and https://lucene.apache.org/solr/guide/7_3/command-line-utilities.html ? In particular, from https://lucene.apache.org/solr/guide/7_3/command-line-utilities.html#put-a-local-file-into-a-new-zookeeper-file:

./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:9983 -cmd putfile /my_zk_file.txt /tmp/my_local_file.txt

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Thanks, I saw this, and I will try it. But the point is that I am using Windows and have been trying to use the solr zk command, which as far as I can see is the wrong direction.

Anyway, I will try server/scripts/cloud-scripts/zkcli.bat.

Thanks a lot for pointing me in the right direction.

asfimport commented 6 years ago

Lance Norskog (migrated from JIRA)

The last time I read up on ZK, files were limited to 1 MB. The ZK "file system" is intended for small configuration files. NLP models can be many megabytes. You might need an alternate path (scp) to distribute NLP models. On Windows, SMB file sharing.

asfimport commented 6 years ago

Lance Norskog (migrated from JIRA)

I'm so cheered up that @sarowe picked this up and added it to Solr!

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Lance Norskog Thanks for the advice. So I need to use the SCP or SMB protocol to point to these model files somewhere on the network?

Which network protocols does SolrCloud support?

I didn't even know that SolrCloud could download model files from somewhere on the network.

asfimport commented 6 years ago

Lance Norskog (migrated from JIRA)

No, I think you may need to copy the model files to the right directory on each SolrCloud server via your own custom script. Or, have the files on a network share and then mount that share on each SolrCloud server, using the same letter on all servers.

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Lance Norskog "No, I think you may need to copy the model files to the right directory on [...]": but which directory is this?

I have tried some directories, but I have had no luck.

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Lance Norskog One more question. How can I use SMB and/or scp with SolrCloud correctly?

Even if I use something like this:

// smb://DESKTOP-LMQI80K/opennlp/en-tokenizer.bin or \\DESKTOP-LMQI80K/opennlp/en-tokenizer.bin or file://DESKTOP-LMQI80K/opennlp/en-pos-maxent.bin

Solr throws a strange error:

org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load conf for core numberplate_shard2_replica_n6: Can't load schema managed-schema: java.io.IOException: Error opening /configs/numberplate/smb://DESKTOP-LMQI80K/opennlp/en-pos-maxent.bin

It seems that it "wants to find" the files inside ZooKeeper.
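The path in the error message is consistent with the configured resource name being treated as relative to the collection's config directory in ZooKeeper. A minimal sketch of that behaviour, assuming plain concatenation (the function name is hypothetical and this is not Solr's actual code):

```python
def zk_resource_path(config_dir, resource):
    """Join a config directory in ZooKeeper with a resource name.

    Assumption for illustration: the resource name is never interpreted as a
    URL, so a scheme such as smb:// simply becomes part of the znode path.
    """
    return config_dir + "/" + resource


config_dir = "/configs/numberplate"  # taken from the error message above

# A relative model name resolves to a znode under the config directory.
print(zk_resource_path(config_dir, "opennlp/en-sent.bin"))

# A URL-style name yields the odd-looking path reported in the exception.
print(zk_resource_path(config_dir, "smb://DESKTOP-LMQI80K/opennlp/en-pos-maxent.bin"))
```

Under this assumption, smb:// and file:// values can never reach a network share, because they are looked up as znode names rather than fetched over the network.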

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

BTW, here is part of my managed-schema config:

    <fieldType name="text_opennlp_pos" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
                   sentenceModel="opennlp/en-sent.bin"
                   tokenizerModel="opennlp/en-token.bin"/>
        <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="smb://DESKTOP-LMQI80K/opennlp/en-pos-maxent.bin"/>
        <filter class="solr.TypeAsPayloadFilterFactory"/>
        <!-- <filter class="solr.TypeTokenFilterFactory" types="stop.pos.txt"/> -->
      </analyzer>
    </fieldType>

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Note the workaround on SOLR-4793 for ZK resources larger than 1M; from the ZK admin manual:

Unsafe Options

The following options can be useful, but be careful when you use them. The risk of each is explained along with the explanation of what the variable does.

[...]

jute.maxbuffer: (Java system property: jute.maxbuffer) This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.

 This is spelled out a little more here: https://www.shi-gmbh.com/tutorials/increase-file-size-zookeeper/
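As a rough pre-check before attempting an upload, a model file's size can be compared with the default limit quoted above. This is only a sketch: the function names and the 1024-byte headroom are assumptions made here, and the real limit applies to the whole serialized request, so a file just under the threshold may still be rejected.

```python
import os

DEFAULT_JUTE_MAXBUFFER = 0xFFFFF  # 1,048,575 bytes, the default quoted above


def fits_in_znode(size_bytes, limit=DEFAULT_JUTE_MAXBUFFER, headroom=1024):
    """Return True if data of this size is likely to fit in a single znode.

    headroom is an assumed safety margin for request overhead, not a value
    documented by ZooKeeper.
    """
    return size_bytes + headroom < limit


def model_fits(path, limit=DEFAULT_JUTE_MAXBUFFER):
    """Apply the same check to a local file, for example an OpenNLP model."""
    return fits_in_znode(os.path.getsize(path), limit)


print(DEFAULT_JUTE_MAXBUFFER)                                     # 1048575
print(fits_in_znode(500_000))                                     # True
print(fits_in_znode(5 * 1024 * 1024))                             # False
print(fits_in_znode(5 * 1024 * 1024, limit=10 * 1024 * 1024))     # True
```

If the check fails, the options raised in this thread are either raising jute.maxbuffer consistently on all servers and clients, or keeping the models on storage local to each Solr node.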

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

@sarowe Thanks for pointing me in the right direction. But maybe you know about putting model files somewhere on the network? This was my previous question. Maybe you know something about this, as Lance Norskog mentioned it?

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

@sarowe Thanks for pointing me in the right direction. But maybe you know about putting model files somewhere on the network? This was my previous question. Maybe you know something about this, as Lance Norskog mentioned it?

Sorry, I haven't tested this, but I believe you'll have to use locally attached storage on each server, and specify an absolute path.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I should mention that the ideal hosting location for OpenNLP models would be the Blob Store, but that is not currently possible for schema-loaded classes - see SOLR-9175.

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

@sarowe Thanks, I will try your solution. But I will also wait to hear from Lance Norskog about network solutions.

asfimport commented 6 years ago

Lance Norskog (migrated from JIRA)

I apologize, Alexey Ponomarenko, but I cannot help here. I have not worked with Solr for a few years.

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Lance Norskog OK, I will try @sarowe's solution, but I think we need to add some of what we have discussed to the documentation. Unfortunately, it seems that I don't have sufficient rights to do that ;)

asfimport commented 6 years ago

Alexey Ponomarenko (migrated from JIRA)

Hi once more. I am trying to implement named entity extraction using this manual: https://lucene.apache.org/solr/7_3_0//solr-analysis-extras/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.html

I modified solrconfig.xml like this:

 <updateRequestProcessorChain name="multiple-extract">
   <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
     <str name="modelFile">opennlp/en-ner-person.bin</str>
     <str name="analyzerFieldType">text_opennlp</str>
     <str name="source">description_en</str>
     <str name="dest">content</str>
   </processor>
 </updateRequestProcessorChain>

Then I tried to add data using:

request:

POST http://localhost:8983/solr/numberplate/update?version=2.2&wt=xml&update.chain=multiple-extract

<add><doc><field name="description_en">This is Steve Jobs 2 </field><field name="content_pos">This is text 2</field><field name="content">This is text for content 2</field></doc></add>

response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3</int>
    </lst>
</response>

But I don't see any data inserted into the content field or into any other field.

If you need some additional data I can provide it.

Can you help me? What have I done wrong?