apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Build kuromoji system dictionary as a separated jar and load it from JapaneseTokenizer at runtime [LUCENE-8869] #9912

Open asfimport opened 5 years ago

asfimport commented 5 years ago

This is a sub-task for #9860. In this issue, I will try to make small but self-contained changes to the kuromoji system dictionary.

Also, some refactoring of the directory/source tree structure may be needed.


Migrated from LUCENE-8869 by Tomoko Uchida (@mocobeta), 1 vote, updated Jun 23 2019

asfimport commented 5 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

As a first step, I moved the dictionary data (dat files) to a separate jar on my local branch. https://github.com/mocobeta/lucene-solr-mirror/commit/9def2b22f4e7467bef72edfac84c9f74f67289aa

In order to build and ship two jars (one for the kuromoji analyzer, one for the system dictionary), I slightly changed the directory structure:

analysis/kuromoji/
├── build.xml
├── ivy.xml
├── src
│     ├── java
│     │     ├── org
│     │     └── overview.html
│     ├── resources
│     │     ├── META-INF
│     │     └── org
│     ├── test
│     │     └── org
│     └── tools
│           ├── java
│           ├── patches
│           └── test
└── sysdic
        └── src
              └── resources

Here, a sysdic directory is added, and the build-dict task now places all dat files in sysdic/src/resources instead of src/resources.

On the JapaneseTokenizer side, all dictionary data is currently held in static singleton fields. We need to make it possible to load the dictionary data flexibly, from a jar or from a directory path (for testing purposes), when initializing a tokenizer, so that users can choose an arbitrary dictionary at runtime.
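To make the idea concrete, here is a minimal sketch of what "load from a jar or a directory path" could look like. The class `DictionaryResourceLoader` and its methods are hypothetical names used only for illustration; they are not part of Lucene and not the actual change on the branch. The sketch only shows how a single dictionary file might be opened from either source.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (NOT a Lucene class): opens a dictionary data file
 * either from the classpath (e.g. a separate dictionary jar) or from a
 * directory on disk (e.g. for tests).
 */
public final class DictionaryResourceLoader {
  private final Path directory;          // null means "use the classpath"
  private final ClassLoader classLoader; // null means "use the directory"

  private DictionaryResourceLoader(Path directory, ClassLoader classLoader) {
    this.directory = directory;
    this.classLoader = classLoader;
  }

  /** Resolve dictionary files through the given class loader. */
  public static DictionaryResourceLoader fromClasspath(ClassLoader classLoader) {
    return new DictionaryResourceLoader(null, classLoader);
  }

  /** Resolve dictionary files relative to the given directory. */
  public static DictionaryResourceLoader fromDirectory(Path directory) {
    return new DictionaryResourceLoader(directory, null);
  }

  /** Open the named dictionary file; the caller is responsible for closing the stream. */
  public InputStream open(String name) throws IOException {
    if (directory != null) {
      return Files.newInputStream(directory.resolve(name));
    }
    InputStream in = classLoader.getResourceAsStream(name);
    if (in == null) {
      throw new FileNotFoundException("Dictionary resource not found on classpath: " + name);
    }
    return in;
  }
}
```

With something along these lines, a tokenizer could be handed a loader at construction time instead of reading from static singleton fields, and tests could point it at a temporary directory.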

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

@mocobeta there might be some minor conflicts with #9914, since it also touches the code that reads the resources, but they should be easy to resolve, I think?

 

asfimport commented 5 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

@msokolov thanks for the heads-up. There may be minor conflicts, but yes, they should be easy to resolve. (It seems you are almost done, so I will pick up the changes from master.)