contrib intelligent Analyzer for Chinese [LUCENE-1629] - Githubissues

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.68k stars 1.03k forks source link

contrib intelligent Analyzer for Chinese [LUCENE-1629] #2703

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

I wrote a Analyzer for apache lucene for analyzing sentences in Chinese language. it's called "imdict-chinese-analyzer", the project on google code is here: http://code.google.com/p/imdict-chinese-analyzer/

In Chinese, "我是中国人"(I am Chinese), should be tokenized as "我"(I) "是"(am) "中国人"(Chinese), not "我" "是中" "国人". So the analyzer must handle each sentence properly, or there will be mis-understandings everywhere in the index constructed by Lucene, and the accuracy of the search engine will be affected seriously!

Although there are two analyzer packages in apache repository which can handle Chinese: ChineseAnalyzer and CJKAnalyzer, they take each character or every two adjoining characters as a single word, this is obviously not true in reality, also this strategy will increase the index size and hurt the performance baddly.

The algorithm of imdict-chinese-analyzer is based on Hidden Markov Model (HMM), so it can tokenize chinese sentence in a really intelligent way. Tokenizaion accuracy of this model is above 90% according to the paper "HHMM-based Chinese Lexical analyzer ICTCLAL" while other analyzer's is about 60%.

As imdict-chinese-analyzer is a really fast and intelligent. I want to contribute it to the apache lucene repository.

Migrated from LUCENE-1629 by Xiaoping Gao, resolved May 14 2009 Environment:

for java 1.5 or higher, lucene 2.4.1

Attachments: analysis-data.zip, bigramdict.mem, build-resources.patch (versions: 2), build-resources-with-folder.patch, coredict.mem, LUCENE-1629-encoding-fix.patch, LUCENE-1629-java1.4.patch

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

Here is all the source code of intelligent analyzer for Chinese. About 2500 lines The unit TestCase contains a main method, which needs lexical dictionary to run, so I will post the binary lexical dictionary soon.

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

Lexical dictionary files, unzip it to somewhere, run TestSmartChineseAnalyzer with this command: java org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer -Danalysis.data.dir=/path/to/analysis-data/

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good – thanks Xiaoping!

One problem is that contrib/analyzers is currently limited to Java 1.4, and I don't think we should change that at this point (though in 3.0, we will change it to 1.5). How hard would it be to switch your sources to use only Java 1.4?

A couple other issues:

Each copyright header is missing the starting 'S' in the sentence 'ee the License for the specific language governing permissions and'
Can you remove the @author tags? (Lucene sources don't include author tags anymore)

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Xiaoping,

looks good, but I have some suggestions:

Making the datafile only readable from a RandomAccessFile makes it hard to e.g. move the data file directly into the jar file. I would like to put the data file directly into the package directory and load it with Class.getResourceAsStream(). In this case, the binary Lucene analyzer jar would be ready-to-use and the analyzer would run out of the box. Often configuring external files in e.g. web applications is complicated. An automatism to load the file from the JAR would be fine.
I have seen some singleton implementations, where the getInstance() static method is not synchronized. Without it there may be more than one instance, if different threads call getInstance() at the same time or close together.
Do we compile the source files with a fixed encoding of UTF-8 (build.xml?). If not, there may be problems, if the Java compiler uses another encoding (because platform default).

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

to McCandless: There is lots of code depending on Java 1.5, I use enum, generalization frequently. Because I saw these points on apache wiki:

All core code to be included in 2.X releases should be compatible with Java 1.4.
All contrib code should be compatible with either Java 5 or 1.4. I have corrected the copyright header and @author tags, thank you.

to Schindler:

This is really a good idea, I wanna to move the data file into jar in next develop cycle, but now I need to make some changes to the data files independently, can I just commit the codes now?
I have changed the getInstance() method to synchronized
All the source files are fixed encoded using UTF-8, and I had put a notice in package.html, Should I do something else?

Thank you all!

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

New patch in reply to Michael McCandless and Uwe Schindler 's comments.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi,

I see in the paper that lexical resources were also developed for Big5 (traditional chinese). Are you able to acquire these resources with BSD license as well?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

There is lots of code depending on Java 1.5, I use enum, generalization frequently. Because I saw these points on apache wiki:

Well... "in general" contrib packages can be 1.5, but the analyzers contrib package is widely used, and is not 1.5 now, so it's a biggish change to force it to 1.5 with this. We should at least separate discuss in on java-dev if we want to consider allowing 1.5 code into contrib-analyzers.

We could hold off on committing this until 3.0?

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

I have ported the code to Java1.4 today, fortunately there were not so much problems.

"Lucene-1629-java1.4.patch" is all the code working on Java 1.4, I have just changed it to fit Java1.4 code style.Data structures and algorithms are not modified. It has been tested that it would produce the very same result, just with a slight affection on speed.

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

all the code working on java1.4

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

all the code working on java1.4

Fabulous, thanks Xiaoping!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

When I apply the patch and then run "ant test" in contrib/analyzers, I'm hitting this compilation error:

compile-core:
    [mkdir] Created dir: /lucene/src/cn.1629/build/contrib/analyzers/classes/java
    [javac] Compiling 88 source files to /lucene/src/cn.1629/build/contrib/analyzers/classes/java
    [javac] /lucene/src/cn.1629/contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/AnalyzerProfile.java:98: load(java.io.InputStream) in java.util.Properties cannot be applied to (java.io.FileReader)
    [javac]       prop.load(reader);
    [javac]           ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] 1 error

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

new patch for java1.4, I have corrected the bug "java.util.Property.load(Reader)". The new code can now be compiled now.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Xiaoping, could you turn the TestSmartChineseAnalyzer into a real JUnit test case? (Ie, invoke that sample method from the testChineseAnalyzer method)?

Also, it looks like you didn't switch to Class.getResourceAsStream() (Uwe's suggestion above) – are you planning on doing that?

Finally, Robert asked a question above (about Big5) that maybe you missed?

Do we compile the source files with a fixed encoding of UTF-8 (build.xml?). If not, there may be problems, if the Java compiler uses another encoding (because platform default).

Lucene's common-build.xml already sets the encoding (for javac) to utf-8. So I think we're good here...

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

to Robert Muir: The dictionary only supports GB2312 encoding now, which has about 6800 characters, so I don't think it can support big5 encoding with this dictionary. You can ask the author about the big5 issue. May be he has another lexical dictionary.

Now I will switch to Class.getResourceAsStream() to load the dictionary, so the user don't have to download the dictionary independently. After that I can write a real JUnit test case.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Xiaoping, thanks. I see they didn't get great performance with big5 tests but just curious.

Maybe mention somewhere in the javadocs that this analyzer is for simplified chinese text, just so its clear?

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

changes

Add two binary dictionary files into the java package: coredict.mem(1.6M) bigramdict.mem(4.7M), I'll post them after this
Using Class.getResourceAsStream() to load the dictionary, so users don't need to download dictionaries manually.
Switch TestSmartChineseAnalyzer into a real JUnit test case

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

two binary dictionary files, please put them into contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

When I run "ant test" in contrib/analyzers, SmartChineseAnalyzer is unable to locate the stopwords.txt:

    [junit] Testcase: testChineseAnalyzer(org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer):  Caused an ERROR
    [junit] null
    [junit] java.lang.NullPointerException
    [junit]     at java.io.Reader.<init>(Reader.java:61)
    [junit]     at java.io.InputStreamReader.<init>(InputStreamReader.java:80)
    [junit]     at org.apache.lucene.analysis.cn.SmartChineseAnalyzer.loadStopWords(SmartChineseAnalyzer.java:112)
    [junit]     at org.apache.lucene.analysis.cn.SmartChineseAnalyzer.<init>(SmartChineseAnalyzer.java:71)
    [junit]     at org.apache.lucene.analysis.cn.TestSmartChineseAnalyzer.testChineseAnalyzer(TestSmartChineseAnalyzer.java:36)

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

On Mon, May 11, 2009 at 6:57 PM, Michael McCandless (JIRA)

stopwords.txt should be in the same package as org.apache.lucene.analysis.cn.SmartChineseAnalyzer , can you find it there?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I do have the file, but at runtime the JRE cannot locate it using Class.getResourceAsStream().

Are you able to run "ant test -Dtestcase=TestSmartChineseAnalyzer" from the command line in contrib/analzyers successfully?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Did the <jar> ANT task also adds the non *.class files? During compilation, the additional files must be copied to the build directory, this is normally done by an additional copy task (I do it in this way). The Packager then packs all files below build into the jar file. Maybe the build script must be modified? I will try this out later.

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

I think Schindler should be right. I modified the code to skip loading stopwords.txt, but NullPointerException pop out again when loading coredict.mem file. When I run TestSmartChineseAnalyzer using eclipse, it just run successfully. So the problem might exist in the ant build script.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I did some checks now, it is the problem of the ant script. Because of this, e.g. ArabicAnalyzer throws an IOException (but this is not tested, and so no test failures occur). The ant script should copy all the data files to the build/classes directory after compiling and before jaring.

I do not know, how to fix this correctly, because I do not fully understand all the parts of the build files and how maven and common-build.xml works together with contrib-build and so on. The simpliest would be to customize the "compile" target for the analyzers package and list there all files that must be copied during the compilation step.

Should I open an additional bug report for the ArabicAnalyzer, or should we fix the build.xml for analyzers with this case?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The simpliest would be to customize the "compile" target for the analyzers package and list there all files that must be copied during the compilation step.

Let's just do this fix, under this issue, for all contrib/analyzers that need to load a resource?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Mike,

here is a patch that adds a maven-like resources directory. It patches the build script in two ways:

The junit test classpath is extended to include src/resources
The jarify macro is changed to also add src/resources to the jar file

So all resource files mut be put into the corresponding subdirectory under src/resources. The patch contains this for the stopword.txt file af the arabic analyzer. The data files should be removed from src/java.

The cn analyzers stopwords must be put in the top-level cn directory, the mem files into cn/smart/hhmm (I took me some time to find this out).

The patch also includes some src/resources directory additions. For the compilation to work, every src/ directory now needs at least an empty resources folder. I found no way to make the jarify macro work without this?

If somebody has an idea, it would be good.

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

I think it is unacceptable to ask every package to have a resources folder,
can we write the build script to test whether the resources file exists,
like this: <available property="resources.exists" file="${resources.dir}" type="dir"/> <target name="index" depends="compile" description="Build WordNet index"> <do_something if="sources.exists"> package the reources. </do_something>

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I know this, the problem with th lucene build is that JAR ing is done using a macro called <jarify>. And here this is not possible. From ANT 1.7.1 on there is the possibility to specify a "erroronmissingdir" when using <fileset/>: http://ant.apache.org/manual/CoreTypes/fileset.html

I do not know what version of ant we require, but using it, the error can be avoided.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

(Shooting in the dark, here, since I'm no ant expert...)

Lucene's common-build.xml has this:

<!-- Copy any data files present to the classpath -->
<copy todir="@{destdir}">
  <fileset dir="@{srcdir}" excludes="**/*.java"/>
</copy>

Which for all tests will copy any resources (any file that's not *.java) into the corresponding build/classes directory; eg contrib/xml-query-parser's tests rely on this. This approach doesn't cause any errors when a given contrib module doesn't have resources. Is there some way to use a similar approach here (and not bump up the minimum ant version required)?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I wonder, why this build fragment did not work for contrib? The only problem is, that this also copies the package. and overview javadoc files. They should also be excluded.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

That fragment is under "compile-test-macro", which is run only on src/test/*. I agree, we should fix it to not copy package/javadoc files.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I will look into it this evening and provide a patch.

Because of the file exclusion problematics, I thought, the approach to have a separate resources directory (like Maven does it), would be a great new invention. We could also do this for the tests. In my opinion, data files should be separated from source files. And by adding the resources folder to classpath during tests saves a lot of disk space during compilation and testing (ok, thats not important). By this compilation/test class path and building the jar files are separate tasks. The problem with my current approach is only, that the JAR packager fails, when the directory is not available :( - Is it so bad to just add an empty resources folder to every compilation unit? This would be similar to Maven.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I agree, separation of resources from source code is good.

Can we limit the required addition of src/resources/org/apache/lucene/* to just contrib/analyzers? Ie, somehow only override its jarify macro?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Its only needed to have the src/resources folder, no subfolders, I think it would be no problem to add this folder to every compilation unit (I added it to my svn in minutes). The good thing is, that future developments then know, where to put the resource files. But I agree, there should be a better way to automatically detect the resources folder before ANT 1.7.1.

Maybe we should ask Erik Hatcher as the ANT specialist...!

asfimport commented 15 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

My initial thought is to move the <copy> excluding ``` */.java and */.html

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here another try with Erik's suggestion: I moved the <copy> task to the compile macro and extended the list of exclusions. With some work and verbose=true, I added all "source" files to the exclusion (also .jj and so on).

Using this patch, you can compile Xiaoping Gao patch, add the resources to cn/ and cn/smart/hhmm/ and they appear in classpath for testing and the final jar file.

My problem with this is the messy exclusion list. During reading ANT docs, I dound out that there is the possibility with the <copy> task to not stop on errors. The idea is now again to put the data files into a maven-like resources folder and just copy them to the classpath (if the folder does not exist, copy would simply do nothing).

I post a patch/test later.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This is a second try, again with the resources folder. It is now optional, to have a src/resources folder, if it exists, all files from inside are copied to the build destination.

The trick was, that the copy task can additionally use a globmapping, and by that, does the following:

The source fileset of the copy task uses the src/ folder directly
The fileset only includes resources/**
Because then the target folder would get an additional sub-folder "resources" (because the base dir of the copy operation is "src/"), the filenames are replaced by a globmapping, stripping the "resources/" from the relative path

This patch also adds a simple test case, that shows, that ArabicAnalyzer does not start correctly, when the stopwords.txt file is not in the classpath. The test fails, if the stopwords.txt file stays at the original location and/or the copy operation is commented out.

The patch does not contain the deletion of the arabic stopwords file from the sources folder (was binary), so remove it by hand or simply move it after aplying the patch.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Awesome! I've applied your patch, Uwe, and moved ArabicAnalyzer's stopwords.txt, as well as SmartChineseAnalyzer's stopwords.txt, bigramdict.mem, coredict.mem, under their respective subdirs under src/resources/*. I confirmed TestArabicAnalyzer passes (and verified it really did instantiate ArabicAnalyzer). All tests pass.

I will commit shortly.

This issue is a delightful example of the collaboration that makes open source development work so well. Thanks Xiaoping, Uwe and Erik!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks everyone!

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Fine! Should I commit the ArabicAnalyzer test, too? But I think the test is not really needed, as the new chinese analyzer already tests for the resources implicit.

One thing: The change is in the main changes.txt, normally it should be in contrib's changes.txt, or not? If it should stay there, we should also add Spatial and TrieRange to main changes.txt.

And one other thing: The analyzer (and many more) use the old TokenStream API at the moment, we should change this before 2.9 for all contrib analyzers, see #2534?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Should I commit the ArabicAnalyzer test, too?

Woops, I missed it – I'll commit it. The more tests the better!

The change is in the main changes.txt, normally it should be in contrib's changes.txt, or not?

Woops – you're right. I'll move this to contrib's CHANGES.txt.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The analyzer (and many more) use the old TokenStream API at the moment, we should change this before 2.9 for all contrib analyzers, see #2534?

Yes – we need to resolve #2534 (and a great many more; the list keeps growing!) before 2.9.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Mike, a small patch: The HTML files generated by Javadoc do not contain the charset header and are displayed as ISO-8859-1. This breaks the docs for the chinese analyzer. The attached patch sets the output encoding correctly to UTF-8 using the <meta/> html tag.

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

Test successful on my laptop now! Thank all of you for your patience and hard work! I will continue to maintain this analyzer and develop new features.

Best Wishes!

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I just committed that fix (javadocs encoding == UTF-8) Uwe. Thanks.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Xiaoping,

Thanks! The code is now committed.

Only for the understanding (as I do not know chinese and cannot read some comments), some questions/comments: The .mem files are serializations of the dictionaries. They are created by loading from the random access file (these dct files) and then serialized to the mem files. But for developers and further updates you need to have the dct files and rerun these steps (that are all these private methods). An interesting addition would be to create a custom build step, that uses the dct files and builds the .mem files from it. How could I invoke that? So maybe you could extract the useless dct file loaders from the current classes and create a separate tool from it, that could be invoked from ant, that builds that mem files.

Uwe

P.S.: By the way: In these private conversation methods (that are never called from the library code) you have these default try-catch blocks, which is bad programming practice. So the proposed separate conversion tool should correctly handle the exceptions or better just not catch them at all and pass up (side note: I hate eclipse for generating these auto-catch blocks, better would be to auto-add throws-clauses to the method signatures!)

asfimport commented 15 years ago

Mingfai Ma (@mingfai) (migrated from JIRA)

hi Xiaoping,

I'm interested to get the Chinese analyzer work for Traditional Chinese (UTF-8/Big5). Just wonder if your coredict.dct comes from ICTCLAS? (http://ictclas.org/Down_share.html) if yes, is it 2009 or 2008?

The ICTCLAS has traditional chinese edition for its 2008 release. But the distribution are not in .dct. I wonder if we have a simple specification for the .dct so I could find a way to convert the ICTCLAS's lexical dictionary to the .dct format to work with your library?

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

Hello Mingfai!

coredict.mem is converted from coredict.dct which come from ICTCLAS1.0,
neither 2008 nor 2009. The author authorized me to release just the lexical dictionary from
ICTCLAS1.0 under APLv2, but he didn't authorize the dictionary of
ictclas2008\~2009. As far as I know, coredict.dct just contain GB2312 characters, so it cannot
support Big5.

I think we should find the proper big5 dictionary first, then I will help
you to convert to dct file.

On May 15, 2009 6:20pm, "Mingfai Ma (JIRA)" <jira@apache.org> wrote:

asfimport commented 15 years ago

Xiaoping Gao (migrated from JIRA)

Hello Mingfai!

coredict.mem is converted from coredict.dct which come from ICTCLAS1.0,
neither 2008 nor 2009. The author authorized me to release just the lexical dictionary from
ICTCLAS1.0 under APLv2, but he didn't authorize the dictionary of
ictclas2008\~2009. As far as I know, coredict.dct just contain GB2312 characters, so it cannot
support Big5.

I think we should find the proper big5 dictionary first, then I will help
you to convert to dct file.

On May 15, 2009 6:20pm, "Mingfai Ma (JIRA)" <jira@apache.org> wrote:

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

if you acquire the big5 resources, do you think it would be possible to create a single dictionary that works with both Simplified & Traditional?

(i.e. merge the big5 resources with the gb resources)

The reason I say this, is the existing chinese analyzers, although they tokenize in a less intelligent way, they are agnostic to Simplified/Traditional issues...

Next