apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

the korean analyzer that has a korean morphological analyzer and dictionaries [LUCENE-4956] #6020

Open asfimport opened 11 years ago

asfimport commented 11 years ago

The Korean language has specific characteristics. When developing a search service with Lucene and Solr in Korean, there are some problems in searching and indexing. The Korean analyzer solves these problems with a Korean morphological analyzer. It consists of a Korean morphological analyzer, dictionaries, a Korean tokenizer, and a Korean filter. The Korean analyzer is made for Lucene and Solr. If you develop a search service with Lucene in Korean, the Korean analyzer is the best choice.


Migrated from LUCENE-4956 by SooMyung Lee, 4 votes, updated Feb 09 2014 Attachments: eval.patch, kr.analyzer.4x.tar, lucene4956.patch, lucene-4956.patch, LUCENE-4956.patch

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi, Christian. By the end of this week, I'll prepare some test cases and documents that explain how the options work and why they are needed.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I've run KoreanAnalyzer on Korean Wikipedia and also had a look at memory/heap usage. Things look okay overall.

I believe KoreanFilter uses wrong offsets for synonym tokens, which was discovered by random-blasting. Looking into the issue...

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi, Christian.

I have made some changes to the source code and uploaded them. I have changed the source code relating to keyword extraction: I removed the properties relating to keyword extraction and changed the keyword extraction logic. I've also added a test case that describes how the Korean analyzer works.

I hope this is of some help to you!

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi, Christian.

I named the Korean analyzer package "kr", but I recently found that this is incorrect. "kr" is the country code of South Korea and "kp" is the country code of North Korea. I think "ko" is more suitable for the name of the Korean analyzer package, since "ko" is the Korean language code. So, I would like you to rename the Korean analyzer package from "kr" to "ko".

asfimport commented 11 years ago

Walter Underwood (@wrunderwood) (migrated from JIRA)

Yes, "ko" is correct. Use country codes for locales, but language codes for stemmers.
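The language-versus-country distinction can be seen in Java's own `java.util.Locale`. The following is only an illustrative sketch (the class name `LanguageVsCountry` is made up for this example): both Korean locales share the language code "ko" and differ only in the country code.

```java
import java.util.Locale;

final class LanguageVsCountry {
    private LanguageVsCountry() {}

    /** Returns the ISO 639 language code of a locale, e.g. "ko" for Korean. */
    static String languageCode(Locale locale) {
        return locale.getLanguage();
    }

    /** Returns the ISO 3166 country code of a locale, e.g. "KR" for South Korea. */
    static String countryCode(Locale locale) {
        return locale.getCountry();
    }

    public static void main(String[] args) {
        Locale south = Locale.KOREA; // language "ko", country "KR"
        Locale north = new Locale.Builder().setLanguage("ko").setRegion("KP").build();
        // An analyzer depends on the language, so both locales map to the same package name.
        System.out.println(languageCode(south) + " / " + countryCode(south));
        System.out.println(languageCode(north) + " / " + countryCode(north));
    }
}
```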

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I named the Korean analyzer package "kr", but I recently found that this is incorrect. "kr" is the country code of South Korea and "kp" is the country code of North Korea. I think "ko" is more suitable for the name of the Korean analyzer package, since "ko" is the Korean language code. So, I would like you to rename the Korean analyzer package from "kr" to "ko".

Hi SooMyung, thanks, I'll make the switch (unless Christian beats me to it).

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I'm happy to take care of this unless you want to do it, Steve. I can do this either tomorrow or on Friday. Thanks.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Hi Christian,

I'm in process now, should be done in a little bit.

BTW, I also brought the branch up-to-date with trunk.

Steve

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks a lot!

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I'm in process now, should be done in a little bit.

Done: committed the 'kr'->'ko' switch at r1486269 on branches/lucene4956/.

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi, Steve

I see you created the 4.4 branch for the release. After looking it over, I found that the Korean analyzer (Arirang) is missing.

Can you tell me when the Korean analyzer can be incorporated into a release?

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Hello SooMyung,

I'm the one who hasn't followed up properly on this, as I've been too bogged down with other things. I've set aside time next week to work on this and I hope to have Korean merged and integrated with trunk then. I'm not sure we can make 4.4, but I'm willing to put in extra effort if there's a chance we can get it in in time.

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi, Christian

I can understand your situation. I know you run the company.

I was just wondering if there is any problem with integrating it. If you need any help, please let me know.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I've now aligned the branch with trunk and updated the example schema.xml to use text_ko naming for the Korean field type.

I've also indexed Korean Wikipedia continuously for a few hours and the JVM heap looks fine.

There are several additional things that can be done with this code, including generating the parser using JFlex at build time, fixing some of the position issues with random-blasting, cleanups and dead-code removal, etc. This said, I believe the code we have is useful to Korean users as-is and I'm thinking it's a good idea to integrate it into trunk and iterate further from there.

Please share your thoughts. Thanks.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Attaching a patch against trunk (r1513348).

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

SooMyung, let's sync up regarding your latest changes (the patch you attached). I'm thinking perhaps we can merge to trunk first and iterate from there. Thanks.

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi, Christian. Thank you for your effort.

I'll review the changes you made. I've also made some improvements. I'll upload the patch soon.

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi Christian,

I have synced up and made some modifications. I'm attaching the patch.

asfimport commented 10 years ago

SooMyung Lee (migrated from JIRA)

Hi Christian,

I haven't heard any news from you since last August. Is there any problem with moving to the next step?

I run a Korean developers' community for the Korean analyzer. I announced that the Arirang analyzer will be incorporated into Lucene and Solr soon, so many developers are waiting for that.

I would like us to move to the next step quickly. If you need any help, please let me know.

asfimport commented 10 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks for pushing me on this. I'll have a look at your recent changes and commit to trunk shortly if everything seems fine. I hope to have this committed to trunk early next week. Sorry for this having dragged out.

asfimport commented 10 years ago

Christian Moen (@cmoen) (migrated from JIRA)

SooMyung,

The patch you uploaded on September 11th, was that made against the latest lucene4956 branch?

The patch doesn't apply properly against lucene4956 for me. Could you clarify its origin and instruct me how it can be applied? If you can make a patch against the code on lucene4956, that would be much appreciated.

Thanks!

asfimport commented 10 years ago

SooMyung Lee (migrated from JIRA)

Christian,

Yes, I made my last patch against lucene4956. I'll check the problem and let you know how to solve it by the end of today.

asfimport commented 10 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks a lot.

asfimport commented 10 years ago

Christian Moen (@cmoen) (migrated from JIRA)

SooMyung and I met up in Seoul today and we've merged his latest changes locally. I'll commit the changes to this branch when I'm back in Tokyo and SooMyung will follow up with fixing a known issue afterwards. Hopefully we can commit this to trunk very soon.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi, I have seen the same code at a customer and found a big bug in FileUtils and JarResources. We should fix and replace this code completely. It's not platform independent. We should fix the following (in my opinion) horrible code parts:

We should remove both classes completely and load resources correctly with Class#getResourceAsStream.
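A minimal sketch of the suggested approach, assuming the dictionary files are packaged next to a class on the classpath. The helper name `ClasspathResources` and the file name used in `main` are illustrative only and are not taken from the branch.

```java
import java.io.IOException;
import java.io.InputStream;

final class ClasspathResources {
    private ClasspathResources() {}

    /**
     * Opens a resource that sits next to the given class on the classpath.
     * This works the same way whether the classes live in a directory or in a JAR.
     */
    static InputStream open(Class<?> anchor, String name) throws IOException {
        InputStream in = anchor.getResourceAsStream(name);
        if (in == null) {
            // getResourceAsStream reports a missing resource with null, not an exception.
            throw new IOException("Resource not found: " + name);
        }
        return in;
    }

    public static void main(String[] args) {
        // "total.dic" is only an example name; it will normally not exist next to this class.
        try (InputStream in = open(ClasspathResources.class, "total.dic")) {
            System.out.println("opened, first byte: " + in.read());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the lookup goes through the class loader, no file-system paths or path separators are involved, which avoids the platform dependence mentioned above.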

asfimport commented 10 years ago

SooMyung Lee (migrated from JIRA)

@uschindler Thank you for your advice. This source code has been available on SourceForge since 2009 and has many users, but nobody told me about the bugs and I was not aware of them either. Christian and I will fix the bugs soon. Thank you again.

asfimport commented 10 years ago

Christian Moen (@cmoen) (migrated from JIRA)

SooMyung, I've committed the latest changes we merged in Seoul on Monday. It's great if you can fix the decompounding issue we came across, which we disabled a test for.

Uwe, +1 to use Class#getResourceAsStream and remove FileUtils and JarResources. I'll make these changes and commit to the branch.

Overall, I think there are a lot of things we can do to improve this code. I would very much like to hear your opinion on what we should fix before committing to trunk, getting this on the 4.x branch, and improving from there. My thinking is that it might be good to get this committed so we'll have Korean working even though the code needs some work. SooMyung has a community in Korea that uses it, and it's serving their needs as far as I understand.

Happy to hear people's opinion on this.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Do we need the Tokenizer here at all or just the filter?

StandardTokenizer is now tagging runs of hangul text with <HANGUL> and cjk text with <IDEOGRAPHIC> in TypeAttribute, isn't that essentially what is needed there?

The current tokenizer here just seems to be a clone of an old version of StandardTokenizer.

The filter needs a reset() at the very least, that seems to be the issue with testRandom.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1532707 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1532707

LUCENE-4956: First step in remove buggy resources stuff:

Still needs more refactoring!

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I committed a cleanup of most of the broken and slow resources stuff. It now only uses Class.getResourceAsStream. I also removed code from the FileUtils class (now named DictionaryResources) which was clearly code cloned from somewhere else.

The resource loading can be further improved:

The code also has legal problems:

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1532737 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1532737

LUCENE-4956: Remove stuff not really needed. TODO: add attribution, because this code is borrowed, too!

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I removed more stuff. Some code was borrowed from common-lang without attribution, too. We have to review the whole code, so we don't violate copyrights or licenses!

One thing we need to change, too: this code uses the pattern "catch all Exceptions" and rethrow as another one. This affects MorphException. This class should be removed and all methods should simply declare the exceptions they throw. In particular, we are not allowed to swallow stack traces! Also, the code sometimes prints to System.out!

MorphException is crazy altogether: it morphs itself sometimes :-)
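To illustrate the point about exceptions, here is a small sketch contrasting the two styles. `LossyException` is a hypothetical stand-in and not the actual MorphException; the method names are made up as well.

```java
import java.io.BufferedReader;
import java.io.IOException;

final class ExceptionStyle {
    private ExceptionStyle() {}

    /** Hypothetical wrapper that keeps only the message and drops the cause. */
    static final class LossyException extends Exception {
        LossyException(String message) {
            super(message); // the original exception and its stack trace are lost here
        }
    }

    /** Discouraged: catches everything and rethrows without the cause. */
    static String firstLineSwallowing(BufferedReader reader) throws LossyException {
        try {
            return reader.readLine();
        } catch (Exception e) {
            throw new LossyException(e.getMessage());
        }
    }

    /** Preferred: declare the exception the method can actually throw. */
    static String firstLine(BufferedReader reader) throws IOException {
        return reader.readLine();
    }
}
```

With the second form the caller sees the original `IOException`, including its full stack trace, and no catch-all block can hide unrelated programming errors.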

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1532739 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1532739

LUCENE-4956: More obsolete stuff (not even used), some moves to classes where code parts are solely used

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1532747 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1532747

LUCENE-4956: Hide ctor of static utility classes

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1532748 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1532748

LUCENE-4956: Cleanup imports

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1532749 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1532749

LUCENE-4956: Cleanup imports

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1532750 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1532750

LUCENE-4956: Replace StringBuffer by StringBuilder

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I did a very quick and dirty evaluation of various analyzers (short queries only) with the HANTEC-2 test collection (http://ir.kaist.ac.kr/anthology/2000.10-%EA%B9%80%EC%A7%80%EC%98%81.pdf)

I compared 4 different analyzers for index time, size, and mean average precision for the "L2" relevance set:

For each one, I used 3 different ranking strategies: DefaultSimilarity, BM25Similarity, and DFR GL2, no parameter tuning of any sort.

Analyzer   Index Time   Index Size   MAP(TFIDF)   MAP(BM25)   MAP(GL2)
Standard   31s          128MB        .0959        .1018       .1028
CJK        30s          162MB        .1746        .1894       .1910
Korean     195s         125MB        .2055        .2096       .2058
Mecab      138s         147MB        .1877        .1960       .1928

Note that on the first try, I was unable to actually index the entire collection with KoreanAnalyzer, so I had to hack the filter to prevent this:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 4
    at java.lang.String.substring(String.java:1907)
    at org.apache.lucene.analysis.ko.KoreanFilter.analysisChinese(KoreanFilter.java:405)
    at org.apache.lucene.analysis.ko.KoreanFilter.incrementToken(KoreanFilter.java:147)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:54)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)

See the patch for more information (you can also download the data from http://www.kristalinfo.com/TestCollections/ and set some constants and run it yourself).

Don't read too far into it, this was really quick and dirty and might somehow be biased. For example, there are several charset issues in the test collection... But it looks like the analyzer here is effective.

asfimport commented 10 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

As a potential user of this technology, I'd like to ask for it to have documentation of its linguistic approach.

asfimport commented 10 years ago

SooMyung Lee (migrated from JIRA)

@bimargulies-google The Korean tokenizer has a feature that identifies the language (Korean, English, or Chinese) within a Korean sentence. An eojeol in a Korean sentence usually falls into one of four cases. First, the eojeol consists only of Korean letters. Second, the eojeol is a combination of Korean letters and alphanumeric characters. Third, the eojeol consists only of alphanumeric characters. Fourth, the eojeol consists of Chinese characters. The tokenizer treats the first and second cases as Korean, so Korean morphological analysis is performed in the Korean filter. For the third case, I copied code from the standard filter into the Korean filter. For the fourth case, the Korean filter maps the Chinese characters to their Korean sounds and then, if the result is a compound noun, decompounding is performed based on the dictionary.
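The four cases can be sketched roughly as follows. This is not the tokenizer's actual code; it is a simplified illustration that assumes classification by Unicode script, and the names `EojeolClassifier` and `Kind` are made up. Non-ASCII test strings are written as `\u` escapes so the file compiles under any source encoding.

```java
final class EojeolClassifier {
    private EojeolClassifier() {}

    enum Kind { KOREAN, ALPHANUM, CHINESE, OTHER }

    /** Classifies one whitespace-delimited unit by the scripts of its characters. */
    static Kind classify(String eojeol) {
        boolean hangul = false, han = false, alnum = false;
        for (int i = 0; i < eojeol.length(); ) {
            int cp = eojeol.codePointAt(i);
            i += Character.charCount(cp);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script == Character.UnicodeScript.HANGUL) {
                hangul = true;
            } else if (script == Character.UnicodeScript.HAN) {
                han = true;
            } else if (Character.isLetterOrDigit(cp)) {
                alnum = true;
            } else {
                return Kind.OTHER; // punctuation, symbols, whitespace
            }
        }
        if (hangul && !han) return Kind.KOREAN;             // cases 1 and 2
        if (alnum && !hangul && !han) return Kind.ALPHANUM; // case 3
        if (han && !hangul && !alnum) return Kind.CHINESE;  // case 4
        return Kind.OTHER;                                  // empty or mixed input
    }

    public static void main(String[] args) {
        System.out.println(classify("\uD55C\uAE00"));  // Hangul only
        System.out.println(classify("\uD55C\uAE002")); // Hangul plus a digit
        System.out.println(classify("lucene4"));       // alphanumeric only
        System.out.println(classify("\u97D3\u570B"));  // Han characters only
    }
}
```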

asfimport commented 10 years ago

SooMyung Lee (migrated from JIRA)

Hi, all.

I'm going to explain how I developed this code, as Christian recommended, because of the license and legal problems that Jack Krupansky mentioned in a previous comment.

I started to write this code and the dictionary in 2006, based on a book by Seung-Shik Kang, who is now a professor at Kookmin University.

The dictionary consists of several files, but the major files are total.dic, josa.dic, eomi.dic and syllable.dic. In the first step of developing the dictionary, I collected basic stem words for total.dic and particles for josa.dic and eomi.dic from the book and various websites. Then I surveyed online dictionaries to see how the basic stem words can be used. I referred only to the book to make syllable.dic. The rest of the files were created by myself during development, except for mapHanja.dic. I added this file two years ago. I'm not sure that this file is free of legal problems, because much of its data came from project results, so it is better to remove that data.

To write the source code, I referred to the book, so the major logic is based on the book, except for some utility classes such as String, File and Trie.java. I copied most of the utility classes from the Apache Commons project, but Trie.java came from another website. I cannot remember the exact website now because it happened a long time ago, but I remember reading that the license was the Apache license.

I finished the first version in 2008, created an online community on a website called Naver, and uploaded the source code. The community currently has over 3700 members. I entered an open-source contest held by a Korean government organization in 2009. During the contest, I uploaded the source code to SourceForge, had the code run through a BlackDuck license test, and it passed.

I have supported users through the online community (http://cafe.naver.com/korlucene). Some users improved the dictionaries and source code and posted their changes on the website, and I merged them and published the result again.

This is the whole process of how I developed the code. If anybody has something to recommend, please let me know.

asfimport commented 10 years ago

Benson Margulies (@bimargulies-google) (migrated from JIRA)

I am told (I don't read Korean myself) that people often leave out the white space between eojeol that are made up entirely of Hangul letters (Korean letters). Are you just defining these very long things to be single eojeol? Prof Kang in his own work has a module that splits these using some rules.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1533264 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1533264

LUCENE-4956: Remove StringEscapeUtils by unescaping the mapHanja.dic

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi SooMyung Lee,

thanks for the clarification. It was not Jack Krupansky that mentioned the GPL violation, it was Robert and me. I am glad that you are aware of this and you are trying to clarify this. Indeed the License of this file is hard to find out, because the Gnutella one (which is the original) has no license header. But the whole Gnutella project is GPL licensed. Those people also started to donate this code to Google Guava and wanted to relicense to ASF2, but this is not yet done. So we cannot use this code. The missing License header may be the reason for the Blackduck test to be happy.

@cmoen offered to donate a PatriciaTrie he wrote himself. Maybe we can replace the gnutella one by this one. I would prefer the solution to not use a trie at all. Instead we should use Lucene's FST feature and bundle the whole dictionary as a serialized FST (like kuromoji does).

About the other copy-pasted code: I already removed all commons-io and commons-lang stuff. Commons-io was completely unneeded, because the resource handling to load resources from JAR files was not very good and can be done much more easily with a simple Class#getResourceAsStream. I already implemented that and moved some classes around, so be sure to update your svn before working more on the module.

I also removed the \u-escaping from the mapHanja.dic file, so I was able to remove the StringEscapeUtil class, which did too much unescaping (not only \u, also \n, \t,...)! But we should really check the license of this file or create a new one from Unicode tables. I left the file in SVN (converted to plain UTF-8) for now.

I am currently working on rewriting some code that creates too many small objects like strings all the time, because this slows down indexing! E.g. HanjaUtils should not use a String just to look up a single char in a map. There are better data structures to hold the mapHanja table.

We should also not use readLines() to load all dictionaries into heap, then use an iterator over it and convert them to something else. We should use a BufferedReader and read it line by line and do the processing directly.
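A sketch of the streaming approach, assuming one dictionary entry per line and that blank lines can be skipped; the real file format may differ. The class name `DictionaryLines` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

final class DictionaryLines {
    private DictionaryLines() {}

    /**
     * Reads entries one line at a time and adds each to the target set directly,
     * so no intermediate list holding every line is ever built.
     */
    static Set<String> load(Reader source) throws IOException {
        Set<String> entries = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                if (!entry.isEmpty()) { // assumption: blank lines carry no entry
                    entries.add(entry);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a reader over the real dictionary resource.
        System.out.println(load(new StringReader("alpha\n\nbeta\n")).size());
    }
}
```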

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The rest of the files were created by myself during development, except for mapHanja.dic. I added this file two years ago. I'm not sure that this file is free of legal problems, because much of its data came from project results, so it is better to remove that data.

Do you have some documentation who gave the file to you or where you downloaded it? Some university? Some CD-ROM?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe we can reconstitute this file from other hanja-hangul mappings with clear licenses?

I have not done any processing, I will investigate sources such as https://code.google.com/p/google-input-tools/source/browse/src/chrome/os/nacl-hangul/misc/symbol.txt and unihan and see what it looks like.

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1533277 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1533277

LUCENE-4956: Remove lazy dictionary loading, don't convert to string all the time. This may be improved further if we use an array and substract the smallest codepoint value
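The array idea mentioned in this commit message could look roughly like the sketch below. It assumes, for simplicity, that each code point maps to a single `char`; the real mapHanja table may map to several readings. `OffsetCharTable` is a hypothetical name.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class OffsetCharTable {
    private final int base;      // smallest mapped code point
    private final char[] values; // values[codePoint - base]; 0 means "no mapping"

    OffsetCharTable(Map<Integer, Character> mapping) {
        if (mapping.isEmpty()) {
            base = 0;
            values = new char[0];
        } else {
            int min = Collections.min(mapping.keySet());
            int max = Collections.max(mapping.keySet());
            base = min;
            values = new char[max - min + 1];
            for (Map.Entry<Integer, Character> e : mapping.entrySet()) {
                values[e.getKey() - min] = e.getValue();
            }
        }
    }

    /** Returns the mapped char as an int, or -1 if the code point has no mapping. */
    int lookup(int codePoint) {
        int index = codePoint - base;
        if (index < 0 || index >= values.length) {
            return -1;
        }
        char value = values[index];
        return value == 0 ? -1 : value;
    }

    public static void main(String[] args) {
        Map<Integer, Character> mapping = new HashMap<>();
        mapping.put(0x97D3, '\uD55C'); // example pair only
        OffsetCharTable table = new OffsetCharTable(mapping);
        System.out.println(Integer.toHexString(table.lookup(0x97D3)));
    }
}
```

A lookup is then a subtraction and an array read, with no boxing and no String allocation. The trade-off is that the array spans the whole range between the smallest and largest key, including unmapped gaps.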

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1533278 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1533278

LUCENE-4956: More improvements

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1533282 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1533282

LUCENE-4956: Remove thread-unsafe lazy loading. Initialize in static ctor
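As a sketch of what this commit message describes (not the committed code itself): a static initializer runs exactly once, under the JVM's class-initialization lock, so no explicit synchronization is needed. The class name `EagerDictionary`, the placeholder entry, and the load counter are illustrative only.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

final class EagerDictionary {
    private EagerDictionary() {}

    // Counts loader runs; declared first so it exists when the static block executes.
    private static final AtomicInteger LOADS = new AtomicInteger();

    private static final Set<String> ENTRIES;

    static {
        // The JVM guarantees this block runs once, even if many threads
        // touch the class at the same time.
        Set<String> loaded = new HashSet<>();
        loaded.add("ko"); // placeholder for entries read from a dictionary resource
        LOADS.incrementAndGet();
        ENTRIES = Collections.unmodifiableSet(loaded);
    }

    static boolean contains(String entry) {
        return ENTRIES.contains(entry);
    }

    static int loadCount() {
        return LOADS.get();
    }
}
```

By contrast, a lazy `if (entries == null) entries = load();` check without synchronization can run the loader several times, or expose a partially built collection, when two threads arrive at once.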

asfimport commented 10 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1533286 from @uschindler in branch 'dev/branches/lucene4956' https://svn.apache.org/r1533286

LUCENE-4956: Remove thread-unsafe lazy loading. Initialize in static ctor