apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

the korean analyzer that has a korean morphological analyzer and dictionaries [LUCENE-4956] #6020

Open asfimport opened 11 years ago

asfimport commented 11 years ago

The Korean language has specific characteristics. When developing a search service with Lucene and Solr in Korean, there are some problems in searching and indexing. The Korean analyzer solves these problems with a Korean morphological analyzer. It consists of a Korean morphological analyzer, dictionaries, a Korean tokenizer, and a Korean filter. The Korean analyzer is made for Lucene and Solr. If you develop a search service with Lucene in Korean, choosing the Korean analyzer is the best option.


Migrated from LUCENE-4956 by SooMyung Lee, 4 votes, updated Feb 09 2014 Attachments: eval.patch, kr.analyzer.4x.tar, lucene4956.patch, lucene-4956.patch, LUCENE-4956.patch

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

8edffacb15b3964f25054c82c0d4ea92

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks again, SooMyung!

I'm seeing that Steven has informed you about the grant process on the mailing list. I'm happy to also facilitate this process with Steven.

Looking forward to getting Korean supported.

asfimport commented 11 years ago

soomyung (migrated from JIRA)

Thanks for your help and your great concern, Christian!

I visited your website. I noticed that you are not Japanese, yet you developed a Japanese morphological analyzer.

How is that possible? I'm amazed by your work.

asfimport commented 11 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

That's because Christian has ninja superpowers. http://goo.gl/5EPMr

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

The IP clearance form for this donation is here: http://incubator.apache.org/ip-clearance/lucene-korean-analyzer.html. I don't have karma to rebuild the website after I commit changes to the XML source, so there will be delays of a day or so between updates and those updates' appearance on the website.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I think this donation should be packaged in its own jar, similarly to kuromoji, smartcn, morfologik and stempel, and so should end up at lucene/analysis/korean/.

soomyung, do you have a good name for the analysis module this will become, rather than "korean"? I'd prefer a name that would allow us to add more Korean analysis modules in the future without having to rename this one.

The Lucene PMC received notification today that SooMyung's code grant and ICLA paperwork have been received and recorded.

@cmoen, now that we have SooMyung's code grant and ICLA recorded, we can start making header modifications. I suggest we create a branch off trunk, create the new module there, check in the files from the tarball attached here, commit, iterate on headers/licensing, and finally hook the new module into the build.

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi Steve, I think "arirang" is the best name for the Korean analysis modules. "Arirang" is the name of a traditional Korean song, so I think it can represent the Korean analysis modules well.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Hi Steve, I think "arirang" is the best name for the Korean analysis modules. "Arirang" is the name of a traditional Korean song, so I think it can represent the Korean analysis modules well.

Thanks SooMyung, "arirang" it is.

asfimport commented 11 years ago

Jack Krupansky (migrated from JIRA)

As a user trying to browse and find analyzers and tokenizers for specific languages, I object. I mean, I should be able to look at the language code and guess what module it might be in. It's one thing if the module name is reasonably general and there is a reasonable expectation that average users would readily associate it with specific languages, or to categorically group languages, but just giving an artificial, non-obvious name to the module that would not be obvious to an average user seems like a poor choice, to me.

Even if you just called the module "korean", at least that would be a helpful guide to people like me browsing the list of modules, and then the package name can distinguish the implementations for that language.

Also, it should be possible to mix multiple implementations for the same language in the same application, so, the package name does not to have some unique name, unless there is guaranteed to be only one implementation for that language.

I would suggest that there should be two choices for language-based analysis modules:

  1. Category name, where there is some general approach that covers a number of languages and needs to share classes.
  2. Language code, hyphen, some arbitrary name for implementations that cover only a single language.

Even for #1, I would suggest that there should be a prefix that covers the "type" of languages covered (eastern european, asian, etc.)

That said, I would not stand in the way of adding Korean analysis as soon as possible. I mean, this contribution shouldn't have to correct all of the sins of past contributions.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Jack, I think documentation can address most of your concerns. See e.g. the descriptions for the analyzer packages in the API javadocs section of the top-level per-release docs: http://lucene.apache.org/core/4_2_1/index.html. Fortunately, a module's name is not the only opportunity to describe its functionality.

Even if you just called the module "korean", at least that would be a helpful guide to people like me browsing the list of modules, and then the package name can distinguish the implementations for that language.

-1. The stempel and morfologik analysis modules are both Polish analyzers - if the first one had been named "polish", what would we have done with the second one?

Also, it should be possible to mix multiple implementations for the same language in the same application, so, the package name does not to have some unique name, unless there is guaranteed to be only one implementation for that language.

I agree that mixing same-language implementations should be possible in the same application. I have no idea what you're saying after that. Maybe an example?

asfimport commented 11 years ago

Jack Krupansky (migrated from JIRA)

The stempel and morfologik analysis modules are both Polish analyzers - if the first one had been named "polish", what would we have done with the second one?

That's exactly what I was talking about.

We have four distinct concepts:

  1. Module name.
  2. Package name.
  3. Source tree path.
  4. Module jar name.

They should incorporate both the language code and the "implementation name" (e.g., "stempel" or "morfologik").

The module should be something like "analysis/pl/stempel" or "analysis/stempel/pl". I prefer the former - it says that the first priority is to organize by language, and secondarily by implementation.

And the package name should be something like "org.apache.lucene.analysis.pl.stempel" or "org.apache.lucene.analysis.stempel.pl". I prefer the former, for the same rationale as for module name.

There seems to be a third form of name, "analyzer-xxx". But as far as I can tell it is only an artifact of the doc or make some old Lucene thing.

And then there are the partial names for the individual jar files. There seem to be both "lucene-analyzers-stempel-x.y.z" and "lucene-analyzers-morfologik-x.y.z" in contrib/lucene-libs and then multiple "morfologik-a.b.c" jars in contrib.lib.

In short, to answer your question more directly, in my ideal world we would have source tree and package names like:

lucene/analysis/pl/stempel/src
lucene/analysis/pl/morfologik/src
lucene/analysis/ko/arirang/src

org.apache.lucene.analysis.pl.stempel
org.apache.lucene.analysis.pl.morfologik
org.apache.lucene.analysis.ko.arirang

This would allow multiple implementations for a single language in the same application.
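
As a rough illustration of the scheme proposed above, here is a minimal Java sketch that derives the source tree path and the package name from a language code and an implementation name. The class and method names (AnalysisModuleNaming, sourcePath, packageName) are hypothetical and exist only for this example; nothing like this is part of the Lucene build.

```java
public class AnalysisModuleNaming {
    // Source tree path: language code first, then implementation name.
    static String sourcePath(String languageCode, String implementation) {
        return "lucene/analysis/" + languageCode + "/" + implementation + "/src";
    }

    // Package name following the same language-then-implementation order.
    static String packageName(String languageCode, String implementation) {
        return "org.apache.lucene.analysis." + languageCode + "." + implementation;
    }

    public static void main(String[] args) {
        // The three examples listed in the comment above.
        String[][] modules = { {"pl", "stempel"}, {"pl", "morfologik"}, {"ko", "arirang"} };
        for (String[] m : modules) {
            System.out.println(sourcePath(m[0], m[1]) + " -> " + packageName(m[0], m[1]));
        }
    }
}
```

Because the language code comes before the implementation name, two Polish implementations end up in distinct packages and can coexist on one classpath.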

Although I could see reversing the language and implementation names if there is some need to share implementation code across languages.

asfimport commented 11 years ago

Walter Underwood (@wrunderwood) (migrated from JIRA)

Yes, including the ISO language code in the naming would be a very good idea. You still get into odd situations like Bokmål and Nynorsk, but you are still way ahead.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

The Korean analyzer should be named org.apache.lucene.analysis.kr.KoreanAnalyzer, and we'll provide a ready-to-use field type text_kr in schema.xml for Solr users, which is consistent with what we do for other languages.

As for where the analyzer code itself lives, I think it's fine to put it in lucene/analysis/arirang. The file lucene/analysis/README.txt documents what these modules are and the code is easily and directly retrievable in IDEs by looking up KoreanAnalyzer (the source code paths will be set up by ant eclipse and ant idea).

One reason analyzers have not been put in lucene/analysis/common in the past is that they require dictionaries that are several megabytes in size.

Overall, I don't think the scheme we are using is all that problematic, but it's true that MorfologikAnalyzer and SmartChineseAnalyzer don't align with it. The scheme doesn't easily lend itself to different implementations for one language, but that's not a common case today although it might become more common in the future.

In the case of Norwegian (no), there are ISO language codes for both Bokmål (nb) and Nynorsk (nn), and one way of supporting this is also to consider these as options to NorwegianAnalyzer since both languages are Norwegian. See SOLR-4565 for thoughts on how to extend support in NorwegianMinimalStemFilter for this.

A similar overall approach might make sense when there are multiple implementations for a language; end-users can use an analyzer named <Language>Analyzer without having to study the differences in implementation before using it. I also see problems with this, but it's just a thought...

I'm all for improving our scheme, but perhaps we can open up a separate JIRA for this and keep this one focused on Korean?

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

As for where the analyzer code itself lives, I think it's fine to put it in lucene/analysis/arirang.

+1

I'm all for improving our scheme, but perhaps we can open up a separate JIRA for this and keep this one focused on Korean?

+1

asfimport commented 11 years ago

Jack Krupansky (migrated from JIRA)

Looking at the actual tar file, I notice that it has the factory classes placed in "solr" directories rather than in the lucene directories as factories are normally organized.

By all means proceed with producing a normal patch that shows the final organization of this new analysis package.

Some other issues:

  1. Complete absence of Javadoc for the tokenizer factory and token filter factory classes - it is not "Solr user-ready" at present. There should be an XML example of a token filter with the parameters, as is the usual practice in Lucene/Solr.

  2. No Apache license headers in the "Solr" code. I thought this stuff was already supposed to be ASL 2.0?

  3. No Solr schema.xml change to add the text_ko field type.

  4. At least the KoreanAnalyzer.java and KoreanTokenizer.java source code have tab characters - odd format. Need to be normalized for Lucene project conventions.

  5. There is a hardwired stop word list in KoreanAnalyzer that appears to be nearly identical or close to StopAnalyzer.ENGLISH_STOP_WORDS_SET. Why doesn't that static code copy the StopAnalyzer list and then add the few extra terms that are needed? If there is a reason, place it in a comment.

But as I said, by all means proceed to a normal patch file now that the tar contribution is "legal".
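
To make item 5 concrete, here is a minimal sketch of "copy the existing list, then add the few extra terms". It uses plain java.util collections so that it runs on its own: ENGLISH_STOP_WORDS is an abbreviated stand-in for StopAnalyzer.ENGLISH_STOP_WORDS_SET, and the extra terms ("extra1", "extra2") are placeholders, since I have not listed the actual differences.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class StopWordsSketch {
    // Stand-in for StopAnalyzer.ENGLISH_STOP_WORDS_SET (abbreviated; an assumption for this example).
    static final Set<String> ENGLISH_STOP_WORDS =
        Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList("a", "an", "and", "the", "to")));

    // Placeholder extra terms; the real additions would come from KoreanAnalyzer.
    static final Set<String> KOREAN_ANALYZER_STOP_WORDS = buildStopWords("extra1", "extra2");

    // Copies the shared base list and appends only the analyzer-specific terms.
    static Set<String> buildStopWords(String... extraTerms) {
        Set<String> words = new LinkedHashSet<>(ENGLISH_STOP_WORDS);
        words.addAll(Arrays.asList(extraTerms));
        return Collections.unmodifiableSet(words);
    }

    public static void main(String[] args) {
        System.out.println(KOREAN_ANALYZER_STOP_WORDS);
    }
}
```

Written this way, the code itself documents which terms differ from the shared list, which is the point of the comment I asked for.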

asfimport commented 11 years ago

Edward J. Yoon (migrated from JIRA)

I think this would be a valuable addition to Apache Lucene (P.S., I'm Korean, as you may know).

It would be nice if you could remove all the Korean comments, strings, and author tags from the source code to avoid compile and install problems. Otherwise, the SVN server/client settings, the build script's encoding options, etc. will be somewhat tricky. For example,

if(entry!=null&&!("을".equals(end)&&entry.getFeature(WordEntry.IDX_REGURA)==IrregularUtil.IRR_TYPE_LIUL)) {

and, 

/**
 * 복합명사의 개별단어에 대한 정보를 담고있는 클래스 
 * @author S.M.Lee
 *
 */

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi Steve, what should I do in the present situation? Do I need to correct all the issues and submit a new tarball? Please let me know what I have to do to move forward!

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

SooMyung, I don't think you need to do anything at this point. I think a good next step is that we create a new branch and check the code you have submitted onto that branch. We can then start looking into addressing the headers and other items that people have pointed out in comments. (Thanks, Jack and Edward!)

Steve, will there be a vote after the code has been checked onto the branch? If you think the above is a good next step, I'm happy to start working on this either later this week or next week. Kindly let me know how you prefer to proceed. Thanks.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Steve, will there be a vote after the code has been checked onto the branch?

Christian, before the VOTE on incubator-general can be called, the file header and licensing issues need to be completely addressed and vetted by us, working with SooMyung to make sure we get everything right.

If you think the above is a good next step, I'm happy to start working on this either later this week or next week.

+1. Thanks for working on this!

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

A quick status update on my side is as follows:

I've put the code into a module called arirang on my local setup and made a few changes necessary to make things work on trunk. KoreanAnalyzer now produces Korean tokens and some tests I've made pass when run from my IDE.

Loading the dictionaries as resources needs some work and I'll spend time on this during the weekend. I'll also address the headers, etc. to prepare for the incubator-general vote.

Hopefully, I'll have all this on a branch this weekend. I'll keep you posted and we can take things from there.

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[lucene4956 commit] cm http://svn.apache.org/viewvc?view=revision&revision=1479228

Branch to work on Korean (LUCENE-4956)

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Hello SooMyung,

Could you comment about the origins and authorship of org.apache.lucene.analysis.kr.utils.StringUtil in your tar file?

I'm seeing a lot of authors in this file. Is this from Apache Commons Lang? Thanks!

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I've created branch lucene4956 and checked in an arirang module in lucene/analysis. I've added a basic test that tests segmentation, offsets, etc.

Other updates:

My next step is to fix the compilation-related warnings altogether, and once we have confirmed StringUtils, I think we can do the incubator-general vote. I'll keep you posted.

I think we should also consider rewriting and optimising some of the code here and there, but that's for later. It would be great if you could be involved in this process, SooMyung! I'll probably need your help and good advice here and there. :)

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[lucene4956 commit] sarowe http://svn.apache.org/viewvc?view=revision&revision=1479239

LUCENE-4956: add IntelliJ test run config for Arirang; add Maven config for Arirang

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I've created branch lucene4956 and checked in an arirang module in lucene/analysis. I've added a basic test that tests segmentation, offsets, etc.

Cool!

License headers have been added to all source code files

I can see one that doesn't have one: TestKoreanAnalyzer.java. I'll take a pass over all the files.

Eclipse is TODO.

I ran ant eclipse and it seemed to do the right thing already - I can see Arirang entries in the .classpath file that gets produced - I don't think there's anything to be done. I don't use Eclipse, though, so I can't be sure.

I added Maven config and an IntelliJ Arirang module test run configuration.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks, Steve. I've added the missing license header to TestKoreanAnalyzer.java.

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I have seen that the Tokenizer also uses JFlex, but an older version than the one used for Lucene's other tokenizers (like StandardTokenizer). Can we add the ANT tasks, like we have for StandardTokenizer, to regenerate the source file from build.xml? Finally, we should regenerate the Java files with the JFlex trunk version and compare them with the ones committed here (to see if there are differences).

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Good points, Uwe. I'll look into this.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Thanks, Steve. I've added the missing license header to TestKoreanAnalyzer.java.

I looked over the rest of the files, and the only things missing license headers are the dictionary files and the korean.properties file, all under src/resources/. I committed a license header to korean.properties.

I tried adding '#'-commented-out headers to the .dic files (a couple of them already have '######' and '//######' lines), but that triggered a test failure, so more work will need to be done to make the license headers inline in the dictionary files.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Could you comment about the origins and authorship of org.apache.lucene.analysis.kr.utils.StringUtil in your tar file? I'm seeing a lot of authors in this file. Is this from Apache Commons Lang? Thanks!

I looked at the file content, and it's definitely from Apache Commons Lang (the class is named StringUtils there, renamed StringUtil here), circa early 2010, maybe with a little pulled in from another Commons Lang class.

I've eliminated StringUtil - it's almost all calls to StringUtils.split(String, separators) - its javadoc is:

/**
 * <p>Splits the provided text into an array, separators specified.
 * This is an alternative to using StringTokenizer.</p>
 *
 * <p>The separator is not included in the returned String array.
 * Adjacent separators are treated as one separator.
 * For more control over the split use the StrTokenizer class.</p>
 *
 * <p>A <code>null</code> input String returns <code>null</code>.
 * A <code>null</code> separatorChars splits on whitespace.</p>
 *
 * <pre>
 * StringUtil.split(null, *)         = null
 * StringUtil.split("", *)           = []
 * StringUtil.split("abc def", null) = ["abc", "def"]
 * StringUtil.split("abc def", " ")  = ["abc", "def"]
 * StringUtil.split("abc  def", " ") = ["abc", "def"]
 * StringUtil.split("ab:cd:ef", ":") = ["ab", "cd", "ef"]
 * </pre>
 *
 * @param str  the String to parse, may be null
 * @param separatorChars  the characters used as the delimiters,
 *  <code>null</code> splits on whitespace
 * @return an array of parsed Strings, <code>null</code> if null String input
 */

I'm replacing calls to this method with calls to String.split(regex), where regex is "[char]+", and char is the (in all cases singular) split character.

I'll commit the changes and the StringUtil.java removal in a little bit once I've got it compiling and the tests succeed.
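
For reference, a minimal sketch of the replacement pattern, using only java.lang.String. The helper name splitOn is hypothetical and is not what gets committed. Note that String.split does not behave identically to the Commons Lang method at the edges: a leading separator produces a leading empty string, an empty input produces one empty string rather than an empty array, and a null input throws instead of returning null.

```java
import java.util.Arrays;

public class SplitSketch {
    // Builds the "[char]+" regex described above and splits on it.
    // Assumes the separator has no special meaning inside a character class
    // (so not ']', '^', '-' or a backslash); that holds for ':' and ' '.
    static String[] splitOn(String text, char separator) {
        return text.split("[" + separator + "]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOn("ab:cd:ef", ':')));  // [ab, cd, ef]
        System.out.println(Arrays.toString(splitOn("abc  def", ' ')));  // [abc, def]
        // A leading separator yields a leading empty string.
        System.out.println(splitOn(":ab", ':').length);                 // 2
        // An empty input yields a single empty string.
        System.out.println(splitOn("", ':').length);                    // 1
    }
}
```

As long as the inputs are never null, never empty, and never start with the separator, the two approaches should give the same result.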

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[lucene4956 commit] sarowe http://svn.apache.org/viewvc?view=revision&revision=1479362

LUCENE-4956: Remove o.a.l.analysis.kr.utils.StringUtil and all calls to it (mostly StringUtil.split, replaced with String.split)

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

This looks like a typo to me, in KoreanEnv.java - the second FILE_DICTIONARY should instead be FILE_EXTENSION:

/**
 * Initialize the default property values.
 */
private void initDefaultProperties() {
  defaults = new Properties();

  defaults.setProperty(FILE_SYLLABLE_FEATURE,"org/apache/lucene/analysis/kr/dic/syllable.dic");
  defaults.setProperty(FILE_DICTIONARY,"org/apache/lucene/analysis/kr/dic/dictionary.dic");
  defaults.setProperty(FILE_DICTIONARY,"org/apache/lucene/analysis/kr/dic/extension.dic");

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[lucene4956 commit] sarowe http://svn.apache.org/viewvc?view=revision&revision=1479386

LUCENE-4956: fix typo
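
For context on why the duplicated key matters: java.util.Properties keeps one value per key, so the second setProperty call with FILE_DICTIONARY replaces the dictionary.dic path and FILE_EXTENSION never gets a default. The sketch below shows this in isolation; the key strings are hypothetical stand-ins, since the real constants are defined in KoreanEnv.java.

```java
import java.util.Properties;

public class DuplicateKeySketch {
    // Hypothetical key values; the real constants live in KoreanEnv.java.
    static final String FILE_DICTIONARY = "dictionary.file";
    static final String FILE_EXTENSION = "extension.file";

    // Mirrors the typo: the same key is set twice.
    static Properties defaultsWithTypo() {
        Properties defaults = new Properties();
        defaults.setProperty(FILE_DICTIONARY, "org/apache/lucene/analysis/kr/dic/dictionary.dic");
        // Same key again: this replaces the dictionary.dic entry above.
        defaults.setProperty(FILE_DICTIONARY, "org/apache/lucene/analysis/kr/dic/extension.dic");
        return defaults;
    }

    // The corrected version uses a distinct key for the extension dictionary.
    static Properties defaultsFixed() {
        Properties defaults = new Properties();
        defaults.setProperty(FILE_DICTIONARY, "org/apache/lucene/analysis/kr/dic/dictionary.dic");
        defaults.setProperty(FILE_EXTENSION, "org/apache/lucene/analysis/kr/dic/extension.dic");
        return defaults;
    }

    public static void main(String[] args) {
        System.out.println(defaultsWithTypo().getProperty(FILE_DICTIONARY));
        System.out.println(defaultsWithTypo().getProperty(FILE_EXTENSION));
        System.out.println(defaultsFixed().getProperty(FILE_EXTENSION));
    }
}
```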

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[lucene4956 commit] sarowe http://svn.apache.org/viewvc?view=revision&revision=1479391

LUCENE-4956: Add license headers to dictionary files, and modify FileUtil.readLines() to ignore lines beginning with comment char '!'
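
A minimal sketch of the idea behind this change, assuming a line-oriented reader that drops any line starting with '!'. This is not the committed FileUtil.readLines() code; the class name, the method signature, and the sample dictionary entries are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class CommentSkippingReader {
    // Reads all lines, skipping those that begin with the comment character '!'.
    static List<String> readLines(Reader input) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("!")) {
                    continue; // license header or other comment line
                }
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up dictionary content with a commented-out header.
        String dictionary = "! Licensed to the Apache Software Foundation (ASF)\n"
            + "! under one or more contributor license agreements.\n"
            + "entry1,100\n"
            + "entry2,200\n";
        System.out.println(readLines(new StringReader(dictionary)));
    }
}
```

With the header lines filtered out at read time, the dictionary parsing code sees the same entries as before the license headers were added.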

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I added license headers to the dictionary files, so AFAICT all files now have Apache License headers.

I've updated http://incubator.apache.org/ip-clearance/lucene-korean-analyzer.html - it looks ready to go to me. (Again, I can only control the XML version of this, at http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/lucene-korean-analyzer.xml, so it might be a day or so before the HTML version catches up.)

I think we're ready for the incubator-general vote. @cmoen, do you agree?

We don't need to wait for the vote result to continue making improvements, e.g. tabs->space, svn:eol-style->native, etc. - the vote email will point to the revision on the branch we think is vote-worthy: r1479391.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

soomyung, I don't understand the following method in WordSpaceAnalyzer.java - what's the point of the method always returning false? (i.e.: if(true) return false;):

private boolean isNounPart(String str, int jstart) throws MorphException  {

  if(true) return false;

  for(int i=jstart-1;i>=0;i--) {      
    if(DictionaryUtil.getWordExceptVerb(str.substring(i,jstart+1))!=null)
      return true;
  }

  return false;
}

isNounPart() is only called from one method in the same class: findJosaEnd(snipt,jstart):

if(DictionaryUtil.existJosa(str) && !findNounWithinStr(snipt,i,i+2) && !isNounPart(snipt,jstart)) {

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[lucene4956 commit] sarowe http://svn.apache.org/viewvc?view=revision&revision=1479410

LUCENE-4956: - svn:eol-style -> native

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

@cmoen I'm sorry that I didn't reply to your comment last weekend! I see that @sarowe solved your problem. Am I right?

@sarowe I checked the method. isNounPart() is no longer necessary. Spaces should be inserted between phrases in a Korean sentence, but many people are confused about where to insert them.

The isNounPart() method examines whether a space should be inserted at a specific position, but only when a noun that exists in the dictionary precedes it. After testing, I found that the method is superfluous. I'm sorry I did not correct the source code before contributing.

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

I think we're ready for the incubator-general vote. @cmoen, do you agree?

+1

asfimport commented 11 years ago

Jack Krupansky (migrated from JIRA)

I am not really familiar with the "incubator-general vote". From looking at the legal clearance page, it sounds like the vote is simply "accepting the donation", as opposed to voting that the branch is ready to commit to trunk, correct?

I did a Jira search and found no previous references to "incubator-general vote" - from Google search I got the impression it was more related to podlings rather than simple code module contributions.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Jack, that's correct.

It is a vote for IP clearance. For example, Simon called an IP clearance vote on the incubator list for Kuromoji before we integrated it into Lucene.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Hi Jack,

From http://incubator.apache.org/ip-clearance/, which is (quoting from that page):

Intellectual property clearance

One of the Incubator's roles is to ensure that proper attention is paid to intellectual property. From time to time, an external codebase is brought into the ASF that is not a separate incubating project but still represents a substantial contribution that was not developed within the ASF's source control system and on our public mailing lists. This is a short form of the Incubation checklist, designed to allow code to be imported with alacrity while still providing for oversight. [...] Once a PMC directly checks-in a filled-out short form, the Incubator PMC will need to approve the paper work after which point the receiving PMC is free to import the code.

The "short form" referred to above is an XML template, which I've completed for this code base, and which is at some (apparently regular?) interval converted to HTML (this is also linked from the above-linked IP clearance page as "Korean Analyzer"): http://incubator.apache.org/ip-clearance/lucene-korean-analyzer.html

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Yesterday I called a vote for this contribution on general@incubator.apache.org: http://mail-archives.apache.org/mod_mbox/incubator-general/201305.mbox/%3c7AD4D4E3-530B-41E3-8323-DA3D66A40E7E@apache.org%3e

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Updates:

Korean analysis using field type text_kr seems to be doing the right thing out-of-the-box now, but some configuration options in the factories aren't working as of now. There are several other things that need polishing up, but we're making progress.

asfimport commented 11 years ago

Edward J. Yoon (migrated from JIRA)

Great job!

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Hi Christian, Thanks for your great work.

I'd like to ask you to modify the text_kr field type definition in schema.xml as follows

    <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KoreanTokenizerFactory"/>
        <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_kr.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KoreanTokenizerFactory"/>
        <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_kr.txt"/>
      </analyzer>
    </fieldType>

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Yesterday I called a vote for this contribution on general@incubator.apache.org: [http://mail-archives.apache.org/mod_mbox/incubator-general/201305.mbox/%3c7AD4D4E3-530B-41E3-8323-DA3D66A40E7E@apache.org%3e]

This vote has passed, so we're now free to incorporate this contribution into the code base when and as we see fit.

asfimport commented 11 years ago

SooMyung Lee (migrated from JIRA)

Cool! Thanks, Steve

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Thanks, Steve & co.!

asfimport commented 11 years ago

Christian Moen (@cmoen) (migrated from JIRA)

Hello SooMyung,

Thanks for the above regarding the field type. The general approach we have taken in Lucene is to do the same analysis on both the index and query side. For example, the Japanese analyzer also has functionality to do compound splitting, and we discussed doing this on the index side only by default for field type text_ja, but we decided against it.

I've included your field type in the latest code I've checked in just now, but it's likely that we will change this in the future.

I'm wondering if you could help me with a few sample sentences that illustrate the various options KoreanFilter has. I'd like to add some test cases for these to better understand the differences between them and to verify correct behaviour. Test cases like these are also a useful way to document functionality in general. Thanks for any help with this!