Add example settings to Korean analyzer components' javadocs [LUCENE-8453]

asfimport commented 6 years ago

Korean analyzer (nori) javadoc needs example schema settings.

I'll create a patch.

Migrated from LUCENE-8453 by Tomoko Uchida (@mocobeta), resolved Aug 11 2018

Linked issues:

SOLR-12255

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Created a PR. It is based on Kuromoji's examples.

https://github.com/apache/lucene-solr/pull/434

Note: I've tested all parameters in these example schemas with CustomAnalyzer, but have not tested them with Solr yet. Please check the XML settings with Solr.

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Also, I think it would be better if native Korean speakers checked that the example values are good defaults :)

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I added the solr schema fragment to the solr issue. Works for me: SOLR-12655

Your example is missing lowercasing (which the analyzer does), so that western text is correctly normalized.

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The full schema snippet that is identical to the default KoreanAnalyzer as shipped in Lucene:

<fieldType name="text_ko" class="solr.TextField" >
  <analyzer>
    <!-- decompoundMode: mixed (keeps the original term and adds all decompounded terms), discard (default; removes the compound form and keeps only the parts), none (no decompounding) -->
    <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard" outputUnknownUnigrams="false"/>
    <!-- removes some part of speech stuff like EOMI (Pos.E) -->
    <filter class="solr.KoreanPartOfSpeechStopFilterFactory" />
    <!-- Replaces term text with the Hangul transcription of Hanja characters, if applicable: -->
    <filter class="solr.KoreanReadingFormFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

KoreanAnalyzer leaves out some parameters (for example, KoreanTokenizerFactory has the additional parameters "userDictionary" and "userDictionaryEncoding"). I think Javadoc examples should include all available parameters, so my example settings include every parameter accepted by the TokenizerFactory/TokenFilterFactory classes.

As for LowerCaseFilterFactory, of course it is needed in complete Analyzer settings, but I "feel" a Javadoc example should focus on the targeted component only (like the Kuromoji example settings below).

https://lucene.apache.org/core/7_4_0/analyzers-kuromoji/org/apache/lucene/analysis/ja/JapanesePartOfSpeechStopFilterFactory.html

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

So here is my proposal for the javadoc's example settings (my pull request) :)

For KoreanTokenizerFactory:

<fieldType name="text_ko" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KoreanTokenizerFactory"
               decompoundMode="discard"
               userDictionary="user.txt"
               userDictionaryEncoding="UTF-8"
               outputUnknownUnigrams="false"/>
  </analyzer>
</fieldType>

For KoreanPartOfSpeechStopFilterFactory:

<fieldType name="text_ko" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KoreanTokenizerFactory"/>
    <filter class="solr.KoreanPartOfSpeechStopFilterFactory"
            tags="E,J"/>
  </analyzer>
</fieldType>

For KoreanReadingFormFilterFactory:

<fieldType name="text_ko" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KoreanTokenizerFactory"/>
    <filter class="solr.KoreanReadingFormFilterFactory"/>
  </analyzer>
</fieldType>

Update: I added brief descriptions for each parameter (please see the pull request), though unfortunately Kuromoji's documentation lacks those.

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Slightly off topic, so feel free to ignore this, but I think the Solr example settings should be removed from the TokenizerFactory/TokenFilterFactory/CharFilterFactory documentation. I suppose there are historical reasons for them, so I followed the convention, but it is not reasonable to add Solr schema examples here. What each Factory's documentation needs is parameter descriptions, not XML schema examples.

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

I've tested those three settings with Solr 7.4.0, and they work for me. (I copied lucene-analyzers-nori-7.4.0.jar and the user dictionary file from the Lucene distribution package to the Solr lib directory.)

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

I think this pull request is almost ready to merge. Could anyone take care of this? I believe documentation for analyzer components is very important and a good starting point for newbies. :)

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

+1. I will merge it soon!

Slightly off topic, so feel free to ignore this, but I think the Solr example settings should be removed from the TokenizerFactory/TokenFilterFactory/CharFilterFactory documentation. I suppose there are historical reasons for them, so I followed the convention, but it is not reasonable to add Solr schema examples here. What each Factory's documentation needs is parameter descriptions, not XML schema examples.

There is an issue open already (I think, can't find it now). I agree, the XML snippets should go away. Instead we can add some Javadoc tag for this like `@factoryProp name description`. This is much better. We should also document the SPI name of each factory.
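As a rough illustration of the proposal above, the sketch below shows how a factory's Javadoc might read with one tag per parameter. Both the `@factoryProp` tag and the class `DocumentedKoreanTokenizerFactory` are hypothetical stand-ins, not existing Lucene code. The parameter names are the ones from the example settings in this issue, the descriptions are illustrative wording, and the `NAME` constant is an assumed way of exposing the SPI name.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical stand-in for a tokenizer factory. It is NOT the real
 * KoreanTokenizerFactory; it only illustrates the proposed documentation style.
 *
 * <p>SPI name: {@code korean} (assumed here for illustration).
 *
 * @factoryProp decompoundMode how compounds are handled: mixed, discard (default) or none
 * @factoryProp userDictionary path to a user dictionary file
 * @factoryProp userDictionaryEncoding character encoding of the user dictionary
 * @factoryProp outputUnknownUnigrams whether unigrams are output for unknown words
 */
public class DocumentedKoreanTokenizerFactory {

  /** SPI name documented next to the parameters (assumption, see lead-in). */
  public static final String NAME = "korean";

  /** Parameter names taken from the example settings in this issue. */
  public static final List<String> PARAMETERS = Arrays.asList(
      "decompoundMode",
      "userDictionary",
      "userDictionaryEncoding",
      "outputUnknownUnigrams");

  public static void main(String[] args) {
    // Print the SPI name followed by each documented parameter name.
    System.out.println("SPI name: " + NAME);
    for (String parameter : PARAMETERS) {
      System.out.println("  " + parameter);
    }
  }
}
```

The standard javadoc tool can be told to render a custom block tag through its `-tag` option, for example `-tag factoryProp:t:"Factory properties:"`. Whether the project would go that way is left open in this discussion.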

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

Thank you @uschindler !

and, thanks for your explanation.

There is an issue open already (I think, can't find it now). I agree, the XML snippets should go away. Instead we can add some Javadoc tag for this like @factoryProp name description. This is much better. We should also document the SPI name of each factory.

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

No problem. I merged already. Just running document-linter to verify correctness of Javadocs.

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Another idea: To make the properties of all analyzers easily available for inspection by the APIs in Solr, we may add runtime annotations to those classes, describing the properties. Just an idea.
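A minimal sketch of what such runtime annotations could look like, using only the JDK. Everything here is an assumption for illustration: the `@FactoryProperty` annotation, its container `@FactoryProperties`, the stand-in class `ExampleKoreanTokenizerFactory`, and the description strings are hypothetical and not part of Lucene or Solr.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Repeatable;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.LinkedHashMap;
import java.util.Map;

public class FactoryPropertyDemo {

  /** Hypothetical annotation describing one factory parameter (not a Lucene API). */
  @Retention(RetentionPolicy.RUNTIME) // kept in the class file and visible to reflection
  @Target(ElementType.TYPE)
  @Repeatable(FactoryProperties.class)
  public @interface FactoryProperty {
    String name();
    String description();
    String defaultValue() default "";
  }

  /** Container annotation required by @Repeatable. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.TYPE)
  public @interface FactoryProperties {
    FactoryProperty[] value();
  }

  /** Stand-in for a real factory class; descriptions are illustrative only. */
  @FactoryProperty(
      name = "decompoundMode",
      description = "mixed, discard or none",
      defaultValue = "discard")
  @FactoryProperty(
      name = "outputUnknownUnigrams",
      description = "output unigrams for unknown words",
      defaultValue = "false")
  public static class ExampleKoreanTokenizerFactory {}

  /** Collects parameter name to description for any annotated class, in declaration order. */
  public static Map<String, String> describe(Class<?> factoryClass) {
    Map<String, String> properties = new LinkedHashMap<>();
    for (FactoryProperty p : factoryClass.getAnnotationsByType(FactoryProperty.class)) {
      properties.put(p.name(), p.description());
    }
    return properties;
  }

  public static void main(String[] args) {
    // An inspection API could expose this map instead of printing it.
    describe(ExampleKoreanTokenizerFactory.class)
        .forEach((name, description) -> System.out.println(name + ": " + description));
  }
}
```

Because the retention is `RUNTIME`, the same metadata could in principle feed both generated documentation and an inspection API, which is the appeal of the idea over Javadoc-only tags.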

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit e9addea0871a28517c5202e9d12969719d20c90e in lucene-solr's branch refs/heads/master from @uschindler https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e9addea

Merge branch 'jira/lucene-8453' of https://github.com/mocobeta/lucene-solr-mirror

LUCENE-8453: Add documentation to analysis factories of Korean (Nori) analyzer module

This closes #434

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit d8ecf976124eb519e1f8c66e6749e246976a95d9 in lucene-solr's branch refs/heads/branch_7x from @uschindler https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d8ecf97

Merge branch 'jira/lucene-8453' of https://github.com/mocobeta/lucene-solr-mirror

LUCENE-8453: Add documentation to analysis factories of Korean (Nori) analyzer module

This closes #434

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks @Tomoko Uchida!

asfimport commented 6 years ago

Tomoko Uchida (@mocobeta) (migrated from JIRA)

It may not be good manners to add comments to a closed issue, but I'd like to leave a reminder for myself.

Another idea: To make the properties of all analyzers easily available for inspection by the APIs in Solr, we may add runtime annotations to those classes, describing the properties. Just an idea.

I like the idea. It would be nice if the {Tokenizer|CharFilter|TokenFilter}Factory classes had some kind of property management/discovery mechanism (I have no concrete implementation in mind, just a vague concept).

It would be handy for documentation and Solr, and also for CustomAnalyzer (I sometimes use it for my NLP projects).

I'll try it, though not soon: only after I have finished my current ongoing projects.

Add example settings to Korean analyzer components' javadocs [LUCENE-8453] #9499