latin text analysis [LUCENE-4229]

asfimport commented 12 years ago

Hi

A workmate and I played a bit with Latin text analysis and created two filters for the Solr trunk version. The first filter converts Roman numerals to numbers, e.g. 'iv' -> '4', 'v' -> '5', 'vi' -> '6'. The second filter is a stemmer for the most common suffixes.

The following schema configuration could be a use case for Latin stemming.

<fieldType name="text_latin" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="org.apache.solr.analysis.LatinNumberConvertFilterFactory" strictMode="true"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="latin_protwords.txt" />
        <filter class="org.apache.solr.analysis.LatinStemFilterFactory" />
    </analyzer>
</fieldType>

LatinNumberConvertFilterFactory has one property, "strictMode" (default is false). This boolean controls how the value is computed, because not all letter combinations are "valid" numbers. With strictMode="true" the output of "ic" is "ic"; with strictMode="false" the output of "ic" is "99".

The LatinStemFilterFactory generates two output tokens for each input token: the first stemmed as a noun and the second stemmed as a verb.

Both filters are aware of the KeywordMarkerFilterFactory.
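To illustrate the strictMode behaviour described above, here is a minimal sketch of how such a conversion might work. It is not the code from the attached patch: the class name LatinNumberSketch, the method convert, and the regular expression used to decide what counts as a "valid" number are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not taken from the patch.
final class LatinNumberSketch {

    // Assumption: "valid" means a canonical Roman numeral in the range 1..3999.
    private static final Pattern CANONICAL = Pattern.compile(
        "^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$");

    // Value of a single numeral letter, or -1 if the letter is not a numeral.
    private static int valueOf(char c) {
        switch (c) {
            case 'i': return 1;
            case 'v': return 5;
            case 'x': return 10;
            case 'l': return 50;
            case 'c': return 100;
            case 'd': return 500;
            case 'm': return 1000;
            default:  return -1;
        }
    }

    // Returns the decimal value as a string, or the token unchanged
    // when it cannot (or, in strict mode, should not) be converted.
    static String convert(String token, boolean strictMode) {
        String s = token.toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            return token;
        }
        if (strictMode && !CANONICAL.matcher(s).matches()) {
            return token; // e.g. "ic" is left untouched in strict mode
        }
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = valueOf(s.charAt(i));
            if (v < 0) {
                return token; // not a numeral at all
            }
            int next = (i + 1 < s.length()) ? valueOf(s.charAt(i + 1)) : 0;
            // Subtract a letter that precedes a larger one, otherwise add it.
            total += (v < next) ? -v : v;
        }
        return Integer.toString(total);
    }
}
```

In this sketch the lenient mode applies the subtractive rule to any letter pair, so "ic" becomes 100 - 1 = 99, while the strict mode only converts tokens that match the canonical pattern and passes everything else through unchanged.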

I have attached the svn patch for both filters. In addition, I attached two zip files that are needed by the filter tests (TestLatinNumberConvertFilter, TestLatinStemFilter). I am sorry for that, but I did not find an option to include them in the patch, if there is one.

The image latin_analysis.png is an example of the analysis done with the configuration above. For this test we used the jar file latin.analysis.jar.

Have fun with Latin text analysis. It would be great to get some feedback.

Migrated from LUCENE-4229 by Markus Klose, 2 votes

Attachments: latin_analysis.png, latin.analysis.jar, latinNumberTestData.zip, latinTestData.zip, SOLR-3630.patch (versions: 2)

asfimport commented 12 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I moved this one to Lucene as it is a new Lucene feature.

Funny number converter :-)

asfimport commented 12 years ago

Markus Klose (migrated from JIRA)

Fixed an encoding issue in the patch file.

latin text analysis [LUCENE-4229] #5301