[PATCH] Test case for FrenchAnalyzer [LUCENE-172]

asfimport commented 20 years ago

Hello,

following is a test case for the French Analyzer to help it get out of the sandbox :) Looks OK, only has some strange behavior with the minus sign. I included a slight modification of the Analyzer to better handle null parameters just in case of.

—

package org.apache.lucene.analysis.fr;

/* ====================================================================

The Apache Software License, Version 1.1 *
Copyright (c) 2001 The Apache Software Foundation. All rights
reserved. *
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met: *
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer. *
1. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution. *
1. The end-user documentation included with the redistribution,
if any, must include the following acknowledgment:
"This product includes software developed by the
Apache Software Foundation (http://www.apache.org/)."
Alternately, this acknowledgment may appear in the software itself,
if and wherever such third-party acknowledgments normally appear. *
1. The names "Apache" and "Apache Software Foundation" and
"Apache Lucene" must not be used to endorse or promote products
derived from this software without prior written permission. For
written permission, please contact apache@apache.org. *
1. Products derived from this software may not be called "Apache",
"Apache Lucene", nor may "Apache" appear in their name, without
prior written permission of the Apache Software Foundation. *
THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.
==================================================================== *
This software consists of voluntary contributions made by many
individuals on behalf of the Apache Software Foundation. For more
information on the Apache Software Foundation, please see
<http://www.apache.org/>. */

import java.io.Reader; import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.Token; import org.apache.lucene.analysis.TokenStream;

/**

Test case for FrenchAnalyzer. *
@author Jean-FranÃ§ois Halleux
@version $version$ */

public class TestFrenchAnalyzer extends TestCase {

public void assertAnalyzesTo(Analyzer a, String input, String[] output)
    throws Exception {

    TokenStream ts = a.tokenStream("dummy", new StringReader

(input));

    for (int i = 0; i &lt;output.length; i++) {
        Token t = ts.next();
        assertNotNull(t);
        assertEquals(t.termText(), output[i]);
    }
    assertNull(ts.next());
    ts.close();
}

public void testAnalyzer() throws Exception {
    FrenchAnalyzer fa = new FrenchAnalyzer();

    // test null reader
    boolean iaeFlag = false;
    try {
        TokenStream ts = fa.tokenStream("dummy", null);
    } catch (IllegalArgumentException iae) {
        iaeFlag = true;
    }
    assertEquals(iaeFlag, true);

    // test null fieldname
    iaeFlag = true;
    try {
        TokenStream ts = fa.tokenStream(null, new StringReader

("dummy")); } catch (IllegalArgumentException iae) { iaeFlag = true; } assertEquals(iaeFlag, true);

    assertAnalyzesTo(fa, "", new String[] {
    });

    assertAnalyzesTo(
        fa,
        "chien chat cheval",
        new String[] { "chien", "chat", "cheval" });

    assertAnalyzesTo(
        fa,
        "chien CHAT CHEVAL",
        new String[] { "chien", "chat", "cheval" });

    assertAnalyzesTo(
        fa,
        "  chien  ,? + = -  CHAT /: &gt; CHEVAL",
        new String[] { "chien", "chat", "cheval" });

    assertAnalyzesTo(fa, "chien++", new String[] { "chien" });

    assertAnalyzesTo(
        fa,
        "mot \"entreguillemet\"",
        new String[] { "mot", "entreguillemet" });

    // let's do some french specific tests now  

    // 1. couldn't resist
    // I would expect this to stay one term as in French the minus

sign // is often used for composing words assertAnalyzesTo( fa, "Jean-FranÃ§ois", new String[] { "jean", "franÃ§ois" });

    // 2. stopwords
    assertAnalyzesTo(
        fa,
        "le la chien les aux chat du des Ã  cheval",
        new String[] { "chien", "chat", "cheval" });

    // some nouns and adjectives
    assertAnalyzesTo(
        fa,
        "lances chismes habitable chiste Ã©lÃ©ments captifs",
        new String[] {
            "lanc",
            "chism",
            "habit",
            "chist",
            "Ã©lÃ©ment",
            "captif" });

    // some verbs
    assertAnalyzesTo(
        fa,
        "finissions souffrirent rugissante",
        new String[] { "fin", "souffr", "rug" });

    // some everything else
    // aujourd'hui stays one term which is OK
    assertAnalyzesTo(
        fa,
        "C3PO aujourd'hui oeuf Ã¯Ã¢Ã¶Ã»Ã Ã¤ anticonstitutionnellement

Java++", new String[] { "c3po", "aujourd'hui", "oeuf", "Ã¯Ã¢Ã¶Ã»Ã Ã¤", "anticonstitutionnel", "jav" });

    // some more everything else
    // here 1940-1945 stays as one term, 1940:1945 not ?
    assertAnalyzesTo(
        fa,
        "33Bis 1940-1945 1940:1945 (---i+++)\*",
        new String[] { "33bis", "1940-

1945", "1940", "1945", "i" });

}

—

package org.apache.lucene.analysis.fr;

import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.LowerCaseFilter; import org.apache.lucene.analysis.StopFilter; import org.apache.lucene.analysis.TokenStream; import org.apache.lucene.analysis.standard.StandardFilter; import org.apache.lucene.analysis.standard.StandardTokenizer; import java.io.File; import java.io.Reader; import java.util.Hashtable; import org.apache.lucene.analysis.de.WordlistLoader;

/**

Analyzer for french language. Supports an external list of stopwords (words that
will not be indexed at all) and an external list of exclusions (word that will
not be stemmed, but indexed).
A default set of stopwords is used unless an other list is specified, the
exclusionlist is empty by default. *
@author Patrick Talbot (based on Gerhard Schwarz work for German)
@version $Id: FrenchAnalyzer.java,v 1.1 2004/01/20 10:07:01 ehatcher Exp $ */ public final class FrenchAnalyzer extends Analyzer {

/**
Extended list of typical french stopwords. */ private String[] FRENCH_STOP_WORDS = {

"a", "afin", "ai", "ainsi", "aprÃ¨s", "attendu", "au", "aujourd", "auquel ", "aussi",

"autre", "autres", "aux", "auxquelles", "auxquels", "avait", "avant", "a vec", "avoir",

"c", "car", "ce", "ceci", "cela", "celle", "celles", "celui", "cependant ", "certain",

"certaine", "certaines", "certains", "ces", "cet", "cette", "ceux", "che z", "ci",

"combien", "comme", "comment", "concernant", "contre", "d", "dans", "de" , "debout",

"dedans", "dehors", "delÃ ", "depuis", "derriÃ¨re", "des", "dÃ©sormais", "d esquelles",

"desquels", "dessous", "dessus", "devant", "devers", "devra", "divers", "diverse",

"diverses", "doit", "donc", "dont", "du", "duquel", "durant", "dÃ¨s", "el le", "elles",

"en", "entre", "environ", "est", "et", "etc", "etre", "eu", "eux", "exce ptÃ©", "hormis",

"hors", "hÃ©las", "hui", "il", "ils", "j", "je", "jusqu", "jusque", "l", "la", "laquelle",

"le", "lequel", "les", "lesquelles", "lesquels", "leur", "leurs", "lorsq ue", "lui", "lÃ ",

"ma", "mais", "malgrÃ©", "me", "merci", "mes", "mien", "mienne", "miennes ", "miens", "moi",

"moins", "mon", "moyennant", "mÃªme", "mÃªmes", "n", "ne", "ni", "non", "n os", "notre",

"nous", "nÃ©anmoins", "nÃ´tre", "nÃ´tres", "on", "ont", "ou", "outre", "oÃ¹" , "par", "parmi",

"partant", "pas", "passÃ©", "pendant", "plein", "plus", "plusieurs", "pou r", "pourquoi",

"proche", "prÃ¨s", "puisque", "qu", "quand", "que", "quel", "quelle", "qu elles", "quels",

"qui", "quoi", "quoique", "revoici", "revoilÃ ", "s", "sa", "sans", "sauf ", "se", "selon",

"seront", "ses", "si", "sien", "sienne", "siennes", "siens", "sinon", "s oi", "soit",

"son", "sont", "sous", "suivant", "sur", "ta", "te", "tes", "tien", "tie nne", "tiennes",

"tiens", "toi", "ton", "tous", "tout", "toute", "toutes", "tu", "un", "u ne", "va", "vers",

"voici", "voilÃ ", "vos", "votre", "vous", "vu", "vÃ´tre", "vÃ´tres", "y", "Ã ", "Ã§a", "Ã¨s", "Ã©tÃ©", "Ãªtre", "Ã´" };

/**
Contains the stopwords used with the StopFilter. */ private Hashtable stoptable = new Hashtable(); /**
Contains words that should be indexed but not stemmed. */ private Hashtable excltable = new Hashtable();

/**
Builds an analyzer. */ public FrenchAnalyzer() { stoptable = StopFilter.makeStopTable( FRENCH_STOP_WORDS ); }

/**
Builds an analyzer with the given stop words. */ public FrenchAnalyzer( String[] stopwords ) { stoptable = StopFilter.makeStopTable( stopwords ); }

/**
Builds an analyzer with the given stop words. */ public FrenchAnalyzer( Hashtable stopwords ) { stoptable = stopwords; }

/**
Builds an analyzer with the given stop words. */ public FrenchAnalyzer( File stopwords ) { stoptable = WordlistLoader.getWordtable( stopwords ); }

/**
Builds an exclusionlist from an array of Strings. */ public void setStemExclusionTable( String[] exclusionlist ) { excltable = StopFilter.makeStopTable( exclusionlist ); } /**
Builds an exclusionlist from a Hashtable. */ public void setStemExclusionTable( Hashtable exclusionlist ) { excltable = exclusionlist; } /**
Builds an exclusionlist from the words contained in the given file. */ public void setStemExclusionTable( File exclusionlist ) { excltable = WordlistLoader.getWordtable( exclusionlist ); }

/**
Creates a TokenStream which tokenizes all the text in the provided Reader. *
@return A TokenStream build from a StandardTokenizer filtered with

StandardFilter, StopFilter, FrenchStemFilter and LowerCaseFilter */ public final TokenStream tokenStream( String fieldName, Reader reader ) {

if (fieldName==null) throw new IllegalArgumentException

("fieldName must not be null"); if (reader==null) throw new IllegalArgumentException("reader must not be null");

TokenStream result = new StandardTokenizer( reader );
result = new StandardFilter( result );
result = new StopFilter( result, stoptable );
result = new FrenchStemFilter( result, excltable );
// Convert to lowercase after stemming!
result = new LowerCaseFilter( result );
return result;

} }

Migrated from LUCENE-172 by Jean-François Halleux, resolved May 27 2006 Environment:

Operating System: other
Platform: Other

Attachments: ASF.LICENSE.NOT.GRANTED--FrenchAnalyzer.java, ASF.LICENSE.NOT.GRANTED--patch2.txt, ASF.LICENSE.NOT.GRANTED--TestFrenchAnalyzer.java

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

thanks for the test! the test fails for me though. i have committed the test file and updated analyzer to the sandbox though. i look forward to a patch that fixes the test case :)

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Looks like special French characters got transformed to something weird when you copied the source to your local CVS. For me at least, they appear well in Bugzilla. They are in the 0-255 range. The previous version of FrenchAnalyzer in CVS had them right.

JF

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

Could you please attach a patch file (cvs diff -u) or the entire file - as an attachment - so nothing can get lost in copy/paste?

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Created an attachment (id=10072) This attachement contains a patch to your latest commits. Here the test case runs fine.

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

Sorry for my incompetence, but I cannot get the patch files to apply appropriately:

patch -p0 < patch.txt (Stripping trailing CRs from patch.) patching file java/org/apache/lucene/analysis/fr/FrenchAnalyzer.java Hunk #1 FAILED at 1. 1 out of 1 hunk FAILED – saving rejects to file java/org/apache/lucene/analysis/fr/ FrenchAnalyzer.java.rej (Stripping trailing CRs from patch.)

Could you please attach the full files and I will simply replace my local copies and commit them?
Thanks!

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Created an attachment (id=10073) the French Analyzer file

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Created an attachment (id=10074) The test case

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

Test still failing for me after applying your latest patch. The differences seem pretty dramatic - be sure to use CVS HEAD. I've committed what you sent, but I'm getitng this failure:

test: [junit] Testsuite: org.apache.lucene.analysis.fr.TestFrenchAnalyzer [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.487 sec

[junit] Testcase: testAnalyzer(org.apache.lucene.analysis.fr.TestFrenchAnalyzer):   FAILED
[junit] expected:&lt;...?...&gt; but was:&lt;...?...&gt;
[junit] junit.framework.ComparisonFailure: expected:&lt;...?...&gt; but was:&lt;...?...&gt;
[junit]     at

org.apache.lucene.analysis.fr.TestFrenchAnalyzer.assertAnalyzesTo(TestFrenchAnalyzer.java:84) [junit] at org.apache.lucene.analysis.fr.TestFrenchAnalyzer.testAnalyzer(TestFrenchAnalyzer.java:141) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Strange...

Just did a full checkout of lucene and sandbox, run the test and it worked properly. Could there be a problem with the locale? Anybody can try this?

Jeff

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

well, if it works for you, i'll close this issue. i'm far from being I18N savvy, so it is likely a locale issue on my end.... although surely the test case can be made to pass for me somehow?

apache / lucene

[PATCH] Test case for FrenchAnalyzer [LUCENE-172] #1250