apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

[PATCH] Test case for FrenchAnalyzer [LUCENE-172] #1250

Closed asfimport closed 18 years ago

asfimport commented 20 years ago

Hello,

Following is a test case for the French Analyzer to help it get out of the sandbox :) It looks OK, apart from some strange behavior with the minus sign. I also included a slight modification of the Analyzer so that it handles null parameters better, just in case.

package org.apache.lucene.analysis.fr;

/* ====================================================================

import java.io.Reader;
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**

public class TestFrenchAnalyzer extends TestCase {

public void assertAnalyzesTo(Analyzer a, String input, String[] output)
    throws Exception {

    TokenStream ts = a.tokenStream("dummy", new StringReader(input));

    // Each expected term must match the next token produced by the analyzer.
    for (int i = 0; i < output.length; i++) {
        Token t = ts.next();
        assertNotNull(t);
        assertEquals(output[i], t.termText());
    }
    // The stream must be exhausted once all expected terms are consumed.
    assertNull(ts.next());
    ts.close();
}

public void testAnalyzer() throws Exception {
    FrenchAnalyzer fa = new FrenchAnalyzer();

    // test null reader
    boolean iaeFlag = false;
    try {
        TokenStream ts = fa.tokenStream("dummy", null);
    } catch (IllegalArgumentException iae) {
        iaeFlag = true;
    }
    assertEquals(iaeFlag, true);

    // test null fieldname
    // reset the flag first, otherwise this check passes without any exception
    iaeFlag = false;
    try {
        TokenStream ts = fa.tokenStream(null, new StringReader("dummy"));
    } catch (IllegalArgumentException iae) {
        iaeFlag = true;
    }
    assertEquals(iaeFlag, true);

    assertAnalyzesTo(fa, "", new String[] {
    });

    assertAnalyzesTo(
        fa,
        "chien chat cheval",
        new String[] { "chien", "chat", "cheval" });

    assertAnalyzesTo(
        fa,
        "chien CHAT CHEVAL",
        new String[] { "chien", "chat", "cheval" });

    assertAnalyzesTo(
        fa,
        "  chien  ,? + = -  CHAT /: > CHEVAL",
        new String[] { "chien", "chat", "cheval" });

    assertAnalyzesTo(fa, "chien++", new String[] { "chien" });

    assertAnalyzesTo(
        fa,
        "mot \"entreguillemet\"",
        new String[] { "mot", "entreguillemet" });

    // let's do some french specific tests now

    // 1. couldn't resist
    // I would expect this to stay one term as in French the minus sign
    // is often used for composing words
    assertAnalyzesTo(
        fa,
        "Jean-François",
        new String[] { "jean", "françois" });

    // 2. stopwords
    assertAnalyzesTo(
        fa,
        "le la chien les aux chat du des à cheval",
        new String[] { "chien", "chat", "cheval" });

    // some nouns and adjectives
    assertAnalyzesTo(
        fa,
        "lances chismes habitable chiste éléments captifs",
        new String[] {
            "lanc",
            "chism",
            "habit",
            "chist",
            "élément",
            "captif" });

    // some verbs
    assertAnalyzesTo(
        fa,
        "finissions souffrirent rugissante",
        new String[] { "fin", "souffr", "rug" });

    // some everything else
    // aujourd'hui stays one term which is OK
    assertAnalyzesTo(
        fa,
        "C3PO aujourd'hui oeuf ïâöûàä anticonstitutionnellement Java++",
        new String[] {
            "c3po",
            "aujourd'hui",
            "oeuf",
            "ïâöûàä",
            "anticonstitutionnel",
            "jav" });

    // some more everything else
    // here 1940-1945 stays as one term, 1940:1945 not ?
    assertAnalyzesTo(
        fa,
        "33Bis 1940-1945 1940:1945 (---i+++)*",
        new String[] { "33bis", "1940-1945", "1940", "1945", "i" });

}

}

package org.apache.lucene.analysis.fr;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import java.io.File;
import java.io.Reader;
import java.util.Hashtable;
import org.apache.lucene.analysis.de.WordlistLoader;

/**


Migrated from LUCENE-172 by Jean-François Halleux, resolved May 27 2006

Environment:

Operating System: other
Platform: Other

Attachments: ASF.LICENSE.NOT.GRANTED--FrenchAnalyzer.java, ASF.LICENSE.NOT.GRANTED--patch2.txt, ASF.LICENSE.NOT.GRANTED--TestFrenchAnalyzer.java
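The null-parameter handling mentioned in the description is only visible here through the test, which expects an IllegalArgumentException for a null reader and for a null field name. A minimal sketch of such a guard, assuming nothing about the attached FrenchAnalyzer itself (the class and method names below are hypothetical and not part of Lucene):

```java
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration only; not the code from the attached FrenchAnalyzer.
final class NullArgumentGuardSketch {

    /** Rejects null arguments before any analysis work starts. */
    static void checkArguments(String fieldName, Reader reader) {
        if (fieldName == null) {
            throw new IllegalArgumentException("fieldName must not be null");
        }
        if (reader == null) {
            throw new IllegalArgumentException("reader must not be null");
        }
    }

    /** Returns true if checkArguments rejects the given pair of arguments. */
    static boolean rejects(String fieldName, Reader reader) {
        try {
            checkArguments(fieldName, reader);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("dummy", null));
        System.out.println(rejects(null, new StringReader("dummy")));
        System.out.println(rejects("dummy", new StringReader("dummy")));
    }
}
```

Returning the outcome from a small helper such as rejects avoids the shared boolean flag used in the test, which has to be reset by hand between the two checks.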

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

Thanks for the test! It fails for me, though. I have committed the test file and the updated analyzer to the sandbox anyway, and I look forward to a patch that fixes the test case :)

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

It looks like the special French characters were turned into something else when you copied the source into your local CVS checkout. For me at least, they display correctly in Bugzilla. They are all in the 0-255 range, and the previous version of FrenchAnalyzer in CVS had them right.
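One way characters in the 0-255 range can get mangled is when bytes written in one charset are read back in another. The sketch below is an assumption about the kind of mismatch that may be involved, not a diagnosis of this particular checkout; the class name is hypothetical, and the sample word is spelled with Unicode escapes so the example itself stays ASCII-only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses standard JDK charset support.
final class CharsetMismatchDemo {

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String word = "\u00e9l\u00e9ment"; // the French word with two e-acute characters

        // Same charset on both sides: the text survives unchanged.
        String intact = roundTrip(word, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);

        // Latin-1 bytes read as UTF-8: the accented bytes are malformed input
        // and get substituted with the replacement character U+FFFD.
        String mangled = roundTrip(word, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(intact.equals(word));
        System.out.println(mangled.equals(word));
        System.out.println(mangled.indexOf('\uFFFD') >= 0);
    }
}
```

A string compared after such a round trip no longer equals the original, which is the sort of difference a term-by-term assertion would report.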

JF

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

Could you please attach a patch file (cvs diff -u) or the entire file - as an attachment - so nothing can get lost in copy/paste?

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Created an attachment (id=10072) This attachment contains a patch against your latest commits. Here the test case runs fine.

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

Sorry for my incompetence, but I cannot get the patch files to apply appropriately:

patch -p0 < patch.txt
(Stripping trailing CRs from patch.)
patching file java/org/apache/lucene/analysis/fr/FrenchAnalyzer.java
Hunk #1 FAILED at 1.
1 out of 1 hunk FAILED -- saving rejects to file java/org/apache/lucene/analysis/fr/FrenchAnalyzer.java.rej
(Stripping trailing CRs from patch.)

Could you please attach the full files and I will simply replace my local copies and commit them?
Thanks!

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Created an attachment (id=10073) the French Analyzer file

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Created an attachment (id=10074) The test case

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

The test is still failing for me after applying your latest patch. The differences seem pretty dramatic, so be sure to use CVS HEAD. I've committed what you sent, but I'm getting this failure:

test:
[junit] Testsuite: org.apache.lucene.analysis.fr.TestFrenchAnalyzer
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.487 sec

[junit] Testcase: testAnalyzer(org.apache.lucene.analysis.fr.TestFrenchAnalyzer):   FAILED
[junit] expected:<...?...> but was:<...?...>
[junit] junit.framework.ComparisonFailure: expected:<...?...> but was:<...?...>
[junit]     at org.apache.lucene.analysis.fr.TestFrenchAnalyzer.assertAnalyzesTo(TestFrenchAnalyzer.java:84)
[junit]     at org.apache.lucene.analysis.fr.TestFrenchAnalyzer.testAnalyzer(TestFrenchAnalyzer.java:141)
[junit]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

asfimport commented 20 years ago

Jean-François Halleux (migrated from JIRA)

Strange...

I just did a full checkout of lucene and sandbox, ran the test, and it worked properly. Could there be a problem with the locale? Can anybody else try this?

Jeff

asfimport commented 20 years ago

Erik Hatcher (@erikhatcher) (migrated from JIRA)

Well, if it works for you, I'll close this issue. I'm far from being I18N savvy, so it is likely a locale issue on my end... although surely the test case can be made to pass for me somehow?
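One option that might make the test independent of the machine's locale and file encoding is to keep the source ASCII-only and spell the accented characters as Unicode escapes. This is a sketch under the assumption that source-file encoding is the culprit, which the thread does not confirm; the class name is hypothetical. Alternatively, the encoding can be stated explicitly at compile time with the javac -encoding option.

```java
// Hypothetical demo class: the same strings as in the test, written with
// Unicode escapes so the compiler sees identical characters on every platform.
final class UnicodeEscapeDemo {

    static final String ELEMENT = "\u00e9l\u00e9ment"; // e-acute, l, e-acute, m, e, n, t
    static final String FRANCOIS = "fran\u00e7ois";    // c-cedilla in fifth position

    public static void main(String[] args) {
        // Both values depend only on the escapes, not on the file's encoding.
        System.out.println(ELEMENT.length());
        System.out.println(Integer.toHexString(FRANCOIS.charAt(4)));
    }
}
```

The escapes are harder to read than the literal characters, so a short comment next to each constant helps keep the expected terms recognisable.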