[PATCH] GermanAnalyzer problems with upper/lower case [LUCENE-87]

[PATCH] GermanAnalyzer problems with upper/lower case [LUCENE-87] #1165

Closed asfimport closed 18 years ago

asfimport commented 21 years ago

Hello!

I noticed some strange problems with the German analyzer when using field search on texts consisting of more than one word. For example, I had two documents in the search index: one had a field set to "Anfrage von mir", the other had it set to "Ticket von mir". While the search for "fieldname:anfrage" returned the expected document, "fieldname:ticket" did not return its document. After removing the special treatment of upper case words in the GermanStemmer, it worked properly.

All the best Philipp
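A minimal sketch of how case-dependent stemming can produce the symptom reported above. This is a toy stemmer, not the GermanStemmer source: the class name `CaseSensitiveStemSketch`, the `specialCaseUppercase` flag, and both suffix rules (in particular "strip a trailing t only from lowercase words") are assumptions chosen so the example reproduces the "Anfrage" vs. "Ticket" behaviour.

```java
import java.util.Locale;

final class CaseSensitiveStemSketch {

    // Toy stemmer with two hypothetical rules (NOT the real GermanStemmer rules):
    //   1. strip a trailing "e" from words longer than 3 characters
    //   2. strip a trailing "t" from words longer than 3 characters,
    //      but, when specialCaseUppercase is set, only if the input was lowercase
    static String stem(String term, boolean specialCaseUppercase) {
        if (term.isEmpty()) {
            return term;
        }
        boolean uppercase = Character.isUpperCase(term.charAt(0));
        String s = term.toLowerCase(Locale.GERMAN);
        if (s.length() > 3 && s.endsWith("e")) {
            return s.substring(0, s.length() - 1);
        }
        if (s.length() > 3 && s.endsWith("t") && !(specialCaseUppercase && uppercase)) {
            return s.substring(0, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        // Indexed token vs. query token, with the uppercase special case enabled.
        System.out.println(stem("Anfrage", true) + " / " + stem("anfrage", true));
        System.out.println(stem("Ticket", true) + " / " + stem("ticket", true));
        // Same tokens with the special case disabled.
        System.out.println(stem("Ticket", false) + " / " + stem("ticket", false));
    }
}
```

With the special case enabled, the indexed "Ticket" and the query "ticket" end up as different terms, so the query cannot match, while "Anfrage" happens to be unaffected by the case-dependent rule. With the special case disabled, both forms map to the same term.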

Migrated from LUCENE-87 by Philipp Meister, resolved May 27 2006

Environment:

Operating System: All
Platform: PC

Attachments: ASF.LICENSE.NOT.GRANTED--CorrectGermanAnalyzer.java, ASF.LICENSE.NOT.GRANTED--CorrectGermanStemmer.java, ASF.LICENSE.NOT.GRANTED--GermanAnalyzer.diff, ASF.LICENSE.NOT.GRANTED--lucene_german_stemmer.diff

asfimport commented 21 years ago

Mirko Ebert (migrated from JIRA)

I have an additional example: the result of the query "Aehnelt" is not equal to the result of the query "aehnelt".

asfimport commented 21 years ago

Philipp Meister (migrated from JIRA)

Created an attachment (id=6542) Analyzer that ignores upper/lowercase

asfimport commented 21 years ago

Philipp Meister (migrated from JIRA)

Created an attachment (id=6543) Stemmer that ignores upper/lowercase

asfimport commented 21 years ago

Philipp Meister (migrated from JIRA)

Mirko, the two files I have attached are copies of the original classes, except that they ignore the difference between lowercase and uppercase.

asfimport commented 20 years ago

Daniel Naber (migrated from JIRA)

*** Bug 12569 has been marked as a duplicate of this bug. ***

asfimport commented 20 years ago

Daniel Naber (migrated from JIRA)

Created an attachment (id=11027) no special uppercase handling

asfimport commented 20 years ago

Daniel Naber (migrated from JIRA)

I added an attachment that does the same as attachment 6543, only that it's a clean patch against the latest CVS version.

asfimport commented 20 years ago

Daniel Naber (migrated from JIRA)

Here's a patch that fixes the bug and does a bit more, obsoleting all other attachments to this report. What it does:

GermanAnalyzer.java:
- use LowerCaseFilter
- Hashtable -> HashSet, deprecate the old methods

GermanStemmer.java:
- no special handling for uppercase words, this confuses people more than it helps

WordListLoader:
- avoid silent failure for null filenames
- trim() the lines from the stopword file
- simplify implementation, using HashSet add instead of array copying
- add a TODO: this isn't specific for German, should be moved
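The WordListLoader points above could look roughly like the following. This is only a sketch of the described behaviour, not the attached patch: the class name `StopwordLoaderSketch`, the `load(Reader)` signature (a `Reader` instead of a file name, to keep the example self-contained), the choice of `IllegalArgumentException`, and the skipping of blank lines are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

final class StopwordLoaderSketch {

    // Reads one stopword per line into a HashSet.
    static Set<String> load(Reader reader) {
        // Fail loudly instead of silently returning an empty word list.
        if (reader == null) {
            throw new IllegalArgumentException("stopword source must not be null");
        }
        Set<String> words = new HashSet<>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                // trim() so that stray whitespace in the file does not create
                // entries such as "der " that never match a token.
                String word = line.trim();
                if (!word.isEmpty()) { // assumption: blank lines are ignored
                    words.add(word);   // Set.add replaces manual array copying
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> stopwords = load(new StringReader("  der \nund\n\n die\t\n"));
        System.out.println(stopwords.contains("der")); // true
        System.out.println(stopwords.size());          // 3
    }
}
```

Using a set also makes the stopword lookup itself a single `contains` call, which fits the Hashtable -> HashSet change listed for GermanAnalyzer.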

I hope this can be applied before 1.4 is released.

asfimport commented 20 years ago

Daniel Naber (migrated from JIRA)

Created an attachment (id=11050) bug fix + other small enhancements, see my comment

asfimport commented 20 years ago

Otis Gospodnetic (migrated from JIRA)

Daniel, Thanks for the patch. Before I apply it, could you please explain to me why it is okay to ignore upper/lower case characters for a German language stemmer? Nouns are upper-cased in German, so wouldn't the case have a special meaning to consider before stemming a word?

Furthermore, would you happen to know whether this GermanStemmer is superior to, or different from, the 2 Snowball stemmers for German?

Thanks.

asfimport commented 20 years ago

Daniel Naber (migrated from JIRA)

Otis,

the problem with uppercase is that any word at the beginning of a sentence starts with an uppercase character (just like in English). So unless you've got a sophisticated sentence boundary detection you cannot conclude that a word is a noun just because it starts with an uppercase character.

Comment #2 had an example: "ähnelt" (a verb) vs. "Ähnelt" (the same verb, but appearing at the beginning of a sentence, which is okay).

I didn't have a closer look at the Snowball stemmers, so I cannot comment on that.
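The sentence-start problem described above can be illustrated with a small sketch. It assumes a naive "capitalized means noun" heuristic (the class `CapitalizationHeuristicSketch` and its methods are hypothetical, not Lucene code) and uses the ASCII spelling "Aehnelt" from the earlier comment in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CapitalizationHeuristicSketch {

    // Naive heuristic: treat every capitalized token as a noun.
    static boolean looksLikeNoun(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // Returns all whitespace-separated tokens the heuristic flags as nouns.
    static List<String> flaggedAsNouns(String sentence) {
        List<String> flagged = new ArrayList<>();
        for (String token : sentence.split("\\s+")) {
            if (looksLikeNoun(token)) {
                flagged.add(token);
            }
        }
        return flagged;
    }

    // Case-insensitive alternative: lowercase every token before stemming.
    static String normalize(String token) {
        return token.toLowerCase(Locale.GERMAN);
    }

    public static void main(String[] args) {
        // "Aehnelt" is a verb, capitalized only because it starts the sentence.
        System.out.println(flaggedAsNouns("Aehnelt das Ticket der Anfrage"));
        // prints [Aehnelt, Ticket, Anfrage]
        System.out.println(normalize("Aehnelt").equals(normalize("aehnelt")));
        // prints true
    }
}
```

Without sentence boundary detection the heuristic cannot tell the sentence-initial verb from the real nouns, whereas lowercasing first gives the same term for both spellings.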

asfimport commented 20 years ago

Otis Gospodnetic (migrated from JIRA)

The start of sentence vs. noun comment - I see.

I have made this change, although it breaks backwards compatibility of GermanAnalyzer.