Closed asfimport closed 18 years ago
Mirko Ebert (migrated from JIRA)
I have an additional example: the result of the query "Aehnelt" is not equal to the result of the query "aehnelt".
Philipp Meister (migrated from JIRA)
Created an attachment (id=6542) Analyzer that ignores upper/lowercase
Philipp Meister (migrated from JIRA)
Created an attachment (id=6543) Stemmer that ignores upper/lowercase
Philipp Meister (migrated from JIRA)
Mirko, the two files I have attached are copies of the original classes, except that they ignore the difference between lowercase and uppercase.
Daniel Naber (migrated from JIRA)
*** Bug 12569 has been marked as a duplicate of this bug. ***
Daniel Naber (migrated from JIRA)
Created an attachment (id=11027) no special uppercase handling
Daniel Naber (migrated from JIRA)
I added an attachment that does the same as attachment 6543, only that it's a clean patch against the latest CVS version.
Daniel Naber (migrated from JIRA)
Here's a patch that fixes the bug and does a bit more, obsoleting all other attachments to this report. What it does:
GermanAnalyzer.java:
- use LowerCaseFilter
- Hashtable -> HashSet, deprecate the old methods
GermanStemmer.java:
- no special handling for uppercase words, this confuses people more than it helps
WordListLoader:
- avoid silent failure for null filenames
- trim() the lines from the stopword file
- simplify implementation, using HashSet add instead of array copying
- add a TODO: this isn't specific for German, should be moved
I hope this can be applied before 1.4 is released.
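To illustrate the WordListLoader points above (fail loudly on null input, trim() each line, collect into a HashSet), here is a minimal standalone sketch. It is not the code from the patch: the class name StopwordLoaderSketch and the method loadWords are hypothetical, it reads from a Reader rather than a filename, and skipping blank lines is an extra assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not the actual Lucene WordListLoader.
final class StopwordLoaderSketch {

    // Reads one word per line into a HashSet, trimming surrounding whitespace.
    static Set<String> loadWords(Reader reader) {
        if (reader == null) {
            // Fail loudly instead of silently returning an empty set.
            throw new IllegalArgumentException("reader must not be null");
        }
        Set<String> words = new HashSet<>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                // Assumption: blank lines are skipped rather than stored as "".
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> stopwords = loadWords(new StringReader("und \n  oder\n\nvon\n"));
        System.out.println(stopwords.contains("und")); // true: trailing space was trimmed
        System.out.println(stopwords.size());          // 3: the blank line is skipped
    }
}
```

Using a Set keeps membership checks cheap and avoids the array copying of the old implementation.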
Daniel Naber (migrated from JIRA)
Created an attachment (id=11050) bug fix + other small enhancements, see my comment
Otis Gospodnetic (migrated from JIRA)
Daniel, Thanks for the patch. Before I apply it, could you please explain to me why it is okay to ignore upper/lower case characters for a German language stemmer? Nouns are upper-cased in German, so wouldn't the case have a special meaning to consider before stemming a word?
Furthermore, would you happen to know whether this GermanStemmer is superior to, or different from, the two Snowball stemmers for German?
Thanks.
Daniel Naber (migrated from JIRA)
Otis,
the problem with uppercase is that any word at the beginning of a sentence starts with an uppercase character (just like in English). So unless you've got a sophisticated sentence boundary detection you cannot conclude that a word is a noun just because it starts with an uppercase character.
Comment #2 had an example: "ähnelt" (a verb) vs. "Ähnelt" (the same verb, but appearing at the beginning of a sentence, which is okay).
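As a rough illustration of why folding case before stemming helps here, the sketch below treats two tokens as the same term once both are lowercased. It is a toy example only: CaseFoldingSketch, normalize and sameTerm are hypothetical names, not part of GermanAnalyzer or GermanStemmer, and it uses the "Aehnelt"/"aehnelt" spelling from the earlier comment.

```java
import java.util.Locale;

// Toy illustration; these helpers are hypothetical and not Lucene code.
final class CaseFoldingSketch {

    // Fold case with an explicit locale so the result does not depend on the JVM default.
    static String normalize(String token) {
        return token.toLowerCase(Locale.GERMAN);
    }

    // Two tokens count as the same term if they agree after case folding.
    static boolean sameTerm(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        // Sentence-initial vs. mid-sentence spelling of the same verb.
        System.out.println("Aehnelt".equals("aehnelt"));    // false: raw comparison is case-sensitive
        System.out.println(sameTerm("Aehnelt", "aehnelt")); // true: both fold to "aehnelt"
    }
}
```

The trade-off is that a capitalized noun and a lowercase word with the same spelling also collapse into one term.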
I didn't have a closer look at the Snowball stemmers, so I cannot comment on that.
Otis Gospodnetic (migrated from JIRA)
The start of sentence vs. noun comment - I see.
I have made this change, although it breaks backwards compatibility of GermanAnalyzer.
Hello!
I've noticed some strange problems with the German analyzer when using field search on texts consisting of more than one word. For example, I had two documents in the search index: one had a field set to "Anfrage von mir", the other had it set to "Ticket von mir". While the search for "fieldname:anfrage" returned the expected document, "fieldname:ticket" did not return its document. After removing the special treatment of uppercase words in the GermanStemmer, it worked properly.
All the best Philipp
Migrated from LUCENE-87 by Philipp Meister, resolved May 27 2006 Environment:
Attachments: ASF.LICENSE.NOT.GRANTED--CorrectGermanAnalyzer.java, ASF.LICENSE.NOT.GRANTED--CorrectGermanStemmer.java, ASF.LICENSE.NOT.GRANTED--GermanAnalyzer.diff, ASF.LICENSE.NOT.GRANTED--lucene_german_stemmer.diff