apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

Most of the contributed Analyzers suffer from invalid recognition of acronyms. [LUCENE-1373] #2447

Closed asfimport closed 15 years ago

asfimport commented 16 years ago

#2145 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would be incorrectly tokenized as an acronym (note the dot at the end).
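To make the failure mode concrete, here is a rough sketch of why the trailing dot matters. The regular expressions are hypothetical approximations written for this illustration (they are not the JFlex grammar StandardTokenizer actually uses), and the class and method names are invented.

```java
import java.util.regex.Pattern;

// Illustration only: these patterns approximate the idea, not Lucene's real grammar.
final class AcronymBugSketch {
    // Loose rule (assumed): dot-terminated alphanumeric runs.
    // Accepts "I.B.M." but also "www.apache.org.".
    static final Pattern LOOSE_ACRONYM =
            Pattern.compile("[A-Za-z0-9]+\\.(?:[A-Za-z0-9]+\\.)+");
    // Stricter rule (assumed): single letters, each followed by a dot.
    static final Pattern STRICT_ACRONYM =
            Pattern.compile("[A-Za-z]\\.(?:[A-Za-z]\\.)+");
    // Host-like rule (assumed): alphanumeric runs separated by dots, no trailing dot.
    static final Pattern HOST =
            Pattern.compile("[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    // Mimics the buggy behaviour: anything dot-terminated looks like an acronym.
    static String classifyLoose(String text) {
        if (LOOSE_ACRONYM.matcher(text).matches()) return "ACRONYM";
        if (HOST.matcher(text).matches()) return "HOST";
        return "OTHER";
    }

    // Mimics the intended behaviour: only letter-dot sequences are acronyms.
    static String classifyStrict(String text) {
        if (STRICT_ACRONYM.matcher(text).matches()) return "ACRONYM";
        // Drop one trailing dot before testing for a host name.
        String trimmed = text.endsWith(".") ? text.substring(0, text.length() - 1) : text;
        if (HOST.matcher(trimmed).matches()) return "HOST";
        return "OTHER";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"www.apache.org.", "www.apache.org", "I.B.M."}) {
            System.out.println(s + " loose=" + classifyLoose(s) + " strict=" + classifyStrict(s));
        }
    }
}
```

Under the loose rule the host name with a trailing dot is indistinguishable from a real acronym, which is the misclassification this issue is about.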

Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us. StandardTokenizer has a couple of ways to indicate "fix this bug", but the default behaviour is still to be buggy. Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer, and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(

I refer to:

BrazilianAnalyzer
CzechAnalyzer
DutchAnalyzer
FrenchAnalyzer
GermanAnalyzer
GreekAnalyzer
ThaiAnalyzer

Migrated from LUCENE-1373 by Mark Lassau, resolved Oct 22 2009
Attachments: LUCENE-1373.patch
Linked issues:

- #3077
- #2477
- #2145
- #2228

asfimport commented 16 years ago

Mark Lassau (migrated from JIRA)

I would be willing to contribute a patch to make these Analyzers work in the next point release.

I see two ways to do this:

1) Introduce a static method to StandardTokenizerImpl, whereby you could set the "default" value of the replaceInvalidAcronym flag. You could then call setDefaultForReplaceInvalidAcronym(true) once from your code, and from then on, whenever anyone uses the old constructor, it would set replaceInvalidAcronym=true.
2) Add the replaceInvalidAcronym flag to all of the above Analyzers. Some of these have multiple constructors already, so I would probably just add a setter/getter to them.

The question is, which of the above would be preferred? Personally, I think the first is the least amount of work to do, and also the easiest to back out when you move on to v3.x and the "deprecated" behaviour is removed. However, doing 2) means the least disruption to core code.

Also, judging by the "Fix Version/s" field above, I am guessing that a v2.3.3 release is planned, therefore I guess I should provide a patch for the 2.3 branch as well as trunk which will end up as 2.4?
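The two options weighed above can be sketched side by side. This is a minimal illustration using invented stand-in classes (SketchTokenizer and SketchLanguageAnalyzer are hypothetical and are not Lucene classes); only the names replaceInvalidAcronym and setDefaultForReplaceInvalidAcronym come from the proposal itself.

```java
// Hypothetical stand-ins for illustration; none of these classes are Lucene's.
final class FlagPlacementSketch {

    /** Option 1: a process-wide default consulted by the old, flag-less constructor. */
    static final class SketchTokenizer {
        private static volatile boolean defaultReplaceInvalidAcronym = false;
        private final boolean replaceInvalidAcronym;

        // "Old" constructor: picks up whatever the static default currently is.
        SketchTokenizer() {
            this(defaultReplaceInvalidAcronym);
        }

        SketchTokenizer(boolean replaceInvalidAcronym) {
            this.replaceInvalidAcronym = replaceInvalidAcronym;
        }

        static void setDefaultForReplaceInvalidAcronym(boolean value) {
            defaultReplaceInvalidAcronym = value;
        }

        boolean isReplaceInvalidAcronym() {
            return replaceInvalidAcronym;
        }
    }

    /** Option 2: each analyzer carries its own flag and passes it to its tokenizer. */
    static final class SketchLanguageAnalyzer {
        private boolean replaceInvalidAcronym = false;

        void setReplaceInvalidAcronym(boolean value) {
            this.replaceInvalidAcronym = value;
        }

        boolean isReplaceInvalidAcronym() {
            return replaceInvalidAcronym;
        }

        SketchTokenizer newTokenizer() {
            return new SketchTokenizer(replaceInvalidAcronym);
        }
    }
}
```

In this sketch the static default is global mutable state that affects every caller in the JVM, which is why it is little work to add but broad in effect. The per-analyzer flag touches more classes but keeps each change local.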

asfimport commented 16 years ago

Mark Lassau (migrated from JIRA)

Causes JIRA issue JRA-15484.

asfimport commented 16 years ago

Mark Lassau (migrated from JIRA)

Had a closer look at the code, including changes in StandardAnalyzer. The static default idea would need a reworking of StandardAnalyzer.reusableTokenStream(), and so I think it is safer to just add the replaceInvalidAcronym flag to the affected Analyzers.

asfimport commented 16 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I think you should mirror what is done in StandardAnalyzer. You probably could create an abstract class that all of them inherit to share the common code.

Of course, it's still a bit weird, b/c in your case the type value is going to be set to ACRONYM, when your example is clearly not one. This suggests to me that the grammar needs to be revisited, but that can wait until 3.0 I believe.
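The shared-base-class suggestion above might look roughly like the following. All class names here are invented for the sketch and do not exist in Lucene; a real version would extend Lucene's Analyzer and build a token stream rather than return a description string.

```java
// Hypothetical sketch; these classes are invented and are not part of Lucene.
abstract class AcronymAwareAnalyzerSketch {
    // Common flag, declared once instead of once per language analyzer.
    private boolean replaceInvalidAcronym = false;

    public void setReplaceInvalidAcronym(boolean value) {
        this.replaceInvalidAcronym = value;
    }

    public boolean isReplaceInvalidAcronym() {
        return replaceInvalidAcronym;
    }

    /** Subclasses supply only what differs per language (here, just a name). */
    abstract String language();

    String describe() {
        return language() + "[replaceInvalidAcronym=" + replaceInvalidAcronym + "]";
    }
}

final class FrenchAnalyzerSketch extends AcronymAwareAnalyzerSketch {
    @Override
    String language() {
        return "French";
    }
}

final class DutchAnalyzerSketch extends AcronymAwareAnalyzerSketch {
    @Override
    String language() {
        return "Dutch";
    }
}
```

The point of the design is that the setter/getter pair lives in one place, so removing the deprecated behaviour later means editing a single class.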

asfimport commented 16 years ago

Mark Lassau (migrated from JIRA)

Just discovered #2228, which attempts to make StandardAnalyzer NOT be buggy by default. I think if the changes made to StandardAnalyzer there were moved to StandardTokenizer instead, then we would fix this issue.

asfimport commented 16 years ago

Mark Lassau (migrated from JIRA)

Added a draft patch to fix the default behaviour of StandardTokenizer. This basically involved moving the logic of #2228 from StandardAnalyzer to StandardTokenizer.

I added a unit test for StandardTokenizer, but unfortunately don't have time to add tests for the language analyzers listed above (FrenchAnalyzer etc...).

I will be away for 3 weeks, so if anyone else wants to pick up this issue, that would be great ;) ... otherwise I will come back and look at it then.

asfimport commented 15 years ago

Rob ten Hove (migrated from JIRA)

Is it possible that, because of the same bug, a property whose value ends in "Type" (like "InputFileType") is not indexed when the OS language is Dutch? I have two installations of Alfresco 3 Labs with Lucene 2.1.0 auto-installed, both with exactly the same installation options (English as the language for Alfresco). Apart from the hardware, the main difference is the OS language: both run XP with SP2, but one is English and the other Dutch. In the installation on the Dutch OS, three properties with values ending in "Type" could not be found, whereas they are present in the English version.

asfimport commented 15 years ago

Mark Lassau (migrated from JIRA)

@Rob This issue is about how Lucene parses ACRONYM tokens, which must contain a dot (e.g. "I.B.M."), so your problem is certainly not exactly the same.

Whether it is related to some other issue with Lucene analysers for different languages is not clear. It depends on the workings of your application, and I would suggest you contact the Alfresco developers with this question.

asfimport commented 15 years ago

Rob ten Hove (migrated from JIRA)

@Mark, thanks for your reply to my question. So far the developers who worked on the application I was talking about have been able to find a workaround. One thing is certain: the token analyzer mistreats the content, whether that content is an acronym or just plain text. It seems to interpret the content of database elements a bit too much rather than just treating it as plain content.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Dup of #3077.
