apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

NGramTokenizer strips whitespace, with no option to keep leading and trailing whitespace [LUCENE-3979] #5052

Open asfimport opened 12 years ago

asfimport commented 12 years ago

org.apache.lucene.analysis.ngram.NGramTokenizer removes whitespace, making a search for literal strings like " test" and "test " equivalent to "test". Searching with relevant whitespace is sometimes desired, particularly where ngrams are used.

This could be fixed by either removing .trim() from the line shown below, or by providing a flag to specifically set trimming behaviour (keeping trim=true as the default so that existing code using this analyzer is not broken).

111: inStr = new String(chars).trim(); // remove any trailing empty strings
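
To illustrate the effect and the proposed flag, here is a minimal stand-alone sketch. It does not use the Lucene API: the class `NGramTrimSketch`, the method `ngrams`, and the `trim` parameter are hypothetical names made up for this example, and the gram ordering is a simplification rather than the tokenizer's actual behaviour.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; this is not the Lucene NGramTokenizer. */
final class NGramTrimSketch {

    /**
     * Returns every n-gram of the input with a length from minGram to maxGram.
     * The trim flag is the hypothetical option proposed in this issue:
     * true mimics the current behaviour, false keeps surrounding whitespace.
     */
    static List<String> ngrams(String input, int minGram, int maxGram, boolean trim) {
        String inStr = trim ? input.trim() : input;
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= inStr.length(); start++) {
                grams.add(inStr.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With trimming, " test" yields the same grams as "test".
        System.out.println(ngrams(" test", 2, 2, true));  // prints [te, es, st]
        // Without trimming, the leading space survives as part of a gram.
        System.out.println(ngrams(" test", 2, 2, false)); // prints [ t, te, es, st]
    }
}
```

Defaulting such a flag to true would leave existing indexes and queries unaffected, while callers who need whitespace-sensitive matching could opt out.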


Migrated from LUCENE-3979 by David Mason

Environment: n/a

asfimport commented 12 years ago

David Mason (migrated from JIRA)

I'm happy to submit a patch for this, but I haven't contributed to this or similar projects before, so it will take me a while to go through the wiki and get set up to make one.