Open · YeonghyeonKO opened 1 month ago
I agree that these defaults are not practical. The only purpose of this tokenizer that I'm familiar with is speeding up prefix queries. This makes me wonder if we should have an even higher max gram size, e.g. 8 or 10.
In real-world Elasticsearch settings it is common to use values of 8 or 10 for MAX_NGRAM_SIZE, as @jpountz said. Of course, people may have different preferences, but I think it is not a good idea to have both the max and min defaults set to 1.
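For illustration, a more practical configuration at the Lucene level might look like the sketch below; the (1, 8) range and the sample input are assumptions for the example, not defaults taken from any project:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PracticalEdgeNGramDemo {
  public static void main(String[] args) throws Exception {
    // Assumed "practical" range: minGram = 1, maxGram = 8,
    // i.e. emit every prefix of up to 8 characters for prefix queries.
    try (EdgeNGramTokenizer tokenizer = new EdgeNGramTokenizer(1, 8)) {
      tokenizer.setReader(new StringReader("elasticsearch"));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        // Prints: e, el, ela, elas, elast, elasti, elastic, elastics
        System.out.println(term.toString());
      }
      tokenizer.end();
    }
  }
}
```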
Description
As of `org.apache.lucene:lucene-analysis-common:9.11.1`, the static variable `DEFAULT_MAX_GRAM_SIZE` of `EdgeNGramTokenizer` is ONE, not TWO. Logically, the maximum n-gram size only needs to be >= `minGramSize`, so this is not a bug, but it is NOT PRACTICAL: with both defaults at 1, the tokenizer emits nothing but the first character of its input.

Since many libraries (e.g. Elasticsearch and OpenSearch; see their source code) use `NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE` rather than `EdgeNGramTokenizer`'s constant, changing the latter should not break them. Will there be any dependency problem in the Lucene project as a result of my suggestion? See the code below:
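A minimal sketch of the behavior in question (not the snippet originally attached to the issue), assuming lucene-core and lucene-analysis-common 9.11.1 on the classpath:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EdgeNGramDefaultsDemo {
  public static void main(String[] args) throws Exception {
    // Current defaults: DEFAULT_MIN_GRAM_SIZE = 1, DEFAULT_MAX_GRAM_SIZE = 1,
    // so only the single leading character of the input is emitted.
    try (EdgeNGramTokenizer tokenizer =
        new EdgeNGramTokenizer(
            EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE,
            EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE)) {
      tokenizer.setReader(new StringReader("search"));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term.toString()); // prints only: s
      }
      tokenizer.end();
    }
  }
}
```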