apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Should EdgeNGramTokenizer's DEFAULT_MAX_GRAM_SIZE be ONE? #13802

Open YeonghyeonKO opened 1 month ago

YeonghyeonKO commented 1 month ago

Description

In org.apache.lucene:lucene-analysis-common:9.11.1, the static field DEFAULT_MAX_GRAM_SIZE of EdgeNGramTokenizer is 1, not 2.

Logically, the maximum n-gram size only has to be >= minGramSize, so a value of 1 is not invalid, but it is not practical.

Many downstream projects (for example Elasticsearch and OpenSearch, per their source code) use NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE rather than EdgeNGramTokenizer's default. Would changing EdgeNGramTokenizer's default cause any dependency problems within the Lucene project?

See the code below:

public class EdgeNGramTokenizer extends NGramTokenizer {
    public static final int DEFAULT_MAX_GRAM_SIZE = 1;    /* How about changing '1' to '2'? */
    public static final int DEFAULT_MIN_GRAM_SIZE = 1;

    public EdgeNGramTokenizer(int minGram, int maxGram) {
        super(minGram, maxGram, true);
    }

    public EdgeNGramTokenizer(AttributeFactory factory, int minGram, int maxGram) {
        super(factory, minGram, maxGram, true);
    }
}
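For context, a minimal sketch (not part of the original report; the demo class name and the input string are made up) of what the current defaults produce against the Lucene 9.x token stream API. With minGram = maxGram = 1, the tokenizer emits only the first character of the input:

import java.io.StringReader;

import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EdgeNGramDefaultsDemo {
    public static void main(String[] args) throws Exception {
        // Current defaults: minGram = 1, maxGram = 1.
        try (EdgeNGramTokenizer tokenizer = new EdgeNGramTokenizer(
                EdgeNGramTokenizer.DEFAULT_MIN_GRAM_SIZE,
                EdgeNGramTokenizer.DEFAULT_MAX_GRAM_SIZE)) {
            tokenizer.setReader(new StringReader("apache"));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // With the current default of 1 this prints only "a";
                // with maxGram = 2 it would print "a" and "ap".
                System.out.println(term);
            }
            tokenizer.end();
        }
    }
}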
jpountz commented 1 month ago

I agree that these defaults are not practical. The only purpose of this tokenizer that I'm familiar with is speeding up prefix queries. This makes me wonder if we should have an even higher max gram size, e.g. 8 or 10.
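To illustrate the prefix-query use case, here is a rough, self-contained sketch (not from this thread; ByteBuffersDirectory, the field name "title", and maxGram = 8 are just illustrative choices, with 8 being one of the values floated above). Because every edge gram up to length 8 is indexed as a literal term, a prefix lookup such as "elast" can be answered with a plain TermQuery instead of a PrefixQuery:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class EdgeNGramPrefixDemo {
    public static void main(String[] args) throws Exception {
        // Index-time analyzer that emits edge grams of length 1..8.
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new EdgeNGramTokenizer(1, 8);
                return new TokenStreamComponents(source);
            }
        };

        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "elasticsearch", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit();

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // "e", "el", ..., "elastics" were indexed as literal terms,
                // so the prefix "elast" is matched by a cheap exact term lookup.
                int hits = searcher.count(new TermQuery(new Term("title", "elast")));
                System.out.println("hits for term 'elast': " + hits); // expected: 1
            }
        }
    }
}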

YeonghyeonKO commented 1 month ago

In real Elasticsearch settings, it is common to use values of 8 or 10 for MAX_NGRAM_SIZE, as @jpountz said. Of course preferences differ, but I don't think setting both the max and min values to 1 is a good idea.