elastic / docs


Advise that ngram token filter acts on characters #96

Closed alexgarel closed 8 years ago

alexgarel commented 8 years ago

I think the (lack of) documentation for the ngram token filter is misleading. I was expecting this filter to create ngrams of consecutive tokens, not ngrams of the characters contained in each token.

I propose to add:

A token filter of type nGram. It creates ngrams from the sequences of characters contained in each token.

We could maybe add an example as follows:

With the whitespace tokenizer and a token filter with min_gram=2 and max_gram=3, "the house" will give: [th, he, the] [ho, ou, us, se, hou, ous, use]. Note that ngrams of the same word have the same position in the phrase, so the text above would match a match_phrase query on "th ous".
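
To make the behaviour concrete, here is a minimal Python sketch of character ngrams per token. It is only an illustration, not the Elasticsearch or Lucene implementation: the names `char_ngrams` and `analyze` are hypothetical, whitespace tokenization is approximated with `str.split()`, and the order in which the real filter emits grams may differ.

```python
def char_ngrams(token, min_gram, max_gram):
    """Return every character ngram of `token` with length in [min_gram, max_gram]."""
    grams = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the token.
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


def analyze(text, min_gram=2, max_gram=3):
    """Split on whitespace, then ngram each token.

    Every gram is paired with the position of the token it came from,
    mirroring the point that grams of one word share a position.
    """
    result = []
    for position, token in enumerate(text.split()):
        for gram in char_ngrams(token, min_gram, max_gram):
            result.append((position, gram))
    return result


if __name__ == "__main__":
    for position, gram in analyze("the house"):
        print(position, gram)
```

In this sketch "th" carries position 0 and "ous" carries position 1, which is the adjacency a phrase query on "th ous" relies on.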

alexgarel commented 8 years ago

Maybe a "see also" link to https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html would be useful!
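
For contrast, ngrams of consecutive tokens (what the shingle filter is for) can be sketched as follows. This is an assumption-laden illustration rather than the real filter: `word_shingles` is a hypothetical helper, tokens are split on whitespace, and unigram output and filler tokens are not modelled.

```python
def word_shingles(text, min_size=2, max_size=2):
    """Return word-level ngrams (shingles) built from consecutive tokens."""
    tokens = text.split()
    shingles = []
    for start in range(len(tokens)):
        for n in range(min_size, max_size + 1):
            # Only emit a shingle if n tokens are available from `start`.
            if start + n <= len(tokens):
                shingles.append(" ".join(tokens[start:start + n]))
    return shingles


if __name__ == "__main__":
    print(word_shingles("the quick brown fox"))
```

Here the unit being combined is the whole token, not the characters inside it.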

clintongormley commented 8 years ago

Hi @alexgarel

This docs issues list is only for issues with the docs build process. Issues like this one should be opened on the elasticsearch repo instead. That said, I'm working on a rewrite of the token filters docs regardless, so I'll be dealing with this issue when I get there anyway.

thanks

alexgarel commented 8 years ago

OK @clintongormley, that's cool. And sorry for the misuse.