5hirish closed this issue 6 years ago
To choose among the stemmers available in ES, refer to: Choosing a stemmer in ES. To override the stop words list in ES, see: Using a custom stop words list. We can use the stop words from spaCy: All stopwords here
A good choice would be the Porter2 algorithm: Snowball Porter2
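As a rough sketch of how the two points above could fit together, the snippet below builds index settings for a custom analyzer that uses a custom stop words list and the Porter2 stemmer. The names `custom_stop`, `porter2_stemmer`, `custom_english` and the short stop word list are placeholders I made up for illustration; the real list would presumably be exported from spaCy.

```python
import json

# Hypothetical short stop word list, for illustration only. In practice this
# could be exported from spaCy instead of being hard-coded here.
CUSTOM_STOPWORDS = ["a", "an", "and", "the", "of", "to"]


def build_index_settings(stopwords):
    """Return index settings for a custom analyzer: lowercase -> stop -> Porter2."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # "stop" token filter with an explicit stop words list
                    "custom_stop": {"type": "stop", "stopwords": list(stopwords)},
                    # "stemmer" token filter set to the Porter2 (Snowball) algorithm
                    "porter2_stemmer": {"type": "stemmer", "language": "porter2"},
                },
                "analyzer": {
                    "custom_english": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "custom_stop", "porter2_stemmer"],
                    }
                },
            }
        }
    }


settings = build_index_settings(CUSTOM_STOPWORDS)
print(json.dumps(settings, indent=2))
```

The printed JSON is meant to be sent as the body of an index creation request. Lowercasing comes before the stop filter so that the lowercase stop words also match capitalized tokens.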
Should we enable ASCII folding? More on it here. If yes, should we store the original too?
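If we do enable it, one possible shape for the filter is sketched below. The `asciifolding` token filter has a `preserve_original` option that emits the original token alongside the folded one, which would answer the "store the original too" question. The names `folding_filter` and `folded_text` are placeholders, not anything already in the project.

```python
def build_folding_analysis(keep_original):
    """Return an analysis block with ASCII folding, optionally keeping originals."""
    return {
        "filter": {
            "folding_filter": {
                "type": "asciifolding",
                # When True, both the original and the folded token are emitted
                "preserve_original": bool(keep_original),
            }
        },
        "analyzer": {
            "folded_text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "folding_filter"],
            }
        },
    }


analysis_with_original = build_folding_analysis(keep_original=True)
analysis_folded_only = build_folding_analysis(keep_original=False)
```

Keeping the original costs extra tokens in the index, but it lets an exact accented query still match the accented form.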
The built-in English analyzer in Elasticsearch seems to use a weak stemmer (the Porter stemmer). A token like 'friendly' gets stemmed to 'friendli' and not 'friend'. A lemmatizer would actually be perfect in such use cases.
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Source
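To make the distinction concrete, here is a deliberately tiny toy comparison. It is not the Porter algorithm and not a real lemmatizer: the suffix list and the lemma table are invented for illustration. It only shows the difference between rule-based chopping and vocabulary lookup.

```python
# Hypothetical suffix list for a crude stemmer; longer suffixes are tried first.
SUFFIXES = ("ies", "ing", "es", "s")

# Hypothetical miniature vocabulary mapping inflected forms to their lemma.
LEMMAS = {"studies": "study", "mice": "mouse", "was": "be"}


def crude_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the vocabulary; fall back to the word itself."""
    return LEMMAS.get(word, word)


for word in ("studies", "mice", "was"):
    print(word, crude_stem(word), lemmatize(word))
# studies stud study
# mice mice mouse
# was was be
```

The stemmer produces a non-word ('stud') and misses irregular forms entirely, while the lookup returns dictionary forms but only for words it knows about.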