MihaiValentin / lunr-languages

A collection of language stemmers and stopwords for the Lunr JavaScript library

Problem in Spanish: doesn't work if word isn't using accent mark. #59

Open jigarzon opened 4 years ago

jigarzon commented 4 years ago

Let's say I have an index created. The Spanish word "Respiración" is stemmed as "respir".

That's correct.

Now I run a search, but the user doesn't type the accent mark and enters "respiracion" (no accent on the last "o"). lunr won't stem that word and leaves it as "respiracion", so no matches are found.

I know stemming assumes the word is spelled correctly, BUT since almost no user types accents correctly when searching, this makes lunr useless for many words.

jigarzon commented 4 years ago

I made a workaround: removing accents before the stemmer in the pipeline (I remove accents with the normalize-strings package).

But this also throws away a lot of the benefit of stemming, because those words will never be stemmed.

// lunr is referenced below (lunr.Pipeline), so it has to be in scope
var lunr = require('lunr');
var normalize = require('normalize-strings');

var normalizeLunrPlugin = function(builder, stemmer) {
  var pipelineFunction = function(token) {
    return token.update(function(word) {
      var normalized = normalize(word);
      return normalized;
    });
  };

  // Register the pipeline function so the index can be serialised
  lunr.Pipeline.registerFunction(pipelineFunction, 'normalizeLunrPlugin');

  // Add the pipeline function to both the indexing pipeline and the
  // searching pipeline
  builder.pipeline.before(stemmer, pipelineFunction);
  builder.searchPipeline.before(stemmer, pipelineFunction);
};
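For reference, the accent removal this workaround relies on can be sketched without the extra dependency, assuming a runtime that supports `String.prototype.normalize`. The helper name `stripAccents` is made up for illustration and is not part of lunr, lunr-languages or normalize-strings.

```javascript
// Hypothetical helper, not part of lunr or lunr-languages.
// NFD decomposition splits "ó" into "o" plus a combining acute accent
// (U+0301); the regex then drops every combining mark in U+0300-U+036F.
// Caveat: this also turns "ñ" into "n", which may not be wanted in Spanish.
function stripAccents(word) {
  return word.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripAccents('Respiración'));
```

A function like this could stand in for `normalize(word)` inside the `token.update` callback above, with the same trade-off regarding stemming.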

jigarzon commented 4 years ago

My suggestion is to run two stemmers in the pipeline, one for accented words and one for unaccented words, so that the word "respiracion" without the accent, which the first stemmer leaves intact, is picked up by the second one and stemmed correctly...
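A rough sketch of that fallback idea, using toy suffix-stripping stemmers rather than the real Spanish stemmer: everything here (`makeToyStemmer`, `stemWithFallback`, the suffix lists) is hypothetical and only illustrates the control flow, not how lunr-languages would actually implement it.

```javascript
// Toy stand-ins for illustration only; the real Spanish stemmer is far
// more involved than stripping a couple of suffixes.
var ACCENTED_SUFFIXES = ['aciones', 'ación'];
var UNACCENTED_SUFFIXES = ['aciones', 'acion'];

// Builds a stemmer that removes the first matching suffix, if any.
function makeToyStemmer(suffixes) {
  return function (word) {
    for (var i = 0; i < suffixes.length; i++) {
      var suffix = suffixes[i];
      if (word.length > suffix.length && word.endsWith(suffix)) {
        return word.slice(0, -suffix.length);
      }
    }
    return word;
  };
}

var accentedStemmer = makeToyStemmer(ACCENTED_SUFFIXES);
var unaccentedStemmer = makeToyStemmer(UNACCENTED_SUFFIXES);

// Try the accent-aware stemmer first; if it leaves the word intact,
// hand the word to the stemmer built from unaccented suffixes.
function stemWithFallback(word) {
  var lower = word.toLowerCase();
  var stemmed = accentedStemmer(lower);
  return stemmed !== lower ? stemmed : unaccentedStemmer(lower);
}

console.log(stemWithFallback('Respiración'), stemWithFallback('respiracion'));
```

In this toy version both spellings end up with the same stem, which is the behaviour the suggestion is after; whether the same holds for the full stemmer would need to be checked against its actual rules.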