MihaiValentin / lunr-languages

A collection of languages stemmers and stopwords for Lunr Javascript library

lunr.de demo with unexpected result for umlauts #41

Closed derplakatankleber closed 3 years ago

derplakatankleber commented 7 years ago

I tried your demo site "demo-browser-require.html", but I don't understand the results.

tests:

console.log('Search for günstige: ', idx.search('günstige')); // expected result size: 1, result: 1
console.log('Search for günstig*: ', idx.search('günstig*')); // expected result size: 1, result: 0
console.log('Search for g*nstig*: ', idx.search('g*nstig*')); // expected result size: 1, result: 1

source: https://rawgit.com/MihaiValentin/lunr-languages/master/demos/demo-browser-require.html

Did I misunderstand how to search for words with umlauts, or is it not possible to use wildcards when searching for words with umlauts?

khawkins98 commented 3 years ago

I also noticed this.

In #66, the approach of replacing the word-character definition with:

lunr.de.wordCharacters = "A-Za-züÜÄäÖöß0-9";

fixes wildcard support.

jonex2 commented 3 years ago

The workaround with

lunr.de.wordCharacters = "A-Za-züÜÄäÖöß0-9";

did not work. I opened a new issue.

khawkins98 commented 3 years ago

I also wound up changing approaches. I can dig up my code, but I believe what I did was:

  1. Convert the umlaut characters to their ae, ue versions
  2. Do the same for the passed search string

khawkins98 commented 3 years ago

Here it is. I basically create a mirror search index without international characters, so the user gets a match whether they type ü or u.

// receive a set of text and replace diacritics
// it's a poor man's multi-lingual
function normalizeText(searchIndex) {
  function replaceCharacters(string) {
    // default to an empty string (the parameter is reused, not redeclared)
    string = string || "";
    // map some common international characters to plain English letters
    string = string.replace(/\u00c4/g, "A");
    string = string.replace(/\u00dc/g, "U");
    string = string.replace(/\u00d6/g, "O");
    string = string.replace(/\u00fc/g, "u");
    string = string.replace(/\u00e4/g, "a");
    string = string.replace(/\u00f6/g, "o");
    string = string.replace(/\u00df/g, "s");
    string = string.replace(/ae/g, "a");
    string = string.replace(/ue/g, "u");
    string = string.replace(/oe/g, "o");
    string = string.replace(/ss/g, "s");
    string = string.replace(/á/g, "a");

    return string;
  }
  for (const item in searchIndex) {
    if (Object.prototype.hasOwnProperty.call(searchIndex, item)) {
      searchIndex[item].multilingualAlternate = replaceCharacters(searchIndex[item].lastName);
      searchIndex[item].multilingualAlternate += " " + replaceCharacters(searchIndex[item].firstName);
    }
  }
  return searchIndex;
}

I'm sure it's terrible for performance, but for our use case the dataset was small enough that it didn't matter.
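
The second step from my earlier comment (doing the same for the passed search string) is not shown in the snippet. A minimal sketch of how it could look, assuming a hypothetical standalone helper `foldUmlauts` that applies the same replacements as `replaceCharacters`, and a hypothetical lunr index `idx` built over the `multilingualAlternate` field:

```javascript
// Hypothetical helper (not part of lunr or lunr-languages): applies the same
// replacements as replaceCharacters, so it can be reused for the query string.
function foldUmlauts(text) {
  return (text || "")
    .replace(/\u00c4/g, "A") // Ä
    .replace(/\u00dc/g, "U") // Ü
    .replace(/\u00d6/g, "O") // Ö
    .replace(/\u00fc/g, "u") // ü
    .replace(/\u00e4/g, "a") // ä
    .replace(/\u00f6/g, "o") // ö
    .replace(/\u00df/g, "s") // ß
    .replace(/ae/g, "a")
    .replace(/ue/g, "u")
    .replace(/oe/g, "o")
    .replace(/ss/g, "s")
    .replace(/á/g, "a");
}

// Fold the user's input before searching; the wildcard "*" is left untouched.
var query = foldUmlauts("günstig*");

// Assumption: idx is a lunr index built from the normalized field.
// var results = idx.search(query);
```

Because the index text and the query go through the same replacements, "Müller", "Mueller" and "Muller" should all end up as the same term.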

jonex2 commented 3 years ago

@khawkins98 Thank you very much for the quick answer and your new workaround!